S-052 Final Examination
1(a). The two main atypical cases are IDs 322 and 700.
For ID 322, the observed outcome, general reasoning (GENREAS), is 2.625 units, which is
1.490 units below the sample mean of 4.115 units, more than one standard deviation
(SD = 1.109). Although ID 322 is chronically ill and therefore expected to have a lower
GENREAS value than a healthy child, ID 322’s GENREAS value is also 1.118 units below the
GENREAS mean of chronically ill children in the sample (3.743), again more than one
standard deviation (SD = 1.011). ID 322 also
differs considerably in its values of the independent variables, age (AGE) and socioeconomic status
(SES), from the respective sample means. ID 322 is 190 months old, placing him/her in the 90-95th
percentile of the sample according to age. His/Her Hollingshead SES value is 5, placing ID 322 in
the 99th percentile of the sample according to socioeconomic status. Because of these extreme
values on both the Y and X axis variables, ID 322 exhibits a discrepancy value of -3.002 (sample
mean = -0.001; SD = 0.609) and a leverage value of 0.055 (sample mean = 0.021; SD = 0.008),
thus giving ID 322 the highest influence of all units in the sample. Its Cook’s D statistic is 0.131
(sample mean = 0.005, SD = 0.011). This value is visually represented in the lvr2plot on page 24
of the evidentiary materials by ID 322’s location in the far upper right corner of the plot and on
the Cook’s D versus Child Identification Code plot on page 25 of the evidentiary materials by ID
322’s isolation in the far upper left corner of the plot.
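The link between discrepancy, leverage, and influence can be checked with a short calculation. The sketch below uses the standard identity that Cook's D equals the squared internally studentized residual divided by the number of estimated coefficients, times h/(1 - h). It rests on two assumptions that the source does not state: that the reported discrepancy of -3.002 is an internally studentized residual, and that model I1 estimates four coefficients (`n_params = 4` is a guess). The function name `cooks_distance` is illustrative only.

```python
def cooks_distance(std_residual: float, leverage: float, n_params: int) -> float:
    """Cook's D from an internally studentized residual and a leverage (hat) value."""
    return (std_residual ** 2 / n_params) * (leverage / (1.0 - leverage))


# Discrepancy and leverage as reported for ID 322.
# n_params = 4 is an assumption, not a figure taken from the source.
d_322 = cooks_distance(std_residual=-3.002, leverage=0.055, n_params=4)
print(round(d_322, 3))
```

Under those assumptions the result agrees with the reported Cook's D of 0.131, which illustrates how a large discrepancy combined with above-average leverage produces high influence.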
ID 700, a healthy child, has a GENREAS value of 2.344 units, which is 1.771 units below the
sample mean, more than one standard deviation. Compared with the GENREAS mean for
healthy children only (4.512), ID 700 is 2.168 units lower, about two standard
deviations (SD = 1.074). ID 700 is 68 months old, placing him/her in the 5-10th percentile
of the sample according to age. ID 700 has a Hollingshead SES value of 4, placing him/her in the
90-95th percentile of the sample. Although ID 700 does not display atypical discrepancy when
viewing the appropriate tables and plots in the evidentiary materials (pp. 20-21), he/she does
display the most atypical leverage (0.613; SD = 0.008) as shown graphically on the lvr2plot on
page 24 of the evidentiary materials. This may not lead to sufficient influence on the estimated
coefficients in the regression model; however, high leverage can lead to unpredictable impacts
such as erratic SSE, MSE, RMSE, R², standard errors, t-statistics, and p-values, as well as erratic
impacts on hypothesis testing (unit 1c – slide 5).
1(b). As mentioned above, ID 700 does not have atypical influence as measured by the Cook’s
D statistic, and therefore, the estimated regression coefficients of model I1 are not sensitive to its
presence. In contrast, ID 322 has the most influence of all observed units, and hence, the estimated
regression coefficients of model I1 are sensitive to its presence. I would have liked to run a
sensitivity analysis that would have excluded these atypical cases from comparable regression
models in order to check the cases’ impacts on other model statistics; however, given the
information presented in the evidentiary materials, I can only conclude that model I1 is sensitive
to ID 322’s inclusion but not ID 700’s inclusion.
1(c). Since ID 322’s observed independent variable values are both considerably higher than
their respective sample means, and since its observed outcome variable value is considerably lower
than the sample mean (see response 1a for specific figures), the direction of ID 322’s influence on
the estimated coefficients for those independent variables in the fitted model is negative. In other
words, including ID 322 in the fitted model lowers the coefficients for both AGE and SES. One
can see in the bivariate plot on page 16 of the evidentiary materials that the position of ID 322 in
the far lower right corner would pull the line of best fit down towards it, negatively influencing
the AGE coefficient. This is also apparent in the bivariate plot of General Reasoning versus
Hollingshead SES on page 17 of the evidentiary materials. ID 322’s position in the far lower right
corner would pull the line of best fit down towards it, thus having an influence that would decrease
the coefficient of SES in the fitted regression model.
2.
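Intraclass correlation for model II1:
\[
\hat{\rho}_0 = \frac{\hat{\sigma}^2_{u,0}}{\hat{\sigma}^2_{u,0} + \hat{\sigma}^2_{e,0}} = \frac{0.361}{0.361 + 1.525} = 0.191
\]
Intraclass correlation for model II2B:
\[
\hat{\rho}_1 = \frac{\hat{\sigma}^2_{u,1}}{\hat{\sigma}^2_{u,1} + \hat{\sigma}^2_{e,1}} = \frac{0.230}{0.230 + 1.440} = 0.138
\]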
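The two intraclass correlations can be reproduced with a few lines of Python. This is a minimal sketch using only the variance components quoted above; the function name `intraclass_correlation` is illustrative and not taken from the course materials.

```python
def intraclass_correlation(var_between: float, var_within: float) -> float:
    """Share of total variance that lies between groups."""
    return var_between / (var_between + var_within)


rho_0 = intraclass_correlation(0.361, 1.525)  # model II1
rho_1 = intraclass_correlation(0.230, 1.440)  # model II2B
print(f"{rho_0:.3f} {rho_1:.3f}")
```

Note that a difference of 0.053 between the two correlations is the difference of the rounded values (0.191 and 0.138); computed from the unrounded ratios, the difference is closer to 0.054.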
Statistically, the intraclass correlation is the proportion of total variation that is attributable
to between-group differences. It can be estimated by dividing the between-group variance
(\(\hat{\sigma}^2_u\)) by the sum of the between-group variance and the within-group variance
(\(\hat{\sigma}^2_e\)). From model II1 to model II2B, the between-group variance falls by
\(\hat{\sigma}^2_{u,0} - \hat{\sigma}^2_{u,1} = 0.131\) (about 36% of its original value), while the
within-group variance falls by \(\hat{\sigma}^2_{e,0} - \hat{\sigma}^2_{e,1} = 0.085\) (about 6%).
Because the between-group variance shrinks proportionally more than the within-group variance,
the intraclass correlation also decreases (\(\hat{\rho}_0 - \hat{\rho}_1 = 0.053\)).
In substantive terms, adding the level-two control variable, STRICT, explains away more of the
variation in log dollars given between churches of different doctrines than adding the level-one
dichotomous age-group control variables explains away of the variation in log dollars given within
churches. Because more variation is explained away at the between-church level, the
proportion of total variation attributable to between-church differences shrinks.
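The comparison of how much variance is explained at each level can be made concrete by expressing each drop as a proportion of the baseline variance component. The sketch below uses the variance components of models II1 and II2B quoted in this answer; the helper name `proportional_reduction` is illustrative.

```python
def proportional_reduction(var_before: float, var_after: float) -> float:
    """Fraction of a variance component explained away between two nested models."""
    return (var_before - var_after) / var_before


between = proportional_reduction(0.361, 0.230)  # between-church variance
within = proportional_reduction(1.525, 1.440)   # within-church variance
print(f"between: {between:.1%}, within: {within:.1%}")
```

The between-church component falls by roughly a third while the within-church component falls by only a few percent, which is why the intraclass correlation shrinks.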
3(a). The GLH test here tests the null hypothesis that the coefficients on the five dichotomous
age-group variables in model II2B are jointly zero in the population, i.e., that including them
does not result in a significantly better fit than the more parsimonious model that omits them.
\[
H_0: \beta_{AGE28} = \beta_{AGE33} = \beta_{AGE38} = \beta_{AGE43} = \beta_{AGE48} = 0
\]
\[
H_a: \text{at least one of these coefficients does not equal zero}
\]
The evidentiary materials show that a \(\chi^2\) test with 5 degrees of freedom returns a
test statistic of 277.59 (p<0.001). At \(\alpha = 0.05\), I therefore reject the null hypothesis and
conclude that at least one of the coefficients on these dichotomous age-group variables is non-zero
in the population. In other words, age is a statistically significant predictor of giving in log
dollars, which warrants its inclusion in a model that attempts to estimate church giving in log
dollars. By including these dichotomous age-group variables, model II2B fits significantly better
than the nested, more parsimonious model that omits them.
3(b). Because model II2B is a random-effects model, we can make comparisons between groups
as well as within groups; therefore, the estimated coefficient on the predictor AGE48 can be
interpreted in the following way: In the population, church members in the 48-year-old category
give, on average, 1.284 more log dollars than church members in the 22-year-old category when
controlling for church doctrine.
4(a). We can use likelihood ratio tests to compare the goodness of fit of nested models.
To confirm that the two-way interaction of member age and church doctrine is required, I test the
null hypothesis that there is no difference in fit between model II3B and model II2B
by using a \(\chi^2\) test on the difference in the two models’ deviances, with degrees of freedom
equal to the difference in their numbers of estimated parameters:
\[
\chi^2_{(5)} = 15216.1 - 15194.0 = 22.1
\]
\[
\chi^2_{(df=5,\ \alpha=0.05)} = 11.07
\]
Because the \(\chi^2\) test statistic of 22.1 is greater than the critical value of 11.07 at the
\(\alpha = 0.05\) level with 5 degrees of freedom, I reject the null hypothesis and conclude that
model II3B provides a significantly better fit than model II2B.
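The likelihood ratio test above can be reproduced without statistical tables. The following is a minimal sketch that uses only the Python standard library: `chi2_sf` is a hypothetical helper (not from the course materials) that evaluates the upper-tail chi-square probability for integer degrees of freedom through the standard closed-form series, and the deviances are the two values quoted above.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum exp(-x/2) * sum_j (x/2)^j / j!
        term = math.exp(-half)
        total = term
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return total
    # Odd df: complementary error function plus a finite correction sum.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range((df - 1) // 2):
        total += term
        term *= x / (2 * j + 3)
    return total


deviance_reduced = 15216.1  # model II2B
deviance_full = 15194.0     # model II3B
lr_stat = deviance_reduced - deviance_full
p_value = chi2_sf(lr_stat, df=5)
print(f"LR statistic = {lr_stat:.1f}, p = {p_value:.5f}")
```

Working with the p-value rather than the critical value gives the same decision here, since a statistic above 11.07 on 5 degrees of freedom corresponds to p below 0.05.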
4(b). To illustrate the interpretation of the coefficient on the STRICTxAGE48 interaction, let us
choose two prototypical church members. Sally is a 48-year-old member of an evangelical church,
and Frank is a 48-year-old member of a non-evangelical church. By placing their prototypical
values into the fitted regression of model II3B, we can estimate their log-giving.
\[
\begin{aligned}
\widehat{LGIVING} = {} & 5.352 + 1.011\,STRICT + 0.431\,AGE28 + 0.911\,AGE33 + 1.048\,AGE38 \\
& + 1.386\,AGE43 + 1.456\,AGE48 - 0.000754\,STRICTxAGE28 \\
& - 0.179\,STRICTxAGE33 - 0.152\,STRICTxAGE38 - 0.543\,STRICTxAGE43 \\
& - 0.326\,STRICTxAGE48
\end{aligned}
\tag{1}
\]
\[
\widehat{LGIVING}_{Sally} = 5.352 + 1.011 + 1.456 - 0.326 = 7.493 \tag{2}
\]
\[
\widehat{LGIVING}_{Frank} = 5.352 + 1.456 = 6.808 \tag{3}
\]
\[
\widehat{LGIVING}_{Sally} - \widehat{LGIVING}_{Frank} = 7.493 - 6.808 = 0.685 \tag{4}
\]
As seen in equation (2), the STRICTxAGE48 interaction reduces the estimated difference in giving
in log dollars associated with being an evangelical church member (the 1.011 coefficient on the
variable STRICT) by 0.326, on average, in the population, for members in the 48-year-old category.
Equivalently, the interaction reduces the estimated difference associated with being in the
48-year-old age bracket (the 1.456 coefficient on AGE48) by 0.326, on average, in the
population, for evangelical church members.
Therefore, in the population, an evangelical church member in the 48-year-old category
gives, on average, 0.685, not 1.011, more log dollars than a non-evangelical church member in the
48-year-old category, as shown in equation (4).
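The prototypical predictions for Sally and Frank can be recomputed directly from the fitted coefficients of model II3B given in equation (1). This is an illustrative sketch; the dictionary layout and the helper name `predicted_lgiving` are assumptions made for the example, not part of the original analysis.

```python
from typing import Optional

# Fitted coefficients of model II3B as reported in equation (1).
COEF = {
    "const": 5.352, "STRICT": 1.011,
    "AGE28": 0.431, "AGE33": 0.911, "AGE38": 1.048, "AGE43": 1.386, "AGE48": 1.456,
    "STRICTxAGE28": -0.000754, "STRICTxAGE33": -0.179, "STRICTxAGE38": -0.152,
    "STRICTxAGE43": -0.543, "STRICTxAGE48": -0.326,
}


def predicted_lgiving(strict: int, age_group: Optional[str] = None) -> float:
    """Fitted log giving; age_group=None means the omitted reference age category."""
    value = COEF["const"] + COEF["STRICT"] * strict
    if age_group is not None:
        # Main effect of the age dummy plus its interaction with STRICT.
        value += COEF[age_group] + COEF["STRICTx" + age_group] * strict
    return value


sally = predicted_lgiving(strict=1, age_group="AGE48")
frank = predicted_lgiving(strict=0, age_group="AGE48")
print(round(sally, 3), round(frank, 3), round(sally - frank, 3))
```

The difference between the two predictions equals the STRICT coefficient plus the STRICTxAGE48 coefficient (1.011 - 0.326), which is the sense in which the interaction modifies the STRICT effect for this age group.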
5(a). This model is the total-regression model. Although it is sufficient to estimate individual-
level coefficients, it ignores group membership, thus leading to correlation among residuals. This
is not the case in a random-effects model, which assumes that residuals at both the individual-level
and the group-level are independent and normally distributed.
5(b). This model is a fixed-effects model. This model does consider group membership.
However, unlike the random-effects model, the fixed-effects model accounts for group-level
residuals by estimating multiple, group-specific intercepts. Consequently, inferences about
coefficients on level-two variables may not be made; therefore, comparisons between different
groups may not be made. As mentioned earlier, a random-effects model, in contrast, assumes that
group-level population residuals are normally distributed and independent; hence, it is possible to
infer level-two variable coefficients and compare different groups.
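6(a).
\[
\operatorname{logit}\big(P(\text{COLLEGE} = 1)\big) = \beta_0 + \beta_1\,\text{MINORITY} + \beta_2\,\text{SES} + \beta_3\,\text{MINxSES}
\]
6(b).
\[
\operatorname{logit}\big(P(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0,\ \text{SES} = 1)\big) = \beta_0 + \beta_2
\]
\[
\operatorname{logit}\big(P(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1,\ \text{SES} = 1)\big) = \beta_0 + \beta_1 + \beta_2 + \beta_3
\]
Setting the two log-odds equal under the null hypothesis gives:
\[
\beta_0 + \beta_2 = \beta_0 + \beta_1 + \beta_2 + \beta_3
\]
\[
0 = \beta_1 + \beta_3
\]
I test that \(\beta_1 + \beta_3 = 0\). According to the first GLH test result on page 52 of the
evidentiary materials, the \(\chi^2\) test statistic is 18.86 (p<0.001). I, therefore, reject the
null hypothesis that there is no difference in the log-odds of college-going between minority and
non-minority adolescent males of low socioeconomic status (SES=1), on average, in the population.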
6(c).
\[
\operatorname{logit}\big(P(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0,\ \text{SES} = 4.5)\big) = \beta_0 + 4.5\,\beta_2
\]
\[
\operatorname{logit}\big(P(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1,\ \text{SES} = 4.5)\big) = \beta_0 + \beta_1 + 4.5\,\beta_2 + 4.5\,\beta_3
\]
Setting the two log-odds equal under the null hypothesis gives:
\[
\beta_0 + 4.5\,\beta_2 = \beta_0 + \beta_1 + 4.5\,\beta_2 + 4.5\,\beta_3
\]
\[
0 = \beta_1 + 4.5\,\beta_3
\]
Here, I test that \(\beta_1 + 4.5\,\beta_3 = 0\). According to the second GLH test result on page 54
of the evidentiary materials, the \(\chi^2\) test statistic is 12.10 (p<0.001). I, therefore, reject
the null hypothesis that there is no difference in the log-odds of college-going between minority
and non-minority adolescent males of high socioeconomic status (SES=4.5), on average, in the population.
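Both GLH tests concern the same linear contrast, the minority versus non-minority gap in log-odds at a given SES. As a rough illustration, the sketch below evaluates that contrast at the fitted coefficient values reported for this model (1.444 on MINORITY and -0.482 on MINxSES) and converts the reported one-degree-of-freedom test statistics into p-values with the standard library. The function names are illustrative, and the test statistics themselves are taken from the evidentiary materials rather than recomputed.

```python
import math

B1_MINORITY = 1.444   # fitted coefficient on MINORITY
B3_MINXSES = -0.482   # fitted coefficient on MINxSES


def minority_gap_logit(ses: float) -> float:
    """Estimated minority minus non-minority difference in log-odds at a given SES."""
    return B1_MINORITY + B3_MINXSES * ses


def p_value_chi2_1df(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


for ses, stat in ((1.0, 18.86), (4.5, 12.10)):
    gap = minority_gap_logit(ses)
    print(f"SES={ses}: gap in log-odds = {gap:.3f}, p = {p_value_chi2_1df(stat):.6f}")
```

The estimated gap is positive at SES=1 and negative at SES=4.5, so the two rejected null hypotheses correspond to differences in opposite directions.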
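7(a).
Fitted model in logit space:
\[
\operatorname{logit}\big(\hat{P}(\text{COLLEGE} = 1)\big) = -4.838 + 1.444\,\text{MINORITY} + 1.502\,\text{SES} - 0.482\,\text{MINxSES}
\]
Fitted model in probability space:
\[
\hat{P}(\text{COLLEGE} = 1) = \frac{1}{1 + e^{-(-4.838 + 1.444\,\text{MINORITY} + 1.502\,\text{SES} - 0.482\,\text{MINxSES})}}
\]
Estimated log-odds and probability of a non-minority adolescent male of low socioeconomic status
going to college:
\[
\operatorname{logit}\big(\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0,\ \text{SES} = 1)\big) = -3.336
\]
\[
\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0,\ \text{SES} = 1) = 0.034
\]
Estimated log-odds and probability of a minority adolescent male of low socioeconomic status
going to college:
\[
\operatorname{logit}\big(\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1,\ \text{SES} = 1)\big) = -2.374
\]
\[
\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1,\ \text{SES} = 1) = 0.085
\]
Estimated odds-ratio for a minority adolescent male of low socioeconomic status going to college
versus a non-minority adolescent male of low socioeconomic status going to college:
\[
\frac{\text{Odds}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1,\ \text{SES} = 1)}{\text{Odds}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0,\ \text{SES} = 1)}
= \frac{0.085 / (1 - 0.085)}{0.034 / (1 - 0.034)}
= \frac{0.093}{0.036}
= 2.617
\]
(The ratio 2.617 is computed from the unrounded odds; equivalently, it equals \(e^{1.444 - 0.482} = e^{0.962}\).)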
7(b).
Fitted model in logit space:
\[
\operatorname{logit}\big(\hat{P}(\text{COLLEGE} = 1)\big) = -4.838 + 1.444\,\text{MINORITY} + 1.502\,\text{SES} - 0.482\,\text{MINxSES}
\]
Fitted model in probability space:
\[
\hat{P}(\text{COLLEGE} = 1) = \frac{1}{1 + e^{-(-4.838 + 1.444\,\text{MINORITY} + 1.502\,\text{SES} - 0.482\,\text{MINxSES})}}
\]
Estimated log-odds and probability of a non-minority adolescent male of high socioeconomic
status going to college:
\[
\operatorname{logit}\big(\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0,\ \text{SES} = 4.5)\big) = 1.921
\]
\[
\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0,\ \text{SES} = 4.5) = 0.872
\]
Estimated log-odds and probability of a minority adolescent male of high socioeconomic status
going to college:
\[
\operatorname{logit}\big(\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1,\ \text{SES} = 4.5)\big) = 1.196
\]
\[
\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1,\ \text{SES} = 4.5) = 0.768
\]
Estimated odds-ratio for a minority adolescent male of high socioeconomic status going to college
versus a non-minority adolescent male of high socioeconomic status going to college:
\[
\frac{\text{Odds}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1,\ \text{SES} = 4.5)}{\text{Odds}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0,\ \text{SES} = 4.5)}
= \frac{0.768 / (1 - 0.768)}{0.872 / (1 - 0.872)}
= \frac{3.307}{6.828}
= 0.484
\]
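The fitted log-odds, probabilities, and odds ratios in 7(a) and 7(b) can all be generated from the four fitted coefficients. The following is a minimal sketch using the standard library; the function names are illustrative, and the odds ratio is computed as the exponentiated difference in fitted logits, which avoids rounding the intermediate odds.

```python
import math

# Fitted coefficients of the logistic model: constant, MINORITY, SES, MINxSES.
B0, B_MIN, B_SES, B_INT = -4.838, 1.444, 1.502, -0.482


def fitted_logit(minority: int, ses: float) -> float:
    """Fitted log-odds of going to college."""
    return B0 + B_MIN * minority + B_SES * ses + B_INT * minority * ses


def inv_logit(z: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratio_minority(ses: float) -> float:
    """Odds of college for minority relative to non-minority males at a given SES."""
    return math.exp(fitted_logit(1, ses) - fitted_logit(0, ses))


for ses in (1.0, 4.5):
    p_non = inv_logit(fitted_logit(0, ses))
    p_min = inv_logit(fitted_logit(1, ses))
    print(f"SES={ses}: P(non-minority)={p_non:.3f}, P(minority)={p_min:.3f}, "
          f"OR={odds_ratio_minority(ses):.3f}")
```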
7(c). The probability of going to college differs between white and minority adolescent males,
and the size and direction of that difference depend on socioeconomic status. As expected,
college-going rates increase for both white and minority male adolescents as their socioeconomic
status increases. However, there are some surprising racial differences in how these rates change.
Among male students of low socioeconomic background (SES=1), the estimated
probability of going to college is greater for minority students (8.5%) than for whites (3.4%). In
fact, on average, the odds of a minority adolescent male’s going to college are higher by a factor
of 2.617 than the odds of a white adolescent male’s going to college when the
students are of low socioeconomic level. As socioeconomic status increases, however, the
difference between their estimated probabilities decreases, as seen in the narrowing space between
the solid and dashed curves between SES values of 1 and 3 in the graph on page 51 of the
evidentiary materials. At about SES=3, the fitted curves cross, and beyond that point the estimated
probability of going to college is higher for whites than for minority adolescent males. For
instance, at a high socioeconomic level (SES=4.5), the probability of a minority adolescent male’s
going to college is 76.8% while that of a white adolescent male’s going to college is 87.2%. At
SES=4.5, the odds ratio of a minority adolescent male’s going to college versus a white adolescent
male’s going to college is only 0.484.
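The SES value at which the two fitted curves cross can be worked out from the fitted coefficients, since the curves meet where the minority and non-minority log-odds are equal. A short derivation, using the fitted values 1.444 (MINORITY) and -0.482 (MINxSES) and treating SES as continuous (an assumption made only for this illustration):

\[
\hat{\beta}_1 + \hat{\beta}_3\,\text{SES}^{*} = 0
\quad\Longrightarrow\quad
\text{SES}^{*} = -\frac{\hat{\beta}_1}{\hat{\beta}_3} = \frac{1.444}{0.482} \approx 3.0
\]

Below this value the fitted probability is higher for minority adolescent males; above it, the fitted probability is higher for white adolescent males. Because the coefficients are rounded to three decimals, the crossing point is best read as approximately 3 rather than as an exact figure.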
In order to ensure that these college-going differences existed in the population and not
only in the sample, I utilized a couple of post-hoc GLH tests. The first tested the null hypothesis
that there is no difference in the population between the log-odds of a white and minority student’s
going to college at a low socioeconomic level (SES=1). The second tested the null hypothesis that
there is no difference in the population between the log-odds of a white and minority student’s
going to college at a high socioeconomic level (SES=4.5). In both tests, the null hypothesis was
rejected:
Test 1 at SES=1: \(\chi^2_{(1)} = 18.86\) (p < 0.001)
Test 2 at SES=4.5: \(\chi^2_{(1)} = 12.10\) (p < 0.001)
I, therefore, concluded that there are indeed significant differences in the population between the
log-odds of white and minority adolescent males’ going to college at both low and high
socioeconomic levels, not just in the sample.
8(a).
Week   Fitted Hazard Logit   Fitted Hazard Probability   Fitted Survival Probability
2      -1.619                0.165342829                 0.834657171
3      -1.6982               0.154700502                 0.705535288
4      -2.042                0.114863236                 0.624495222
5      -2.564                0.071491565                 0.579849081
6      -2.583                0.070240558                 0.539120159
8(b). To find the fitted hazard logit for week 2, I summed the constant (-1.682) and the
coefficient for FEMALE (0.0630) since the constant serves as the reference for both the
coefficients on the week dummy variables and the female variable. This generated a value of
-1.619. Then to find the fitted hazard probability, I used the following equation:
\[
p = \frac{1}{1 + e^{-\text{logit}}} = \frac{1}{1 + e^{-(-1.619)}} = 0.165
\]
To find the fitted survival probability, I simply subtracted the fitted hazard probability from 1,
giving me 0.835.
In order to determine the fitted hazard logit for week 3, I summed the constant (-1.682),
the coefficient for FEMALE (0.0630), and the coefficient for the week 3 dummy variable
(-0.0792). This gave me a value of -1.6982. I then calculated the fitted hazard probability by
plugging this value into the logit to probability conversion equation:
\[
p = \frac{1}{1 + e^{-\text{logit}}} = \frac{1}{1 + e^{-(-1.6982)}} = 0.155
\]
Finally, to find the fitted survival probability of week 3, I multiplied the difference between 1 and
the fitted hazard probability of week 3 with the fitted survival probability of week 2:
\[
(1 - 0.155) \times 0.835 = 0.706
\]
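The procedure described above (convert each fitted hazard logit to a hazard probability, then multiply the complements cumulatively) can be sketched in a few lines. The fitted logits below are the ones listed in the table for 8(a); the function name `survival_curve` is illustrative.

```python
import math

# Fitted hazard logits for female students, by relative week (from the 8(a) table).
FITTED_HAZARD_LOGITS = {2: -1.619, 3: -1.6982, 4: -2.042, 5: -2.564, 6: -2.583}


def survival_curve(logits_by_week: dict) -> dict:
    """Map each week to (hazard probability, cumulative survival probability)."""
    survival = 1.0
    curve = {}
    for week in sorted(logits_by_week):
        hazard = 1.0 / (1.0 + math.exp(-logits_by_week[week]))
        survival *= 1.0 - hazard  # survive this week given survival so far
        curve[week] = (hazard, survival)
    return curve


curve = survival_curve(FITTED_HAZARD_LOGITS)
for week, (hazard, survival) in curve.items():
    print(f"week {week}: hazard = {hazard:.3f}, survival = {survival:.3f}")
```

Carrying the unrounded survival probability forward from week to week, as this loop does, avoids compounding rounding error across later weeks.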
8(c). I estimate the Kaplan-Meier 67th percentile survival time to fall during the fourth relative
week of enrollment. The fitted survival probability for female students is 70.6% at week 3
(i.e., 70.6% of female students are estimated to log activity beyond their third week of
enrollment) and 62.4% at week 4. Because the fitted survival probability first drops below 67%
between these two weeks, it is reasonable to estimate that sometime in their fourth week of
enrollment, 33% of female students who enrolled in the first three absolute weeks of the course
and who did not drop out in their first relative week will have logged their last activity.
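Locating a percentile survival time amounts to finding the first week at which the fitted survival probability falls to or below the target. A minimal sketch, using the rounded survival probabilities from the 8(a) table; the helper name `first_week_at_or_below` is illustrative.

```python
from typing import Dict, Optional

# Fitted survival probabilities for female students, by relative week (rounded).
SURVIVAL = {2: 0.835, 3: 0.706, 4: 0.624, 5: 0.580, 6: 0.539}


def first_week_at_or_below(survival_by_week: Dict[int, float],
                           threshold: float) -> Optional[int]:
    """First week whose survival probability is at or below the threshold, if any."""
    for week in sorted(survival_by_week):
        if survival_by_week[week] <= threshold:
            return week
    return None  # the curve never reaches the threshold in the observed weeks


print(first_week_at_or_below(SURVIVAL, 0.67))
```

The same lookup with a threshold of 0.50 returns `None`, since the fitted survival probability is still about 0.54 at week 6, so a median survival time cannot be located within these weeks.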
9(a). Whereas principal component analysis weights items in order to maximize variance and
factor analysis weights items accounting for their reliability, sum-score simply weights each item
equally. Consequently, one would not be able to take advantage of inter-item correlations or item
reliability to help explain the story behind the items. For example, in the Gambia test, items that
indicate ownership of expensive products like a good roof or car may have been intentionally
included under a theoretical motivation to measure a family’s overall net worth by placing more
“importance” in the ownership of these expensive products. If a sum-score method were utilized,
the items indicating ownership of expensive products would be weighted equally to items that
indicate ownership of inexpensive products (such as radios), thus contradicting the original intent
of the researchers. Even if it turned out that the researchers’ theoretical assumptions in the creation
and use of their items were misguided, such as the case of the low weight placed on CAR in the
principal components analysis of the Gambia context, using a sum-score method would not
identify those errors, unlike a principal components analysis or factor analysis.
Moreover, because one does not account for item reliability by using a sum-score,
replication of items is not possible, therefore leaving researchers with only one version of the test.
Having only one version means that researchers may not really know what their items are asking.
We are discovering more and more the challenges of composing test items when it comes to
construct validity. Often we cannot be certain what it is we are measuring with a specific item.
Only through replication of items can we refine processes that attempt to ensure the construct
validity of test items.
9(b). According to the principal components table on page 66 of the evidentiary materials, the
indicators RADIO and CAR have the lowest weights (0.2345 and 0.1679, respectively) among all
9 indicators in component 1. This is most likely due to the low item-rest correlations for RADIO
and CAR (0.2967 and 0.1783, respectively) as illustrated in the first table on page 65 of the
evidentiary materials. If we consider RADIO’s low item-rest correlation as due to a radio’s
obsolescence and CAR’s low item-rest correlation as due to an automobile’s expensive cost, we
may then interpret component 1 as a measurement of a family’s preference to spend money in
order to save money. All the other indicators are weighted similarly, and all of them in some way
or another save money or generate more money. All the structural improvements, of course,
increase the value of the family’s home. The TV and the refrigerator also save money in that a
family can stay home for entertainment and preserve their food for longer periods of time. In
contrast, a radio may be inexpensive and entertaining, but its usefulness is increasingly outdated;
hence, its weight is positive yet smaller than the weights of the aforementioned items. A car may
save some money when traveling, but overall, it is a drain on financial resources due to
maintenance costs and depreciation; therefore, its weight is the smallest of all the indicators.
In contrast, RADIO and CAR have the two heaviest weights in component 2 among all the
indicators. Indicators for ownership of a television and refrigerator are also positively weighted
while indicators for possession of structural improvements are negatively weighted in this
component. As a result, I interpret this component as a measure of a family’s ability to pay for
energy, which I believe to be an interpretation uncorrelated with that of component 1. Televisions,
refrigerators, and radios need electricity to operate. Automobiles need gasoline. I assume that only
families that can afford to pay for these energy sources would also buy these products, lest they
become useless.
I used the third table on page 65 and the scree plot on page 66 of the evidentiary materials
to assist me in constructing a bi-dimensional basis for measuring a Gambian family’s financial
position. The table demonstrates that component 1 has an eigenvalue of 5.488, which accounts for
60.98% of the original, standardized variance of the nine indicators (\(5.488 / 9 = 0.6098\)).
Component 2 has an eigenvalue of 1.453, meaning that it accounts for 1.453 of the remaining 3.512
units (\(9 - 5.488 = 3.512\)) of standardized variance. According to the table, component 2
accounts for 16.14% of the original standardized variance. Together, components 1 and 2 account
for 77.12% of the original standardized variance.
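The variance shares follow from dividing each eigenvalue by the number of standardized indicators, since each standardized indicator contributes one unit of variance. A small sketch using only the two eigenvalues quoted above (the remaining seven are not reported here, so they are not included):

```python
N_INDICATORS = 9              # standardized indicators, one unit of variance each
EIGENVALUES = [5.488, 1.453]  # components 1 and 2, as reported

proportions = [ev / N_INDICATORS for ev in EIGENVALUES]
cumulative = sum(proportions)
retained_by_rule_of_one = [ev > 1.0 for ev in EIGENVALUES]

print([f"{p:.2%}" for p in proportions], f"{cumulative:.2%}")
```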
While inspecting the scree plot, I notice that the “elbow rule” directs me to keep the first
two components in constructing my basis for measuring a family’s financial position in the Gambia
(Note: I consider the “elbow” to be at the 3rd component, and so I choose to keep the first two
components). Furthermore, this decision is supported by the “rule of one”; the eigenvalues of
components 1 and 2 are both greater than one. Therefore, my final principal components model
has a bidimensional basis for measuring Gambian families’ financial positions.

Applied Data Analysis Final

  • 1. S-052 Final Examination 1 1(a). The two main atypical cases are those of ID 322 and 700. For ID 322, the observed outcome variable of general reasoning (GENREAS) has a value of 2.625 units, which differs from the sample mean of 4.115 units by 1.490 units, by more than one standard deviation from the mean (SD = 1.109). In fact, although ID 322 is chronically ill and therefore expected to have a lower GENREAS value than that of a healthy child, ID 322 still exhibits a GENREAS value 1.118 units lower than the GENREAS mean of chronically ill children in the sample (3.743), once again, by over one standard deviation lower (SD = 1.011). ID 322 also differs considerably in its values of the independent variables, age (AGE) and socioeconomic status (SES), from the respective sample means. ID 322 is 190 months old, placing him/her in the 90-95th percentile of the sample according to age. His/Her Hollingshead SES value is 5, placing ID 322 in the 99th percentile of the sample according to socioeconomic status. Because of these extreme values on both the Y and X axis variables, ID 322 exhibits a discrepancy value of -3.002 (sample mean = -0.001; SD = 0.609) and a leverage value of 0.055 (sample mean = 0.021; SD = 0.008), thus giving ID 322 the highest influence of all units in the sample. Its Cook’s D statistic is 0.131 (sample mean = 0.005, SD = 0.011). This value is visually represented in the lvr2plot on page 24 of the evidentiary materials by ID 322’s location in the far upper right corner of the plot and on the Cook’s D versus Child Identification Code plot on page 25 of the evidentiary materials by ID 322’s isolation in the far upper left corner of the plot. ID 700, a healthy child, has a GENREAS value of 2.344 units, which differs from the sample mean by 1.771 units, by more than one standard deviation. Compared to the GENREAS mean for only healthy children (4.512), ID 700 still differs by 2.168 units, again, by more than one standard deviation (SD=1.074). 
ID 700 is 68 months old, placing him/her in the 5-10th percentile of the sample according to age. ID 700 has a Hollingshead SES value of 4, placing him/her in the 90-95th percentile of the sample. Although ID 700 does not display atypical discrepancy in the relevant tables and plots of the evidentiary materials (pp. 20-21), he/she does display the most atypical leverage (0.613; SD = 0.008), as shown graphically on the lvr2plot on page 24 of the evidentiary materials. High leverage alone may not translate into substantial influence on the estimated coefficients of the regression model; however, it can have unpredictable effects on the SSE, MSE, RMSE, R², standard errors, t-statistics, and p-values, and therefore on hypothesis testing (unit 1c – slide 5).
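The "more than one standard deviation" comparisons in 1(a) can be checked with a quick standardized-distance calculation. The following is a minimal sketch that uses only the means and SDs quoted in the response; the helper name `sd_units` is hypothetical and is not part of the original analysis.

```python
def sd_units(value, mean, sd):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / sd

# Figures quoted in response 1(a); nothing here comes from the raw data.
id322_vs_sample = sd_units(2.625, 4.115, 1.109)   # ID 322 vs. full sample
id322_vs_ill = sd_units(2.625, 3.743, 1.011)      # ID 322 vs. chronically ill children
id700_vs_sample = sd_units(2.344, 4.115, 1.109)   # ID 700 vs. full sample
id700_vs_healthy = sd_units(2.344, 4.512, 1.074)  # ID 700 vs. healthy children

for label, z in [
    ("ID 322 vs sample", id322_vs_sample),
    ("ID 322 vs chronically ill", id322_vs_ill),
    ("ID 700 vs sample", id700_vs_sample),
    ("ID 700 vs healthy", id700_vs_healthy),
]:
    print(f"{label}: {z:.2f} SD")
```

Under these figures, every comparison is more than one SD below the relevant mean, and ID 700 sits roughly two SDs below the mean for healthy children.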
  • 2. 1(b). As mentioned above, ID 700 does not have atypical influence as measured by the Cook’s D statistic, and therefore the estimated regression coefficients of model I1 are not sensitive to its presence. In contrast, ID 322 has the most influence of all observed units, and hence the estimated regression coefficients of model I1 are sensitive to its presence. I would have liked to run a sensitivity analysis that refit comparable regression models with these atypical cases excluded in order to check the cases’ impacts on other model statistics; however, given the information presented in the evidentiary materials, I can only conclude that model I1 is sensitive to ID 322’s inclusion but not to ID 700’s inclusion.
    1(c). Since ID 322’s observed independent variable values are both considerably higher than their respective sample means, and since its observed outcome variable value is considerably lower than the sample mean (see response 1a for specific figures), the direction of ID 322’s influence on the estimated coefficients for those independent variables in the fitted model is negative. In other words, including ID 322 in the fitted model lowers the coefficients for both AGE and SES. One can see in the bivariate plot on page 16 of the evidentiary materials that the position of ID 322 in the far lower right corner would pull the line of best fit down towards it, negatively influencing the AGE coefficient. The same is apparent in the bivariate plot of General Reasoning versus Hollingshead SES on page 17 of the evidentiary materials: ID 322’s position in the far lower right corner would pull the line of best fit down towards it, thus decreasing the coefficient of SES in the fitted regression model.
    2.
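Intraclass correlation for model II1: $\hat{\rho}_0 = \dfrac{\hat{\sigma}_{u,0}^2}{\hat{\sigma}_{u,0}^2 + \hat{\sigma}_{e,0}^2} = \dfrac{0.361}{0.361 + 1.525} = 0.191$
Intraclass correlation for model II2B: $\hat{\rho}_1 = \dfrac{\hat{\sigma}_{u,1}^2}{\hat{\sigma}_{u,1}^2 + \hat{\sigma}_{e,1}^2} = \dfrac{0.230}{0.230 + 1.440} = 0.138$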
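The two intraclass correlations above can be reproduced directly from the reported variance components. This is a small sketch under the assumption that 0.361/1.525 and 0.230/1.440 are the between-church and within-church variance estimates for models II1 and II2B; the function name is illustrative only.

```python
def intraclass_correlation(var_between, var_within):
    """Share of total variance attributable to between-group differences."""
    return var_between / (var_between + var_within)

rho_ii1 = intraclass_correlation(0.361, 1.525)   # model II1
rho_ii2b = intraclass_correlation(0.230, 1.440)  # model II2B

print(f"ICC, model II1:  {rho_ii1:.3f}")
print(f"ICC, model II2B: {rho_ii2b:.3f}")
print(f"Change in ICC:   {rho_ii1 - rho_ii2b:.3f}")
```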
  • 3. Statistically, the intraclass correlation is the proportion of total variation that is attributable to between-group differences. It can be estimated by dividing the between-group variance ($\hat{\sigma}_u^2$) by the sum of the between-group variance and the within-group variance ($\hat{\sigma}_e^2$). Since the decrease in between-group variance ($\hat{\sigma}_{u,0}^2 - \hat{\sigma}_{u,1}^2 = 0.131$, a drop of about 36%) is greater than that in within-group variance ($\hat{\sigma}_{e,0}^2 - \hat{\sigma}_{e,1}^2 = 0.085$, a drop of about 6%) from model II1 to model II2B, the resulting proportion represented by the intraclass correlation ($\hat{\rho}$) also decreases ($\hat{\rho}_0 - \hat{\rho}_1 = 0.053$). In substantive terms, the addition of the level-two control variable STRICT explains more of the variation in log dollars given between churches of different doctrines than the addition of the level-one dichotomous age control variables explains of the variation in log dollars given within churches. Therefore, since more variation is explained at the between-church level, the proportion of total variation attributable to between-church differences shrinks.
    3(a). The GLH test here tests the null hypothesis that the inclusion of these dichotomous age-group variables in model II2B does not result in a significantly better fit than the more parsimonious model that omits them.
    $H_0: \beta_{AGE28} = 0$ and $\beta_{AGE33} = 0$ and $\beta_{AGE38} = 0$ and $\beta_{AGE43} = 0$ and $\beta_{AGE48} = 0$
    $H_a$: at least one of these coefficients does not equal zero
    The evidentiary materials illustrate that a $\chi^2$ test at $\alpha = 0.05$ with 5 degrees of freedom returns a test statistic of 277.59 (p<0.001). I, therefore, reject the null hypothesis and conclude that at least one of the coefficients on these dichotomous age-group variables is non-zero in the population. In other words, age is a statistically significant predictor of giving in log dollars, thus warranting inclusion in a model that attempts to estimate church giving in log dollars.
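To see how far into the tail a statistic of 277.59 on 5 degrees of freedom lies, the upper-tail probability can be computed by hand. The sketch below is not the routine used to produce the evidentiary materials; it relies on the closed form that exists for a chi-square distribution with 5 degrees of freedom, and the function name `chi2_sf_df5` is hypothetical. The value 11.07 is the standard tabulated 5% critical value for 5 degrees of freedom and is included only as a sanity check.

```python
import math

def chi2_sf_df5(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 5 df.

    Closed form for 5 df:
    erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * (sqrt(x) + x**1.5 / 3)
    """
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2))
    correction = math.sqrt(2 / math.pi) * math.exp(-x / 2) * (
        math.sqrt(x) + x ** 1.5 / 3
    )
    return tail + correction

p_glh = chi2_sf_df5(277.59)       # statistic reported for the GLH test
p_critical = chi2_sf_df5(11.07)   # should be close to 0.05

print(f"p-value for 277.59 on 5 df: {p_glh:.3g}")
print(f"tail area beyond 11.07:     {p_critical:.4f}")
```

The same helper applies to any other test statistic referred to a chi-square distribution with 5 degrees of freedom.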
By including these dichotomous age-group variables, model II2B fits significantly better than the nested, more parsimonious model that omits them. 3(b). Because model II2B is a random-effects model, we can make comparisons between groups as well as within groups; therefore, the estimated coefficient on the predictor AGE48 can be interpreted in the following way: In the population, church members in the 48-year-old category
  • 4. give, on average, 1.284 more log dollars than church members in the 22-year-old category when controlling for church doctrine.
    4(a). We can use likelihood ratio tests to compare the “goodness of fit” of nested models. To confirm that the two-way interaction of member age and church doctrine is required, I test the null hypothesis that there is no significant difference in the fit of model II3B to that of model II2B by using a $\chi^2$ test on the difference in deviances, with degrees of freedom equal to the difference in the number of parameters between the two models:
    $\chi^2_{(5)} = 15216.1 - 15194.0 = 22.1$
    $\chi^2_{(df=5,\ \alpha=0.05)} = 11.07$
    Because the $\chi^2$ test statistic of 22.1 is greater than the critical value of 11.07 at the $\alpha = 0.05$ level with 5 degrees of freedom, I reject the null hypothesis and conclude that model II3B provides a significantly better fit than model II2B.
    4(b). To illustrate the interpretation of the coefficient on the STRICTxAGE48 interaction, let us choose two prototypical church members. Sally is a 48-year-old member of an evangelical church, and Frank is a 48-year-old member of a non-evangelical church. By placing their prototypical values into the fitted regression of model II3B, we can estimate their log-giving.
    $\widehat{\text{LGIVING}} = 5.352 + 1.011\,\text{STRICT} + 0.431\,\text{AGE28} + 0.911\,\text{AGE33} + 1.048\,\text{AGE38} + 1.386\,\text{AGE43} + 1.456\,\text{AGE48} - 0.000754\,\text{STRICTxAGE28} - 0.179\,\text{STRICTxAGE33} - 0.152\,\text{STRICTxAGE38} - 0.543\,\text{STRICTxAGE43} - 0.326\,\text{STRICTxAGE48}$ (1)
    $\widehat{\text{LGIVING}}_{Sally} = 5.352 + 1.011 + 1.456 - 0.326 = 7.493$ (2)
    $\widehat{\text{LGIVING}}_{Frank} = 5.352 + 1.456 = 6.808$ (3)
    $\widehat{\text{LGIVING}}_{Sally - Frank} = 7.493 - 6.808 = 0.685$ (4)
    As seen in equation (2), the inclusion of the STRICTxAGE48 interaction decreases the association between being an evangelical church member and giving in log dollars (as shown by the 1.011 coefficient on the variable STRICT) by 0.326, on average, in the population. It can also be said that
  • 5. the interaction decreases the association between being categorized in the 48-year-old age bracket and giving in log dollars (as shown by the 1.456 coefficient on AGE48) by 0.326, on average, in the population. Therefore, in the population, an evangelical church member in the 48-year-old category, on average, gives 0.685, not 1.011, more log dollars than a non-evangelical church member in the 48-year-old category, as shown in equation (4).
    5(a). This model is the total-regression model. Although it is sufficient to estimate individual-level coefficients, it ignores group membership, thus leading to correlation among residuals. This is not the case in a random-effects model, which assumes that residuals at both the individual level and the group level are independent and normally distributed.
    5(b). This model is a fixed-effects model. This model does consider group membership. However, unlike the random-effects model, the fixed-effects model accounts for group-level residuals by estimating multiple, group-specific intercepts. Consequently, inferences about coefficients on level-two variables cannot be made, and therefore comparisons between different groups cannot be made. As mentioned earlier, a random-effects model, in contrast, assumes that group-level population residuals are normally distributed and independent; hence, it is possible to infer level-two variable coefficients and compare different groups.
    6(a). $\operatorname{logit}(P(\text{COLLEGE} = 1)) = \beta_0 + \beta_1\,\text{MINORITY} + \beta_2\,\text{SES} + \beta_3\,\text{MINxSES}$
    6(b). $\operatorname{logit}(P(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0, \text{SES} = 1)) = \beta_0 + \beta_2$
    $\operatorname{logit}(P(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1, \text{SES} = 1)) = \beta_0 + \beta_1 + \beta_2 + \beta_3$
    $\beta_0 + \beta_2 = \beta_0 + \beta_1 + \beta_2 + \beta_3$
    $0 = \beta_1 + \beta_3$
    I test that $\beta_1 + \beta_3 = 0$. According to the first GLH test result on page 52 of the evidentiary materials, the $\chi^2$ test statistic is 18.86 (p<0.001). I, therefore, reject the null hypothesis that there
  • 6. is no difference in the log-odds of college-going between minority and non-minority adolescent males of low socioeconomic status (SES=1), on average, in the population.
    6(c). $\operatorname{logit}(P(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0, \text{SES} = 4.5)) = \beta_0 + 4.5\,\beta_2$
    $\operatorname{logit}(P(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1, \text{SES} = 4.5)) = \beta_0 + \beta_1 + 4.5\,\beta_2 + 4.5\,\beta_3$
    $\beta_0 + 4.5\,\beta_2 = \beta_0 + \beta_1 + 4.5\,\beta_2 + 4.5\,\beta_3$
    $0 = \beta_1 + 4.5\,\beta_3$
    Here, I test that $\beta_1 + 4.5\,\beta_3 = 0$. According to the second GLH test result on page 54 of the evidentiary materials, the $\chi^2$ test statistic is 12.10 (p<0.001). I, therefore, reject the null hypothesis that there is no difference in the log-odds of college-going between minority and non-minority adolescent males of high socioeconomic status (SES=4.5), on average, in the population.
    7(a). Fitted model in logit space:
    $\operatorname{logit}(\hat{P}(\text{COLLEGE} = 1)) = -4.838 + 1.444\,\text{MINORITY} + 1.502\,\text{SES} - 0.482\,\text{MINxSES}$
    Fitted model in probability space:
    $\hat{P}(\text{COLLEGE} = 1) = \dfrac{1}{1 + e^{-(-4.838 + 1.444\,\text{MINORITY} + 1.502\,\text{SES} - 0.482\,\text{MINxSES})}}$
    Estimated log-odds and probability of a non-minority adolescent male of low socioeconomic status going to college:
    $\operatorname{logit}(\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0, \text{SES} = 1)) = -3.336$
    $\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0, \text{SES} = 1) = 0.034$
    Estimated log-odds and probability of a minority adolescent male of low socioeconomic status going to college:
  • 7. $\operatorname{logit}(\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1, \text{SES} = 1)) = -2.374$
    $\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1, \text{SES} = 1) = 0.085$
    Estimated odds-ratio for a minority adolescent male of low socioeconomic status going to college versus a non-minority adolescent male of low socioeconomic status going to college:
    $\dfrac{\text{Odds}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1, \text{SES} = 1)}{\text{Odds}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0, \text{SES} = 1)} = \dfrac{0.085 / (1 - 0.085)}{0.034 / (1 - 0.034)} = \dfrac{0.093}{0.036} = 2.617$ (computed from the unrounded probabilities)
    7(b). Fitted model in logit space:
    $\operatorname{logit}(\hat{P}(\text{COLLEGE} = 1)) = -4.838 + 1.444\,\text{MINORITY} + 1.502\,\text{SES} - 0.482\,\text{MINxSES}$
    Fitted model in probability space:
    $\hat{P}(\text{COLLEGE} = 1) = \dfrac{1}{1 + e^{-(-4.838 + 1.444\,\text{MINORITY} + 1.502\,\text{SES} - 0.482\,\text{MINxSES})}}$
    Estimated log-odds and probability of a non-minority adolescent male of high socioeconomic status going to college:
    $\operatorname{logit}(\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0, \text{SES} = 4.5)) = 1.921$
    $\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0, \text{SES} = 4.5) = 0.872$
    Estimated log-odds and probability of a minority adolescent male of high socioeconomic status going to college:
    $\operatorname{logit}(\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1, \text{SES} = 4.5)) = 1.196$
    $\hat{P}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1, \text{SES} = 4.5) = 0.768$
    Estimated odds-ratio for a minority adolescent male of high socioeconomic status going to college versus a non-minority adolescent male of high socioeconomic status going to college:
  • 8. $\dfrac{\text{Odds}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 1, \text{SES} = 4.5)}{\text{Odds}(\text{COLLEGE} = 1 \mid \text{MINORITY} = 0, \text{SES} = 4.5)} = \dfrac{0.768 / (1 - 0.768)}{0.872 / (1 - 0.872)} = \dfrac{3.307}{6.828} = 0.484$
    7(c). Probabilities for going to college differ between white and minority adolescent males, not just according to race, but also according to socioeconomic status. As expected, college-going rates increase for both white and minority male adolescents as their socioeconomic status increases. However, there are some surprising racial differences in how these rates change. When comparing male students of only low socioeconomic backgrounds (SES=1), the estimated probability of going to college is greater for minority students (8.5%) than for white students (3.4%). In fact, on average, the odds of a minority adolescent male’s going to college are higher by a factor of 2.617 when compared to the odds of a white adolescent male’s going to college when the students are of low socioeconomic level. As socioeconomic status increases, however, the difference between their estimated probabilities decreases, as seen in the narrowing space between the solid and dashed curves between SES values of 1 and 3 in the graph on page 51 of the evidentiary materials. At approximately SES=3 (where the fitted difference in log-odds, $1.444 - 0.482\,\text{SES}$, equals zero), the fitted curves cross, and the estimated probability of going to college becomes higher for white than for minority adolescent males. For instance, at a high socioeconomic level (SES=4.5), the probability of a minority adolescent male’s going to college is 76.8% while that of a white adolescent male’s going to college is 87.2%. At SES=4.5, the odds ratio of a minority adolescent male’s going to college versus a white adolescent male’s going to college is only 0.484. In order to ensure that these college-going differences existed in the population and not only in the sample, I utilized two post-hoc GLH tests.
The first tested the null hypothesis that there is no difference in the population between the log-odds of a white and minority student’s going to college at a low socioeconomic level (SES=1). The second tested the null hypothesis that there is no difference in the population between the log-odds of a white and minority student’s going to college at a high socioeconomic level (SES=4.5). In both tests, the null hypothesis was rejected:
    Test 1 at SES=1: $\chi^2_{(1)} = 18.86$ ($p < 0.001$)
    Test 2 at SES=4.5: $\chi^2_{(1)} = 12.10$ ($p < 0.001$)
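The fitted probabilities and odds ratios reported in 7(a) through 7(c) follow mechanically from the four fitted coefficients. The sketch below reproduces them; the function names are illustrative, and it assumes MINxSES is simply the product of MINORITY and SES, as the model specification in 6(a) suggests.

```python
import math

# Fitted coefficients reported in 7(a) and 7(b).
B0, B_MIN, B_SES, B_INT = -4.838, 1.444, 1.502, -0.482

def fitted_logit(minority, ses):
    """Fitted log-odds of college-going for the given MINORITY and SES values."""
    return B0 + B_MIN * minority + B_SES * ses + B_INT * minority * ses

def fitted_probability(minority, ses):
    """Convert the fitted log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-fitted_logit(minority, ses)))

def minority_odds_ratio(ses):
    """Odds ratio, minority vs. non-minority, at a fixed SES value."""
    return math.exp(B_MIN + B_INT * ses)

# SES value at which the minority and non-minority fitted curves meet.
crossover_ses = -B_MIN / B_INT

for ses in (1, 4.5):
    print(
        f"SES={ses}: non-minority p={fitted_probability(0, ses):.3f}, "
        f"minority p={fitted_probability(1, ses):.3f}, "
        f"odds ratio={minority_odds_ratio(ses):.3f}"
    )
print(f"Fitted curves cross near SES={crossover_ses:.2f}")
```

Working from the logit difference rather than from rounded probabilities avoids the small rounding discrepancies that arise when the odds are first rounded to three decimals.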
  • 9. I, therefore, concluded that there are indeed significant differences in the population between the log-odds of white and minority adolescent males’ going to college at both low and high socioeconomic levels, not just in the sample.
    8(a).
    Week   Fitted Hazard Logit   Fitted Hazard Probability   Fitted Survival Probability
    2      -1.619                0.165342829                 0.834657171
    3      -1.6982               0.154700502                 0.705535288
    4      -2.042                0.114863236                 0.624495222
    5      -2.564                0.071491565                 0.579849081
    6      -2.583                0.070240558                 0.539120159
    8(b). To find the fitted hazard logit for week 2, I summed the constant (-1.682) and the coefficient for FEMALE (0.0630) since the constant serves as the reference for both the coefficients on the week dummy variables and the female variable. This generated a value of -1.619. Then to find the fitted hazard probability, I used the following equation:
    $p = \dfrac{1}{1 + e^{-\text{logit}}} = \dfrac{1}{1 + e^{-(-1.619)}} = 0.165$
    To find the fitted survival probability, I simply subtracted the fitted hazard probability from 1, giving me 0.835. In order to determine the fitted hazard logit for week 3, I summed the constant (-1.682), the coefficient for FEMALE (0.0630), and the coefficient for the week 3 dummy variable (-0.0792). This gave me a value of -1.6982. I then calculated the fitted hazard probability by plugging this value into the logit-to-probability conversion equation:
    $p = \dfrac{1}{1 + e^{-\text{logit}}} = \dfrac{1}{1 + e^{-(-1.6982)}} = 0.155$
    Finally, to find the fitted survival probability of week 3, I multiplied the difference between 1 and the fitted hazard probability of week 3 by the fitted survival probability of week 2:
    $(1 - 0.155) \times 0.835 = 0.706$
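The procedure described in 8(b) generalizes to every row of the table in 8(a): convert each fitted hazard logit to a hazard probability, then multiply the running survival probability by one minus that hazard. A minimal sketch follows, taking the fitted logits from the table as given; the function names are hypothetical.

```python
import math

# Fitted hazard logits for female students, copied from the table in 8(a).
fitted_logits = {2: -1.619, 3: -1.6982, 4: -2.042, 5: -2.564, 6: -2.583}

def hazard_from_logit(logit):
    """Convert a fitted hazard logit to a hazard probability."""
    return 1.0 / (1.0 + math.exp(-logit))

def life_table(logits_by_week):
    """Return (week, hazard, survival) rows, accumulating survival week by week."""
    rows = []
    survival = 1.0
    for week in sorted(logits_by_week):
        hazard = hazard_from_logit(logits_by_week[week])
        survival *= 1.0 - hazard  # survive this week given survival so far
        rows.append((week, hazard, survival))
    return rows

table = life_table(fitted_logits)
for week, hazard, survival in table:
    print(f"week {week}: hazard={hazard:.3f}, survival={survival:.3f}")
```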
  • 10. 8(c). I estimate the Kaplan-Meier 67th percentile survival time to occur during the fourth relative week of enrollment. Given that the fitted survival probability of week 3 for female students is 70.6% (i.e., that 70.6% of female students will log activity not only in their third but also in their fourth week of enrollment) and that the fitted survival probability of week 4 is 62.4%, it is reasonable to estimate that by sometime in their fourth week of enrollment, 33% of female students who enrolled in the first three absolute weeks of the course and who did not drop out in their first relative week will have logged their last activity.
    9(a). Whereas principal component analysis weights items in order to maximize variance and factor analysis weights items according to their reliability, a sum-score simply weights each item equally. Consequently, one would not be able to take advantage of inter-item correlations or item reliability to help explain the story behind the items. For example, in the Gambia test, items that indicate ownership of expensive products like a good roof or car may have been intentionally included under a theoretical motivation to measure a family’s overall net worth by placing more “importance” on the ownership of these expensive products. If a sum-score method were utilized, the items indicating ownership of expensive products would be weighted equally to items that indicate ownership of inexpensive products (such as radios), thus contradicting the original intent of the researchers. Even if it turned out that the researchers’ theoretical assumptions in the creation and use of their items were misguided, as with the low weight placed on CAR in the principal components analysis of the Gambia context, using a sum-score method would not identify those errors, unlike a principal components analysis or factor analysis.
Moreover, because one does not account for item reliability by using a sum-score, replication of items is not possible, therefore leaving researchers with only one version of the test. Having only one version means that researchers may not really know what their items are asking. We are discovering more and more the challenges of composing test items when it comes to construct validity. Often we cannot be certain what it is we are measuring with a specific item. Only through replication of items can we refine processes that attempt to ensure the construct validity of test items. 9(b). According to the principal components table on page 66 of the evidentiary materials, the indicators RADIO and CAR have the lowest weights (0.2345 and 0.1679, respectively) among all
  • 11. 9 indicators in component 1. This is most likely due to the low item-rest correlations for RADIO and CAR (0.2967 and 0.1783, respectively) as illustrated in the first table on page 65 of the evidentiary materials. If we attribute RADIO’s low item-rest correlation to a radio’s obsolescence and CAR’s low item-rest correlation to an automobile’s high cost, we may then interpret component 1 as a measurement of a family’s preference to spend money in order to save money. All the other indicators are weighted similarly, and all of them in some way or another save money or generate more money. All the structural improvements, of course, increase the value of the family’s home. The TV and the refrigerator also save money in that a family can stay home for entertainment and preserve their food for longer periods of time. In contrast, a radio may be inexpensive and entertaining, but its usefulness is increasingly outdated; hence, its weight is positive yet smaller than the weights of the aforementioned items. A car may save some money when traveling, but overall, it is a drain on financial resources due to maintenance costs and depreciation; therefore, its weight is the smallest of all the indicators.
    In contrast, RADIO and CAR have the two heaviest weights in component 2 among all the indicators. Indicators for ownership of a television and refrigerator are also positively weighted while indicators for possession of structural improvements are negatively weighted in this component. As a result, I interpret this component as a measure of a family’s ability to pay for energy, which I believe to be an interpretation uncorrelated with that of component 1. Televisions, refrigerators, and radios need electricity to operate. Automobiles need gasoline. I assume that only families that can afford to pay for these energy sources would also buy these products, lest they become useless.
I used the third table on page 65 and the scree plot on page 66 of the evidentiary materials to assist me in constructing a bi-dimensional basis for measuring a Gambian family’s financial position. The table demonstrates that component 1 has an eigenvalue of 5.488, which accounts for 60.98% of the original, standardized variance of the nine indicators ($5.488 / 9 = 0.6098$). Component 2 has an eigenvalue of 1.453, meaning that it accounts for 1.453 of the remaining 3.512 units ($9 - 5.488 = 3.512$) of standardized variance. According to the table, component 2 accounts for 16.14% of the original standardized variance. Together, components 1 and 2 account for 77.12% of the original standardized variance. While inspecting the scree plot, I notice that the “elbow rule” directs me to keep the first two components in constructing my basis for measuring a family’s financial position in the Gambia
  • 12. (Note: I consider the “elbow” to be at the 3rd component, and so I choose to keep the first two components). Furthermore, this decision is supported by the “rule of one”; the eigenvalues of components 1 and 2 are both greater than one. Therefore, my final principal components model has a bidimensional basis for measuring Gambian families’ financial positions.
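The variance shares and the “rule of one” check in 9(b) can be reproduced from the two reported eigenvalues. This is a small sketch assuming nine standardized indicators, so that the total variance equals 9; only the two eigenvalues quoted in the response are used, and the variable names are illustrative.

```python
# Eigenvalues reported for the first two principal components.
eigenvalues = {1: 5.488, 2: 1.453}
n_indicators = 9  # total standardized variance equals the number of indicators

# Proportion of the original standardized variance carried by each component.
proportions = {comp: value / n_indicators for comp, value in eigenvalues.items()}
cumulative = sum(proportions.values())

# "Rule of one": retain components whose eigenvalue exceeds 1.
retained = [comp for comp, value in eigenvalues.items() if value > 1.0]

for comp, share in proportions.items():
    print(f"component {comp}: {share:.2%} of standardized variance")
print(f"cumulative: {cumulative:.2%}")
print(f"retained under the rule of one: {retained}")
```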