This handout is connected to the Mentoring Program Evaluation & Goals webinar from Monday, May 16, 2011, as part of the free monthly webinar series from Friends for Youth's Mentoring Institute.
Evaluating Mentoring Programs P/PV
1. Public/Private Ventures Brief
Evaluating Mentoring Programs
Jean Baldwin Grossman
September 2009
2. Public/Private Ventures is a national leader in creating and strengthening programs that improve lives in low-income communities. We do this in three ways:
innovation
We work with leaders in the field to identify promising existing programs or develop new ones.
research
We rigorously evaluate these programs to determine what is effective and what is not.
action
We reproduce model programs in new locations, provide technical assistance where needed and inform policymakers and practitioners about what works.
P/PV is a 501(c)(3) nonprofit, nonpartisan organization with offices in Philadelphia, New York City and Oakland. For more information, please visit www.ppv.org.
Board of Directors
Matthew T. McGuire, Chair, Principal, Origami Capital Partners, LLC
Yvonne Chan, Principal, Vaughn Learning Center
The Honorable Renée Cardwell Hughes, Judge, Court of Common Pleas, The First Judicial District, Philadelphia, PA
Christine L. James-Brown, President and CEO, Child Welfare League of America
Robert J. LaLonde, Professor, The University of Chicago
John A. Mayer, Jr., Retired, Chief Financial Officer, J. P. Morgan & Co.
Anne Hodges Morgan, Consultant to Foundations
Siobhan Nicolau, President, Hispanic Policy Development Project
Marion Pines, Senior Fellow, Institute for Policy Studies, Johns Hopkins University
Clayton S. Rose, Senior Lecturer, Harvard Business School
Cay Stratton, Special Adviser, UK Commission for Employment and Skills
Sudhir Venkatesh, William B. Ransford Professor of Sociology, Columbia University
William Julius Wilson, Lewis P. and Linda L. Geyser University Professor, Harvard University
Research Advisory Committee
Jacquelynne S. Eccles, Chair, University of Michigan
Robert Granger, William T. Grant Foundation
Robinson Hollister, Swarthmore College
Reed Larson, University of Illinois
Jean E. Rhodes, University of Massachusetts, Boston
Thomas Weisner, UCLA
3. Acknowledgments
This brief is a revised version of a chapter written for the Handbook of Youth Mentoring edited by
David DuBois and Michael Karcher (2005). David and Michael provided many useful comments
on the earlier version. Laura Johnson and Chelsea Farley of Public/Private Ventures helped
revise the chapter to make it more accessible to non-mentoring specialists and provided great
editorial advice.
Additional reference: DuBois, D. L. and M. J. Karcher, eds. 2005. Handbook of Youth Mentoring.
Thousand Oaks, CA: Sage Publications, Inc.
4. Introduction
Questions about mentoring abound. Mentoring programs around the country are being asked by their funders and boards, “Does this mentoring program work?” Policymakers ask, “Does this particular type of mentoring (be it school-based or group or email) work?” These are questions about program impacts. Researchers and operators also want to know about the program’s processes: What about mentoring makes it work? How long should a match last to be effective? How frequently should matches meet? Does the level of training, support or supervision of the match matter? Does parental involvement or communication matter? What types of interactions between youth and mentors lead to positive changes in the child? Then there are questions about the populations served and what practices are most effective: Are particular types of youth more affected by mentoring than others? Are mentors with specific characteristics, such as being older or more educated, more effective than other mentors or more effective with particular subgroups of youth? Finally, researchers in particular are interested in the theoretical underpinning of mentoring. For example, to what degree does mentoring work by changing children’s beliefs about themselves (such as boosting self-esteem or self-efficacy), by shaping their values (such as their views about education and the future) or by improving their social and/or cognitive skills?
This article presents discussions of many issues that arise in answering both implementation (or process) questions and impact questions. Process questions are important to address even if a researcher is interested only in impacts, because one should not ask, “Does it work?” unless “it” actually occurred. The first section covers how one chooses appropriate process and impact measures. The next section discusses several impact design issues, including the inadequacies of simple pre/post designs, the importance of a good comparison group and several ways to construct comparison groups. The last section discusses common mistakes made when analyzing evaluation data and presents ways to avoid them. For a more complete discussion of evaluation in general, readers are referred to Rossi et al. (1999); Shadish et al. (2002); and Weiss (1998). Due to space constraints, issues entailed in answering mediational questions are not addressed here.
5. Measurement Issues
A useful guide in deciding what to measure is a program’s logic model or theory of change: the set of hypothesized links between the program’s action, participants’ response and the desired outcomes. As Weiss states, with such a theory in hand, “The evaluation can trace the unfolding of the assumptions” (1998, 58). Rhodes et al. (2005) present one possible theory of change for mentoring. Process measures describe the program’s actions; outcome measures describe what effects the program has.
Process Measures
The first question when examining a program is: What exactly is the program as experienced by participants? The effect the program will have on participants depends on the realities of the program, not on its official description. All too frequently in mentoring programs, relatively few strong relationships form and matched pairs stop meeting. Process questions can be answered, however, at several levels. Most basically, one wants to know: Did the program recruit appropriate youth and adults? Did adults and youth meet as planned? Did all the components of the program happen? Were mentors trained and supervised as expected?
To address these questions, one examines the characteristics and experiences of the participants, mentors and the match, and compares them with the program’s expectations. For example, a mentoring program targeting youth involved in criminal or violent activity tracked the number of arrests of new participants to determine whether it was serving its desired target population (Branch 2002). A high school mentoring program for struggling students tracked the GPAs of enrolled youth (Grossman and Johnson 1999). Two match characteristics commonly examined are the average completed length of the relationship and the average frequency of interaction. Like all good process measures, they relate to the program’s theory. To be affected, a participant must experience a sufficient dosage of the intervention. Some mentoring programs have more detailed ideas, such as wanting participants to experience specific program elements (academic support, for example, or peer interaction). If these are critical components of the program theory, they also make good candidates for process measures.
A second level of process question concerns the quality of the components: How good are the relationships? Are the training and supervision useful? These are more difficult dimensions to measure. Client satisfaction measures, such as how much youth like their mentors or how useful the mentors feel the training is, are one gauge of quality. However, clients’ assessment of quality may not be accurate; as many teachers say, the most enjoyable class may not be the class that promotes the most learning. Testing mentors before and after training is an alternative quality measure. Assessing the quality of mentoring relationships is a relatively unexplored area. Grossman and Johnson (1999) and Rhodes et al. (2005) propose some measures.
From a program operator’s or funder’s perspective, how much process information is “enough” depends on striking a balance between knowing exactly what is happening in the program versus recognizing the service the staff could have provided in lieu of collecting data. Researchers should assess enough implementation data to be sure the program is actually delivering the services it purports to offer at a level and quality consistent with having a detectable impact before spending the time and money to collect data on outcomes. Even if no impact is expected, it is essential to know exactly what did or did not happen to the participants to understand one’s findings. Thus, researchers may want to collect more process data than typically would be collected by operators to improve both the quality of their generalizations and their ability to link impacts to variation in participants’ experiences of core elements of the program.
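As a rough illustration of the two match characteristics named in this section (average completed length of the relationship and average frequency of interaction), the sketch below tallies them from match records and compares them with a program's expectations. The record fields, the sample values and the expectation of a 12-month match are all hypothetical assumptions, not figures from any program discussed in this brief.

```python
from statistics import mean

# Hypothetical match records: months the match lasted and total meetings held.
match_records = [
    {"months": 14, "meetings": 40},
    {"months": 3, "meetings": 5},
    {"months": 9, "meetings": 30},
    {"months": 12, "meetings": 36},
]

# Hypothetical program expectation (an assumption for illustration only).
EXPECTED_MONTHS = 12


def process_summary(records):
    """Return simple process measures for a list of match records."""
    avg_length = mean(r["months"] for r in records)
    # Meeting frequency is computed per match, then averaged across matches.
    avg_frequency = mean(r["meetings"] / r["months"] for r in records)
    # Share of matches that lasted at least as long as the program expects.
    share_expected = mean(
        1 if r["months"] >= EXPECTED_MONTHS else 0 for r in records
    )
    return {
        "avg_length_months": avg_length,
        "avg_meetings_per_month": avg_frequency,
        "share_reaching_expected_length": share_expected,
    }


summary = process_summary(match_records)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Reporting the share of matches that reach the expected length alongside the average is a design choice: an average alone can hide the case where relatively few matches last while a handful run very long.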
6. Lesson: Tracking process measures is important to program managers but essential for evaluators. Before embarking on an evaluation of impacts, be sure the program is delivering its services at a quality and intensity that would lead one to expect impacts.
Outcome Measures
An early task for an impact evaluator is to refine the “Does it work?” question into a set of testable evaluation questions. These questions need to specify a set of outcome variables that will be examined during the evaluation. There are two criteria for a good outcome measure (Rossi et al. 1999). First, the outcome can be realistically expected to change during the study period given the intensity of the intervention. Second, the outcome is measurable and the chosen measure sensitive enough to detect the likely change.
Evaluation questions are not program goals. Many programs rightly have lofty inspirational goals, such as enabling all participants to excel academically or to become self-sufficient, responsible citizens. However, a good evaluation outcome must be concrete, measurable and likely to change enough during the study period to be detected. Thus, for example, achieving a goal like “helping youth academically excel” could be gauged by examining students’ grades or test scores.
In addition, when choosing the specific set of outcomes that will indicate a goal such as “academically excelling,” one must consider which of the possible variables are likely to change given the program dosage participants will probably receive during the evaluation period. For example, researchers often have found that reading and math achievement test scores change less quickly than do reading or math grades, which, in turn, change less quickly than school effort. Thus, if one is evaluating the school-year (i.e., nine months) impact of a school-based mentoring program, one is likely to want to examine effort and grades rather than test scores, or at least in addition to test scores. Considerable care and thought need to go into deciding what outcomes data should be collected and when. Examining impacts on outcomes that are unlikely to change during the evaluation period can give the false impression that the program is a failure, when in fact the impacts on the chosen variables may not yet have emerged.
A good technique for selecting variables is to choose a range of proximal to more distal expected impacts based on the program’s theory of change, which also represents a set of impacts ranging from modestly to impressively effective (Weiss 1998). Unfortunately, one cannot know a priori how long matches will last or how often the individuals will meet. Thus, it is wise to include some outcomes that are likely to change even with rather limited exposure to the intervention, and some outcomes that would change with greater exposure, thus setting multiple “bars.” The most basic effectiveness goal is an outcome that everyone agrees should be achievable. From there, one can identify more ambitious outcomes.
Public/Private Ventures’ evaluation of Big Brothers Big Sisters (BBBS) provides a good example of this process (Grossman and Tierney 1998). Researchers conducted a thorough review of BBBS’s manual of standards and practices to understand the program’s logic model and then, by working closely with staff from the national office and local agencies, generated multiple outcome bars. The national manual lists four “common” goals for a Little Brother or Little Sister: providing social, cultural and recreational enrichment; improving peer relationships; improving self-concept; and improving motivation, attitude and achievement related to schoolwork. Conversations with BBBS staff also suggested that having a Big Brother or Big Sister could reduce the incidence of antisocial behaviors such as drug and alcohol use and could improve a Little Brother’s or Little Sister’s relationship with his or her parent(s). Using previous research, the hypothesized impacts were ordered from proximal to distal as follows: increased opportunities for social and cultural enrichment, improved self-concept, better relationships with family and friends, improved academic outcomes and reduced antisocial behavior.
At a minimum, the mentoring experience was expected to enrich the cultural and social life of youth, even though many more impacts were anticipated. Because motivational psychology research shows that attitudes often change before behaviors, the next set of
7. outcomes reflected attitudinal changes toward themselves and others. The “harder” academic and antisocial outcomes then were specified. Within these outcomes, researchers also hypothesized a range of impacts, from attitudinal variables, such as the child’s perceived sense of academic efficacy and value placed on education, to some intermediate behavioral changes, such as school attendance and being sent to the principal’s office, to changes in grades, drug and alcohol use, and fighting.
Once outcomes are identified, the next question is how to measure them. Two of the most important criteria for choosing a measure are whether the measure captures the exact facet of the outcome that the program is expected to affect and whether it is sensitive enough to pick up small changes. For example, an academically focused mentoring program that claims to increase the self-esteem of youth may help youth feel more academically competent but not improve their general feelings of self-worth. Thus, one would want to use a scale targeting academic self-worth or competence rather than a global self-worth scale, or select a scale that can measure both. The second consideration is the measure’s degree of sensitivity. Some measures are extremely good at sorting a population or identifying a subgroup in need of help but poor in detecting the small changes that typically result from programs. For example, in this author’s experience, the Rosenberg self-esteem scale (1979) is useful in distinguishing adolescents with high and low self-esteem but often is not sensitive enough to detect the small changes in self-esteem induced by most youth programs. On the other hand, measures of academic or social competency beliefs (Eccles et al. 1984) can detect relatively small changes.
Lesson: Choose outcomes that are integrally linked to the program’s theory of change, that establish multiple “effectiveness bars,” that are gauged with sensitive measures and that can be achieved within the evaluation’s time frame and in the context of the program’s implementation.
Choosing Informants
Another issue to be resolved for either process or outcome measures is from whom to collect information. For mentoring programs, the candidates are usually the youth, the mentor, a parent, teachers and school records.
Information from each source has advantages and disadvantages. For example, for some variables, such as attitudes or beliefs, the youth may be the only individual who can provide valid information. Youth, for example, arguably are uniquely qualified to report on constructs such as their self-esteem (outcome measures) or considerations such as how much they like their mentors or whether they think their mentors support and care for them (process measures). Theoretically, what may be important is not what support the mentor actually gives but how supportive the youth perceives the mentor to be (DuBois et al. 2002).
On the other hand, youth-reported data may be biased. First, youth may be more likely to give socially desirable answers, recounting higher grades or less antisocial behavior. If this bias is different for mentored versus nonmentored youth, impact estimates based on these variables could be biased. Second, the feelings of youth toward their mentors may taint their reporting. For example, if the youth does not like the mentor’s style, he or she may selectively report or overreport certain negative experiences, such as the mentor missing meetings, and underreport others of a more positive nature, such as the amount of time the mentor spends providing help with schoolwork. Similarly, the youth may overstate a mentor’s performance to make the mentor look good. Last, the younger the child is, the less reliable or subtle the self-report. For this reason, when participants are quite young (8 or 9 years old), it is advisable to collect information from their parents and/or teachers.
The mentor often can be a good source of information about what the mentoring experience is like, such as what the mentor and mentee do and talk about (process measures), and as a reporter on the child’s behaviors at posttest (outcome measures). The main problem with mentor reporting is that mentors have an incentive to report positively on their relationships with youth and to see effects even if there are none, justifying why they are spending time with the child. Although there may be a positive bias, this does not preclude mentors’ being accurate in reporting relative impacts. This is because most mentors do not report that their mentees have improved equally in all areas. The pattern of difference in these reports, especially if consistent
8. with those obtained from other sources, such as school records, may provide useful information about the true impacts.
Parents also can be useful as reporters. They may notice that the child is trying harder in school, for example, even though the child might not notice the change. However, like the mentor, parents may project changes that they wish were happening or be unaware of certain behaviors (e.g., substance use).
Finally, teachers may be good reporters on the behaviors of their students during the school day. Teachers who are familiar with age-appropriate behavior, for example, may spot a problem when a parent or mentor does not. However, teachers are extraordinarily busy, and it can be difficult for them to find the time to fill out evaluation forms on the participants. In addition, teachers too are not immune to seeing what they want to see, and as with mentors and parents, the caveat about relative impacts applies here.
Information also can be collected from records. Data about the occurrence of specific events (fights, cut classes, principal visits) are less susceptible to bias, unless the sources of these data (e.g., school administrators making discipline decisions) differentially judge or report events for mentored youth versus other youth.
Lesson: Each respondent has a unique point of view, but all are susceptible to reporting what they wish had happened. Thus, if time and money allow, it is advantageous to examine multiple perspectives on an outcome and triangulate on the impacts. What is important is to see a consistent pattern of impacts (not uniform consistency among the respondents). The more consistency there is, the more certain one can be that a particular impact occurred. For example, if the youth, parent and teacher data all indicate school improvement and test scores also increase, this would be particularly strong evidence of academic gains. Conversely, if only one of these measures exhibits change (e.g., parent reports), it could be just a spurious finding.
9. Design Issues
Answering the questions “Does mentoring work?” and “For whom?” may seem relatively straightforward, achievable simply by observing the changes in mentees’ outcomes. But these ostensibly simple questions are harder to answer than one might assume.
The Fallacy of Pre/Post Comparisons
The changes we observe in the attitudes, behaviors or skills of youth while they are being mentored are not equivalent to program impacts. How can that be? The answer has to do with what statisticians call internal validity. Consider, again, the previously described BBBS evaluation. If one looks only at changes in outcomes for treatment youth (Grossman, Johnson 1999), one finds that 18 months after they applied to the program, 7 percent had reported starting to use drugs. On the face of it, it appears that the program was ineffective; however, during the same period, 11 percent of the controls had reported starting to use drugs. Thus, rather than showing the program to be ineffective, this statistically significant difference indicates that BBBS was able to stem some of the naturally occurring increases in drug use.
The critical distinction here is the difference between outcomes and impacts. In evaluation, an outcome is the value of any variable measured after the intervention, such as grades. An impact is the difference between the outcome observed and what it would have been in the absence of the program (Rossi et al. 1999); in other words, it is the change in the outcome that was caused by the program. Simple changes in outcomes may in part reflect the program’s impact but also might reflect other factors, such as changes due to maturation.
Lesson: A program’s impact can be gauged accurately (i.e., be internally valid) only if one knows what would have happened to the participants had they not been in the program. This hypothetical state is called the “counterfactual.” Because one cannot observe what the mentees would have done in the absence of the program, one must identify another group of youth, namely a comparison group, whose behavior will represent what the participants’ behavior would have been without the program. Choosing a group whose behavior accurately depicts this hypothetical no-treatment (or counterfactual) state is the crux of getting the right answer to the effectiveness question, because a program’s impacts are ascertained by comparing the behavior of the treatment or participant group with that of the selected comparison group.
Matched Comparison or Control Group Construction
There are two principal types of comparison groups: control groups generated through random assignment and matched comparison groups selected judgmentally by the researcher.
Experimental Control Groups
Random assignment is the best way to create two groups that would change comparably over time. In this type of evaluation, eligible individuals are assigned randomly, either to the control group and not allowed into the program, or to the treatment group, whose members are offered the program. (Note: “Treatments” and “controls” refer to randomly selected groups of individuals. Not all treatments may choose to participate. The term “participants” is used to refer to individuals who actually receive the program.)
The principal advantage of random assignment is that given large enough groups, on average, the two groups are statistically equivalent with respect to all characteristics, observed and unobserved, at the time the two groups are formed. If nothing were done to either group, their behaviors, on average, would continue to be statistically equivalent at any point in the future. Thus, if after the intervention the average behavior of the two groups differs, the difference can be confidently and causally linked to the program,
10. which was the only systematic difference between the two groups. See Orr (1999) for a discussion of how large each group should be.
Although random assignment affords the most scientifically reliable way of creating two comparable groups, there are many issues that should be considered before using it. Two of the most difficult are “Can random assignment be inserted into the program’s normal process without qualitatively changing the program?” and “Is it ethical to deny certain youth a mentor?” However, it is worth noting that all programs ration their services, primarily by not advertising to more people than they can serve. Random assignment gives all needy children an equal probability of being served, rather than denying children who need a mentor by not telling them about the program. The reader is referred to Dennis (1994) for a detailed discussion of the ethical issues involved in random assignment.
With respect to the first issue, consider first how the insertion of random assignment into the intake process affects the program. One of the misconceptions about random assignment among mentoring staff is that it means randomly pairing youth with adults. This is not the case. Random pairing would fundamentally change the program, and any evaluation of this altered program would not provide information on the effect of the actual program. A valid use of random assignment would entail randomly dividing eligible applicants between the treatment and control groups, then processing the treatment group youth just as they normally would be handled and matched. Under this design, random assignment affects only which youth files come across the staff’s desk for matching, not what happens to youth once they are there. Another valid test would involve identifying two youth for every volunteer, then randomly assigning one child to the treatment group and one to the control group. For the BBBS evaluation, we used the former method because it was significantly less burdensome and emotionally more acceptable for the staff. However, the chosen design meant that not all treatment youth actually received a mentor. As will be discussed later, only about three quarters of the youth who were randomized into the treatment group and offered the program actually received mentors. (See Orr 1999 for a rich discussion of all aspects of random assignment.)
Matched (or Quasi-Experimental) Comparison Groups
Random assignment is not always possible. For example, programs may be too small or staff may refuse to participate in such an evaluation. When this is the case, researchers must identify a group of nonparticipant youth whose outcomes credibly represent what would have happened to the participants in the absence of the program. The weakness of the methodology is that the outcomes of the two groups can differ not only because one group got a mentor and the other did not but also because of other differences between the groups. To generate internally valid estimates of the program’s impacts, one must control for the “other differences” through statistical procedures such as regression analysis, through careful matching, or both.
The researcher selects a comparison group of youth who are as similar as possible to the participant group across all the important characteristics that may influence outcomes in the counterfactual state (the hypothetical no-treatment state). Some key characteristics are relatively easy to identify and match for (e.g., age, race, gender or family structure). However, to improve the credibility of a matched comparison group, one needs to think deeply about other potential differences that could affect the outcome differential, such as whether one group of youth comes from families that care enough and are competent enough to search out services for their youth, or how comfortable the youth are with adults. These critical yet hard-to-measure variables are factors that are likely to systematically differ between participant and comparison group youth and to substantially affect one or more of the outcomes being examined. The more readers of an evaluation can think of such variables that have not been accounted for, the less they will believe the resulting program impact estimates.
Consider, for example, an email mentoring program. Not only would one want the comparison group to match the participant group on demographic characteristics, namely age (say, 12, 13, 14 or 15 years old), gender (male, female), race (white, Hispanic, black) and income (poor, nonpoor), but one might also want to match the two groups on their preprogram use of the computer, such as the average number of hours per week spent
11. using email or playing computer games. To match on this variable, however, one would have to collect computer use data on many nonparticipant youth to find those most comparable to the participants.
When one has more than a few matching variables, the number of cells becomes too large. In the above example, we would have 4 age × 2 gender × 3 race × 2 income, or 48 cells, even before splitting by computer use. A method that is used with increasing frequency to address this issue is propensity score matching (PSM). A propensity score is the probability of being a participant given a set of known factors. In simple random assignment evaluations, the propensity score of every sample member is 50 percent, regardless of his or her characteristics. In the real world, without random assignment, the probability of being a participant depends on the individual’s characteristics, such as his or her comfort with computers in the example above. Thus, participants and nonparticipants naturally differ with regard to many characteristics. PSM can help researchers select which nonparticipants best match the participant group with respect to a weighted average of all these characteristics (where the weights reflect how important the factors are in making the individual a participant).
To calculate these weights, the researcher estimates, across both the participant and nonparticipant samples, a logistic model of the probability of being a participant (P_i) as a function of the matching variables and all other factors that are hypothesized to be related to participation (Rosenbaum, Rubin 1983; Rubin 1997). For example, if one were evaluating a school-based mentoring program, the equation might include age, gender, race, household status (HH) and reduced-price-lunch status (RL), as well as past academic (GPA) and behavior (BEH) assessments, as is shown in Equation 1 below. Obtaining teacher ratings of the youth’s interpersonal skills (SOC) also would help match on the youth’s ability to form a relationship.
(1) P_i = f(age, gender, race, HH, RL, GPA, BEH, SOC)
The next step of PSM is to compute for each potential member of the sample the probability of participation based on the matching characteristics in the regression. Predicted probabilities are calculated for both participants and all potential nonparticipants. Each participant then is matched with one or more nonparticipant youth based on these predicted propensity scores. For example, for each participant, the nonparticipant with the closest predicted participation probability can be selected into the comparison group. (See Shadish et al. 2002, 161–165, for further discussion of PSM, and Dynarski et al. 2003 for an application in a school-based setting.)
An implication of this technique is that one needs data for the propensity score logit from both the participant group and a large pool of nonparticipant youth who will be considered for inclusion in the comparison group. The larger the considered nonparticipant pool is, the more likely it is that one can find a close propensity score match for each participant. This data requirement often pushes researchers to select matching factors that are readily available through records rather than incur the expense of collecting new data.
One weakness of this method is that although the propensity to participate will be quite similar for the participant and comparison groups, the percentage with a particular characteristic (such as male) may not be, because PSM matches on a linear combination of characteristics, not each characteristic one by one. To overcome this weakness, most studies match propensity scores within a few demographically defined cells (such as race/gender).
PSM also balances the two groups only on the factors that went into the propensity score regression. For example, the PSM in Dynarski et al. (2003) was based on data gathered from 21,000 students to generate a comparison group for their approximately 2,500 participants. However, when data were collected later on parents, it turned out that comparison group students were from higher-income families. No matter how carefully a comparison group is constructed, one can never know for sure how similar this group is to the participant group on unmeasured characteristics, such as their ability to respond to adult guidance.
11
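The matching step can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the logit coefficients, the two predictors (gpa, soc) and all youth records are made up for illustration, and a real evaluation would first estimate Equation 1 from its own participant and nonparticipant data. The function names are hypothetical. Matching is done to the nearest propensity score, without replacement, inside gender cells, in the spirit of the within-cell matching described above.

```python
import math

# Hypothetical logit coefficients standing in for a fitted Equation 1.
COEFS = {"intercept": -1.0, "gpa": -0.5, "soc": 0.8}

def propensity(youth):
    """Predicted probability of participating, from the assumed logit."""
    z = (COEFS["intercept"]
         + COEFS["gpa"] * youth["gpa"]
         + COEFS["soc"] * youth["soc"])
    return 1.0 / (1.0 + math.exp(-z))

def match_within_cells(participants, pool, cell_keys=("gender",)):
    """Nearest-neighbor match on the propensity score, without replacement,
    using only nonparticipants from the same demographic cell."""
    available = list(pool)
    pairs = []
    for p in participants:
        cell = tuple(p[k] for k in cell_keys)
        candidates = [c for c in available
                      if tuple(c[k] for k in cell_keys) == cell]
        if not candidates:
            continue  # nobody left in this cell; participant stays unmatched
        best = min(candidates,
                   key=lambda c: abs(propensity(c) - propensity(p)))
        available.remove(best)
        pairs.append((p, best))
    return pairs

# Made-up records for illustration only.
participants = [
    {"id": "P1", "gender": "F", "gpa": 2.0, "soc": 3.0},
    {"id": "P2", "gender": "M", "gpa": 3.0, "soc": 2.0},
]
pool = [
    {"id": "N1", "gender": "F", "gpa": 2.1, "soc": 3.0},
    {"id": "N2", "gender": "F", "gpa": 3.9, "soc": 1.0},
    {"id": "N3", "gender": "M", "gpa": 3.1, "soc": 2.0},
    {"id": "N4", "gender": "M", "gpa": 1.0, "soc": 4.0},
]

pairs = match_within_cells(participants, pool)
matched_ids = [(p["id"], c["id"]) for p, c in pairs]
print(matched_ids)
```

Because matching happens inside gender cells, the matched comparison group has the same gender mix as the participants by construction. Any characteristic left out of both the cells and the score, such as family income in the Dynarski et al. example, remains unbalanced.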
Lesson: How much a reader trusts the internal validity of an evaluation depends on how much he or she trusts that the comparison group truly is similar to the participant group on all important dimensions. This level of trust or confidence is quantifiable in random assignment designs (e.g., one is 95 percent confident that the two groups are statistically equivalent), whereas with a quasi-experimental design, this level of trust is uncertain and unquantifiable.
Analysis

This section covers how impact estimates are derived, from the simplest techniques to more statistically sophisticated ones. Several commonly committed errors and techniques used to overcome these problems are presented.

The Basics of Impact Estimates

Impact estimates for both experimental and quasi-experimental evaluation are basically determined by contrasting the outcomes of the participant or treatment group with those of the control or comparison group. If one has data from a random assignment design, the simplest unbiased impact estimate is the difference in mean follow-up (or posttest) outcomes for the treatment and control groups, as in Equation 2,

(2) b = Mean(Yfu,T) − Mean(Yfu,C)

where b is the estimated impact of the program, Yfu,T is the value of outcome Y at posttest or follow-up for the treatment group youth, and Yfu,C is the value of outcome Y at posttest or follow-up for the control group youth. One can increase the precision of the impact estimate by calculating the change-score or difference-in-difference estimator as in Equation 3,

(3) b = Mean(Yfu,T − Ybl,T) − Mean(Yfu,C − Ybl,C)

where Ybl,T is the value of outcome Y at baseline for the treatment group youth, and Ybl,C is the value of outcome Y at baseline for the control group youth.

Even more precision can be gained if the researcher controls for other covariate factors that affect the outcome through the use of regression, as in Equation 4,

(4) Yfu = a + bT + cYbl + dX + u

where b is the estimated impact of the program, T is a dummy variable equal to 1 for treatments and 0 for controls, X is a vector of baseline covariates that affect Y, and u represents unmeasured factors. Another way to think of b is that it is basically the difference in the mean Ys, adjusting for differences in Xs.

When data are from a quasi-experimental evaluation, it is always best to estimate impacts using regression or analysis of covariance; not only does one get more precise estimates, but one can control for any differences that do arise between the participant and the comparison groups. Regression simulates what outcomes youth who were exactly like participants on all the included characteristics (the Xs) would have had if they had not received a mentor, assuming that all factors that jointly affect participation and outcomes are included in the regression. Regressions are also useful in randomized experiments for estimating impacts more precisely.

Suspicious Comparisons

The coefficient b from Equation 3 is an unbiased estimate of the program’s impact (i.e., the estimate differs from the true impact only by a random error with mean of zero) as long as the two groups are identical on all characteristics (both included and excluded variables). The key to obtaining an unbiased estimate of the impact is to ensure that one compares groups of youth that are as similar as possible on all the important observable and unobservable characteristics that influence outcomes. Although many researchers understand the need for comparability and indeed think a lot about it when constructing a matched comparison group, this profound insight is often forgotten in the analysis phase, when the final comparisons are made. Most notably, if one omits youth from either group, whether the randomly selected treatment (or self-selected participant) group or the randomly selected control (or matched comparison) group, the resulting impact estimate is potentially biased. Following is a list of commonly seen yet flawed comparisons related to this concern.
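Equations 2 and 3 can be written out in a few lines of Python. This is a minimal sketch, not the procedure used in any particular study: the function names are hypothetical and the outcome values (imagined grade point averages for four treatment and four control youth) are made up for illustration.

```python
from statistics import mean

def difference_in_means(treat_fu, control_fu):
    """Equation 2: Mean(Yfu,T) - Mean(Yfu,C)."""
    return mean(treat_fu) - mean(control_fu)

def difference_in_differences(treat_bl, treat_fu, control_bl, control_fu):
    """Equation 3: mean change for treatments minus mean change for controls."""
    treat_change = mean([fu - bl for bl, fu in zip(treat_bl, treat_fu)])
    control_change = mean([fu - bl for bl, fu in zip(control_bl, control_fu)])
    return treat_change - control_change

# Made-up baseline (bl) and follow-up (fu) outcomes, one value per youth.
treat_bl = [2.0, 2.5, 3.0, 2.5]
treat_fu = [2.5, 3.0, 3.0, 3.5]
control_bl = [2.0, 3.0, 2.5, 3.5]
control_fu = [2.0, 3.0, 3.0, 3.0]

eq2 = difference_in_means(treat_fu, control_fu)
eq3 = difference_in_differences(treat_bl, treat_fu, control_bl, control_fu)
print(eq2, eq3)
```

In this made-up sample the control group happened to start with a higher baseline mean, so the change-score estimate (0.5) is larger than the simple follow-up difference (0.25). Both functions use every youth in each group, which is the comparison the rest of this section argues for.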
Suspect Comparison 1: Comparing groups of youth based on their match status, such as comparing those who received a mentor or youth whose matches lasted at least one month with the control or comparison group. Suppose, as occurred in the Public/Private Ventures evaluation of BBBS’s community-based mentoring program, only 75 percent of the treatment group actually received mentors (Grossman, Tierney 1998). Can one compare the outcomes of the 75 percent who were mentees with the controls to get an unbiased estimate of the program’s impact? No. All the impact estimates must be based on comparisons between the entire treatment group and the entire control group to maintain the complete comparability of the two groups. (This estimate often is referred to as the impact of the “intent to treat.”)

There are undoubtedly factors that are systematically different between youth who form mentoring relationships and those who do not. The latter youth may be more difficult temperamentally, or their families may have decided they really did not want mentors and withdrew from the program. If researchers remove these unmatched youth from the treatment group but do nothing with the control group, they could be comparing the “better” treatment youth with the “average” control group child, biasing the impact estimates. Randomization ensures that the treatment and control groups are equivalent (i.e., there are just as many “better” youth in the control group as the treatment group). After the intervention, matched youth are readily identified. Researchers, however, cannot identify the control group youth who would have been matched successfully had they been given the opportunity. Thus, if one discarded the unmatched treatment youth, implicitly one is comparing successfully matched youth to a mixed group: those for whom a match would have been found (had they been offered participation) and those for whom matches would not be found (who are perhaps harder to serve). An impact estimate based on such a comparison is potentially biased in favor of the program’s effectiveness. (The selection bias embedded in matching is the reason researchers might choose to compare the outcomes of a matched comparison group with the outcomes of mentoring program applicants, rather than participants.)

On the other hand, the estimate based on all treatments and all controls, called the “intent-to-treat effect,” is unaffected by this bias.

Because the intent-to-treat estimate is based on the outcomes of all of the treatment youth, whether or not they received the program, it may underestimate the “impact on the treated” (i.e., the effect of actually receiving the treatment). A common way to calculate the “impact on the treated” is to divide the intent-to-treat estimate by the proportion of youth actually receiving the program (Bloom 1984). The intent-to-treat estimate is a weighted average of the impact on the treated youth (ap) and the impact on the untreated youth (anp), as shown in Equation 5,

(5) Mean(T) − Mean(C) = a = p ap + (1 − p) anp

where p = proportion treated.

If the effect of group assignment on the untreated youth (anp) is zero (i.e., untreated treatment individuals are neither hurt nor helped), then ap is equal to a/p. Let’s again take the example of the BBBS evaluation. Recall that 18 months after random assignment, 7 percent of the treatment group youth (the treated and untreated) had started using drugs, compared with 11 percent of the control group youth, a 4-percentage-point reduction. Using the knowledge that only 75 percent of the youth actually received mentors, the “impact on the treated” of starting to use drugs would increase from a 4-percentage-point reduction to a 5.3-percentage-point reduction (= 4/.75).

Similar bias occurs if one removes control group members from the comparison. Reconsider the school-based mentoring example described above, where treatment youth are offered mentors and control youth are denied mentors for one year. Suppose that although most youth participate for only a year, some continue their matches into a second school year. To gauge the impact of this longer intervention, the evaluator might (incorrectly) consider comparing youth who had mentors for two years with control youth who were not matched after their one-year denial period. This comparison has several problems. Youth who were able to sustain their relationships into a second year, for example, would likely be better able to relate
to adults and perhaps more malleable to a mentoring intervention than the “average” originally matched comparison group member. An unbiased way to examine these program impacts would be to compare groups that were assigned randomly at the beginning of the evaluation: one group being offered the possibility of a two-year match and the other being denied the program for two years. To investigate both one- and two-year versions of the program, applicants would need to be randomized into one of three groups: one group offered the possibility of a two-year match, one group offered the possibility of a one-year match and one group denied the program for the full two years.

Lesson: The only absolutely unbiased estimate from a random assignment evaluation of a mentoring program is based on the comparison of all treatments and all controls, not just the matched treatments or those matched for particular lengths of time.

Suspect Comparison 2: Comparing effects based on relationship characteristics, such as short matches with longer matches or closer relationships with less close relationships. Grossman and Rhodes (2002) examined the effects of different lengths of matches using the BBBS evaluation data. In the first part of the paper, the researchers reported the straightforward comparisons between outcomes of those matched less than 6 months, 6 to 12 months and more than 12 months with the control group’s outcomes. Although interesting, these simple comparisons ignore the potential differences among youth who are able to sustain their mentoring relationships for different periods of time. If the different match lengths were induced randomly across pairs or the reasons for a breakup were unrelated to the outcomes being examined, then there would be no problem with the simple set of comparisons. However, if, for example, youth who cannot form relationships that last more than five months are less able to get the adult attention and resources they need and consequently would do worse than longer-matched youth even without the intervention, then the first set of comparisons would produce biased impact estimates. Indeed, when the researchers statistically controlled for this potential bias (using two-staged least squares regression, as discussed below), they saw evidence of the strong association of short matches with negative outcomes disappear, while the indications of positive effects of longer matches remained.

A similar problem occurs when comparing youth with close relationships with those with weaker relationships. For the straightforward comparison to be valid, one is implicitly assuming that youth who ended up with close relationships with their mentors would have, in the absence of the program, fared equally well or poorly as youth who did not end up with close relationships. If those with closer relationships would have, without the program, been better able to secure adult attention than the other youth and done better because of it, for example, then a comparison of the close-relationship youth with either youth in less-close relationships or with the control/matched comparison group could be flawed.

Lesson: Any examination of groups defined by a program variable (such as having a mentor, the length of the relationship, or having a cross-race match) is potentially plagued by selection bias regardless of the evaluation design employed. Valid subgroup estimates can be calculated only for subgroups defined on preprogram characteristics, such as gender or race or preprogram achievement levels or grades. In these cases, we can precisely identify and make comparisons to a comparable subgroup within the control group (against which the treatment subgroup may be compared).

Suspect Comparison 3: Comparing the outcomes of mentored youth with a control or matched comparison group when the sample attrition at the follow-up assessment is substantial or, worse yet, when there is differential attrition between the two groups. Once again, unless those who were assessed at posttest were just like the youth for whom one does not have posttest data, the impact estimates may be biased. Suppose youth from the most mobile, unstable households are the ones who could not be located. Comparing the “found” treatment and controls only provides information about the impact of the program on youth from stable homes, not all youth. This is an issue of generalizability (i.e., external validity; see Shadish et al. 2002).

Differential attrition between the treatment and the control (or participant and comparison) groups is important because
it also poses a threat to internal validity. Frequently, researchers are able to reassess a much higher fraction of program participants (many of whom may still be meeting with their mentors) than of the control or comparison group youth (whom no one has necessarily tracked on a regular basis). For example, if the control or comparison group youth demonstrate increased behavioral or academic problems over the sample period, parents may move their children to attend other schools and thus make data collection more difficult. Alternatively, some treatment families may have decided not to move out of the area because the children had good mentors. Under any of these scenarios, comparing the reassessed comparison group youth with reassessed mentees could be a comparison of unlike individuals.

Technically, any amount of attrition, even if it is equal across the two groups, puts the accuracy of the impact estimates into question. The treatment group youth who cannot be located may be fundamentally different from control group youth who cannot be located. For example, the control attriters might be the youth whose parents enroll them in new schools because they are not doing well, while the treatment attriters might be the youth whose parents moved. However, as long as one can show that the baseline characteristics of the two groups are similar, most readers will accept the hypothesis that the two groups of follow-up responders are still similar. Similarly, if the baseline characteristics of the attriters are the same as those of the responders, then we can be more confident that the attrition was simply random and that the impact on the responders is indicative of the impact on all youth.

Lesson: Comparisons of treatment (or participant) groups and control (or comparison) groups are completely valid only if the youth not included in the comparison are simply a random sample of those included. This assumption is easier to believe if the nonincluded individuals represent a small proportion of the total sample, the baseline characteristics of nonresponders are similar to those of responders and the proportions excluded are the same for the treatment and control groups.

Statistical Corrections for Biases

What if one wants to examine program impacts under these compromised situations, such as dealing with differential attrition or examining the impact of mentoring on youth whose matches have lasted more than a year? There are a variety of statistical methods to handle these biases. As long as the assumptions underlying these methods hold, then the resulting adjusted impact estimates should be unbiased.

Let’s start by restating the basic hypothesized model:

(6) Yfu = a + bM + cYbl + dX + u

The value of outcome Yfu is determined by its value at baseline (Ybl), whether the child got mentoring (M), a vector of baseline covariates that affect Y (X) and unmeasured factors (u). Suppose one has information on a group of mentees and a comparison group of youth matched on age, gender and school. Now suppose, however, the youth who actually get mentors differ from the comparison youth in that they are more likely to be firstborn. If firstborn youth do better on outcome Y (even controlling for the baseline level of Y) and one fails to control for this difference, the estimated impact coefficient (b) will be biased upward, picking up not only the effect of mentoring on Y but also the “firstborn-ness” of the mentees. The problem here is that M and u are correlated.

If one hypothesizes that the only way the participating youth differ from the average nonparticipating youth is on measurable characteristics (Z), for example, that they are more likely to be firstborn or to be Hispanic, then including these characteristics in the impact regression model, Equation 7, will fully remove the correlation between M and u, because M conditional on (i.e., controlling for) Z is not correlated with u. Thus, Equation 7 will produce an unbiased estimate of the impact (b):

(7) Yfu = a + bM + cYbl + dX + fZ + u

Including such extra covariates is a common technique. However if, as is usually the case, one suspects (or even could plausibly argue) that the mentored group is different in other ways that are correlated with outcomes and
are unmeasured, such as being more socially competent or from better-parented families, then the estimated coefficient still will be potentially biased.

Instrumental Variables or Two-Staged Least Squares

Using instrumental variables (IV), also called two-staged least squares regression (TSLS), is a statistical way to obtain unbiased (or consistent) impact estimates in this more complicated situation (see Stock and Watson 2003, Chapter 10).

Consider the factors influencing M (whether the child is a mentee):

(8) M = k + mZ + nX + v

where Z represents variables related to M that are unrelated to Y, X represents variables related to M that are related to Y and v is the random error.

Substituting Equation 8 into Equation 6 results in:

(9) Yfu = a + b(k + mZ + nX + v) + cYbl + dX + u

The problem is that v (the unmeasured elements related to participating in a mentoring program, such as having motivated parents) is correlated with u. This correlation will cause the regression to estimate a biased value for b. However, using instrumental variables, we are able to purge out v (the elements of M that are correlated with u) to get an unbiased estimate of the impact. Intuitively, this technique constructs a variable that is not M but is highly correlated with M and is not correlated with u (an “instrument”).

The first and most difficult step in using this approach is to identify variables that 1) are related to why a child is in the group being examined, such as being a mentee or a long-matched child, and 2) are not related to the outcome Y. These are very hard to think of, must be measured for both treatment and control youth, and need to be considered before data collection starts. Examples might include the youth’s interests, such as sports or outdoor activities, or how difficult it is for the mentor to drive to the child’s home. These variables would be related to the match “working” (i.e., having longer duration) but not related theoretically to the child’s grades or behaviors.

Then one estimates the following regression of M:

(10) M = k + mZ + nX + cYbl + w

where w is a random error. All of the covariates that will be included in the final impact Equation 7, X and Ybl, are included in the first-stage regression along with the instruments Z. A predicted value of M (M’ = k + mZ + nX + cYbl) is then computed for each sample member. The properties of regression ensure that M’ will be uncorrelated with the part of Yfu not accounted for by Ybl or X (i.e., u). M’ then is used in Equation 7 rather than M. The second stage of TSLS estimates Equation 7 and the corrected standard errors (see Stock and Watson 2003 for details). This technique works only if one has good predictive instruments. As a rule of thumb, the F-statistic for the Stage 1 regression should be at least 10 if the instruments are to be considered sufficiently predictive.

Baseline Predictions

Suspect Comparison 2 illustrates how any examination of groups defined by a program variable, such as having a long relationship or a cross-race match, is potentially plagued by the type of selection bias we have been discussing. Schochet et al. (2001) employed a remarkably clever nonstatistical technique for estimating the unbiased impact of a program in such a case. The researchers knew they wanted to compare the impacts on participants who would choose different versions of a program. However, because one could not know who among the control group would have chosen each program version, it appeared that one could not make a valid comparison. To get around this problem, they asked the intake workers who interviewed all applicants before random assignment (both treatments and controls) to predict which version of the program each youth would end up in if all were offered the program. The researchers then estimated the impact of Version A (and similarly B) by comparing the outcomes of treatment and control group members deemed to be “A-likely” by the intake workers. Note that they were not comparing the treatment youth who actually did Version A to the A-likely control youth, but rather comparing the A-likely treatments to the A-likely controls. Because the intake workers were quite accurate in their predictions, this technique is convincing. For mentoring programs, staff could similarly predict which youth would likely end up receiving mentors or which would probably experience long-term matches based on the information they gathered during the intake process and their knowledge of the program. This baseline (preprogram) characteristic then could be used to identify a valid comparison.
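The baseline-prediction comparison can be sketched in Python as follows. The records, field names and outcome values are all made up for illustration, and the function name is hypothetical. The key point the sketch tries to show is that subgroups are defined by the intake worker’s prediction, recorded before random assignment, and never by what the youth actually ended up doing.

```python
from statistics import mean

# Made-up records. 'likely_a' is the intake worker's prediction made before
# random assignment; 'group' is T (treatment) or C (control); 'y' is an
# illustrative follow-up outcome.
youth = [
    {"group": "T", "likely_a": True,  "y": 12.0},
    {"group": "T", "likely_a": True,  "y": 14.0},
    {"group": "C", "likely_a": True,  "y": 10.0},
    {"group": "C", "likely_a": True,  "y": 11.0},
    {"group": "T", "likely_a": False, "y": 9.0},
    {"group": "T", "likely_a": False, "y": 11.0},
    {"group": "C", "likely_a": False, "y": 9.0},
    {"group": "C", "likely_a": False, "y": 10.0},
]

def subgroup_impact(records, likely_a):
    """Difference in mean outcomes between treatments and controls who share
    the same baseline prediction."""
    treat = [r["y"] for r in records
             if r["group"] == "T" and r["likely_a"] == likely_a]
    control = [r["y"] for r in records
               if r["group"] == "C" and r["likely_a"] == likely_a]
    return mean(treat) - mean(control)

impact_a_likely = subgroup_impact(youth, likely_a=True)
impact_other = subgroup_impact(youth, likely_a=False)
print(impact_a_likely, impact_other)
```

Because the prediction is a preprogram characteristic measured the same way for both groups, each subgroup comparison keeps the balance that random assignment created. How informative it is still depends on how accurate the predictions turn out to be.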
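Returning to the two-stage procedure described under Instrumental Variables, the Python sketch below works through a deliberately simplified case: one instrument Z, no X or Ybl covariates, and the instrument left out of the second-stage equation. Those simplifications, the helper function and the data are all assumptions made for illustration. The data are constructed so that the true effect of M on Y is exactly 2 and the unmeasured factor u is uncorrelated with Z but correlated with M.

```python
from statistics import mean

def ols(x, y):
    """One-regressor least squares; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up data. Z is a hypothetical instrument (say, the mentor lives nearby),
# u is the unmeasured factor, M is whether the youth ends up mentored.
Z = [0, 0, 0, 1, 1, 1]
u = [-1, 0, 1, -1, 0, 1]
M = [1 if z + ui >= 1 else 0 for z, ui in zip(Z, u)]
Y = [2 * m + 3 * ui for m, ui in zip(M, u)]  # true impact of M is 2

# Naive regression of Y on M: biased, because M and u are correlated.
_, naive_b = ols(M, Y)

# Stage 1: regress M on the instrument and form the predicted value M'.
k, m_coef = ols(Z, M)
M_hat = [k + m_coef * z for z in Z]

# Stage 2: regress Y on M' instead of M.
_, tsls_b = ols(M_hat, Y)
print(naive_b, tsls_b)
```

In this constructed sample the naive coefficient is 6, three times the true impact, while the two-stage coefficient recovers 2. The sketch reports only the point estimate; running the second stage by hand like this does not give the corrected standard errors the text refers to, so a real analysis would rely on a statistical package’s TSLS routine.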
Future Directions

Synthesis

Good evaluations gauge a program’s impacts on a range of more to less ambitious outcomes that could realistically change over the period of observation given the likely program dosage; they assess outcomes using measures that are sensitive enough to detect the expected or policy-relevant change; and they use multiple measures and perspectives to assess an impact.

The crux of obtaining internally valid impact estimates is knowing what would have happened to the members of the treatment group had they not received mentors. Simple pre/post designs assume the participant would not have changed, that is, that the postprogram behavior would have been exactly what the preprogram behavior was without the program. This is a particularly poor assumption for youth. Experimental and quasi-experimental evaluations are more valid because they use the behavior of the comparison group to represent what would have happened (the counterfactual state).

The internal validity of an evaluation depends critically on the comparability of the treatment (or participant) and control (or comparison) groups. If one can make a plausible case that the two groups differ on a factor that also affects the outcomes, the estimated impact may be biased by this factor. Because random assignment (with sufficiently large samples) creates two groups that are statistically equivalent in all observable and unobservable characteristics, evaluations with this design are, in principle, superior to matched comparison group designs; matched comparison groups can, at best, assure comparability only on the important observable characteristics.

Evaluators using matched comparison groups must always worry about potential selection-bias problems; in practice, researchers conducting random assignment evaluations often run into selection-bias problems too by making comparisons that undermine the balanced nature of treatment and control groups. Numerous statistical techniques, such as the use of instrumental variables, have been developed to help researchers estimate unbiased program impacts. However, their use requires forethought at the data collection stage to ensure that one has the data needed to make the required statistical adjustments.

Recommendations for Research

Given the aforementioned issues, researchers evaluating mentoring programs should consider the following suggestions:

1. Design for disaster. Assume things will go wrong. Random assignment will be undermined. There will be differential attrition. The comparison group will not be perfectly matched. To guard against these problems, researchers should think deeply about how the two groups might differ if any of these problems were to arise, then collect data at baseline that could be used for matching or making statistical adjustments. It is also useful to give forethought to which program subgroups will be examined and to collect variables that could help predict these program statuses, such as the length of a match.

2. Gather implementation or process information. This information is necessary to understand one’s impact results: why the program had no effect or what type of program had the effects that were estimated. These data and data on program quality also can enable one to explore what about the program led to the change.

3. Use random assignment or match on motivational factors. Random assignment should be a researcher’s first choice, but if quasi-experimental methods must be used, researchers should try to match participant and comparison youth on some of the less
obvious factors. The more one can convince readers that the groups are equivalent on all the relevant variables, including some of the hard-to-measure factors, such as motivation or comfort with adults, the more credible the impact estimates will be.

Recommendations for Practice

Given the complexities of computing valid impact estimates, what should a program do to measure effectiveness?

1. Monitor key process variables or benchmarks. Walker and Grossman (1999) argued that not every program should conduct a rigorous impact study: It is a poor use of resources, given the cost of research and the relative skills of staff. However, programs should use data to improve their programming (see United Way of America’s Measuring Program Outcomes 1996 or the W. K. Kellogg Foundation Evaluation Handbook 2000). Grossman and Johnson (1999) recommended that mentoring programs track three key dimensions: youth and volunteer characteristics, match length, and quality benchmarks. More specifically, programs could track basic information about youth and volunteers: what types and numbers apply, and what types and numbers are matched. They could also track information about how long matches last, for example, the proportion making it to various benchmarks. Last, they could measure and track benchmarks, such as the quality of the relationship (Rhodes et al. 2005). This approach allows programs to measure factors that (a) can be tracked easily and (b) can provide insight about their possible impacts without collecting data on the counterfactual state. Pre/post changes can be a benchmark (but not an impact estimate), and one must be careful that the types of youth served and the general environment are stable. If the pre/post changes for cohorts of youth improve over time, for example, but the program now is serving less needy youth, the change in this benchmark tells little about the effectiveness of the program (the counterfactual states for the early and later cohorts differ).

2. Collaborate with local researchers to conduct impact studies periodically. When program staff feel it is time to conduct a more rigorous impact study, they should consider collaborating with local researchers. Given the time, skills and complexity entailed in conducting impact research, trained researchers can complete the task much more efficiently. An outside evaluation also may be believed more readily. Researchers, furthermore, can become a resource for improving the program’s ongoing monitoring system.
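The match-length benchmark mentioned in the first recommendation can be tallied with very little code. The Python sketch below is illustrative only: the function name is hypothetical, the 3-, 6- and 12-month cutoffs are assumed benchmarks rather than ones prescribed here, and the match lengths are made up.

```python
def share_reaching(match_lengths_months, benchmarks=(3, 6, 12)):
    """Proportion of matches lasting at least each benchmark length."""
    n = len(match_lengths_months)
    return {
        b: sum(1 for length in match_lengths_months if length >= b) / n
        for b in benchmarks
    }

# Made-up match lengths, in months, for one cohort of closed matches.
lengths = [2, 4, 7, 12, 15, 5, 9, 1]

shares = share_reaching(lengths)
print(shares)
```

Tracked cohort by cohort, these proportions serve as a monitoring benchmark. As the text cautions, they are not impact estimates, and changes over time are hard to interpret if the types of youth served have shifted.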
21. References
Bloom, H. S. Grossman, J. B. and J. E. Rhodes
1984 “Accounting for No-Shows in 2002 “The Test of Time: Predictors and
Experimental Evaluation Designs.” Effects of Duration in Youth Mentoring
Evaluation Review, 8, 225–246. Programs.” American Journal of Community
Psychology, 30, 199–206.
Branch, A. Y.
2002 Faith and Action: Implementation of the Grossman, J. B. and J. P. Tierney
National Faith-Based Initiative for High-Risk
Youth. Philadelphia: Branch Associates and
Public/Private Ventures.
Dennis, M. L.
1994 “Ethical and Practical Randomized Field
Experiments.” In J. S. Wholey, H. P. Hatry
and K. E. Newcomer, eds., Handbook of
Practical Program Evaluation. San Francisco:
Jossey-Bass, 155–197.
DuBois, D. L., B. E. Holloway, J. C. Valentine and
H. Cooper
2002 “Effectiveness of Mentoring Programs for
Youth: A Meta-Analytic Review.” American
Journal of Community Psychology, 30, 157–197.
DuBois, D. L., H. A. Neville, G. R. Parra and
A. O. Pugh-Lilly
2002 “Testing a New Model of Mentoring.”
In G. G. Noam, ed. in chief, and J. E.
Rhodes, ed., A Critical View of Youth
Mentoring (New Directions for Youth
Development: Theory, Research, and Practice,
No. 93, 21–57). San Francisco: Jossey-Bass.
DuBois, D. L. and M. J. Karcher, eds.
2005 Handbook of Youth Mentoring. Thousand
Oaks, CA: Sage Publications, Inc.
Dynarski, M., C. Pistorino, M. Moore,
T. Silva, J. Mullens, J. Deke et al.
2003 When Schools Stay Open Late: The National
Evaluation of the 21st Century Community
Learning Centers Program. Washington, DC:
US Department of Education.
Eccles, J. S., C. Midgley and T. F. Adler
1984 “Grade-Related Changes in School
Environment: Effects on Achievement
Motivation.” In J. G. Nicholls, ed., The
Development of Achievement Motivation.
Greenwich, CT: JAI Press, 285–331.
Grossman, J. B. and A. Johnson
1999 “Judging the Effectiveness of Mentoring
Programs.” In J. B. Grossman, ed.,
Contemporary Issues in Mentoring.
Philadelphia: Public/Private Ventures,
24–47.
Grossman, J. B. and J. P. Tierney
1998 “Does Mentoring Work? An Impact Study
of the Big Brothers Big Sisters Program.”
Evaluation Review, 22, 403–426.
Orr, L. L.
1999 Social Experiments: Evaluating Public
Programs with Experimental Methods.
Thousand Oaks, CA: Sage.
Rhodes, J., R. Reddy, J. Roffman and J. Grossman
2005 “Promoting Successful Youth Mentoring
Relationships: A Preliminary Screening
Questionnaire.” Journal of Primary
Prevention, 147–167.
Rosenbaum, P. R. and D. B. Rubin
1983 “The Central Role of the Propensity
Score in Observational Studies for Causal
Effects.” Biometrika, 70, 41–55.
Rosenberg, M.
1979 “Rosenberg Self-Esteem Scale.” In K.
Corcoran and J. Fischer (2000). Measures
for Clinical Practice: A Sourcebook (3rd ed.).
New York: Free Press, 610–611.
Rossi, P. H., H. E. Freeman and M. W. Lipsey
1999 Evaluation: A Systematic Approach (6th
edition). Thousand Oaks, CA: Sage.
Rubin, D. B.
1997 “Estimating Causal Effects from Large
Data Sets Using Propensity Scores.” Annals
of Internal Medicine, 127, 757–763.
Schochet, P., J. Burghardt and S. Glazerman
2001 National Job Corps Study: The Impacts
of Job Corps on Participants’ Employment
and Related Outcomes. Princeton, NJ:
Mathematica Policy Research.
Shadish, W. R., T. D. Cook and D. T. Campbell
2002 Experimental and Quasi-Experimental Designs
for Generalized Causal Inference. Boston:
Houghton Mifflin.
Stock, J. H. and M. W. Watson
2003 Introduction to Econometrics. Boston:
Addison-Wesley.
Tierney, J. P., J. B. Grossman and N. L. Resch
1995 Making a Difference: An Impact Study of Big
Brothers/Big Sisters. Philadelphia: Public/
Private Ventures.
United Way of America
1996 Measuring Program Outcomes. Arlington, VA:
United Way of America.
Walker, G. and J. B. Grossman
1999 “Philanthropy and Outcomes: Dilemmas
in the Quest for Accountability.” In C. T.
Clotfelter and T. Ehrlich, eds., Philanthropy
and the Nonprofit Sector in a Changing
America. Bloomington: Indiana University
Press, 449–460.
Weiss, C. H.
1998 Evaluation. Upper Saddle River, NJ:
Prentice Hall.
W. K. Kellogg Foundation
2000 W.K. Kellogg Foundation Evaluation
Handbook. Battle Creek, MI: W. K. Kellogg
Foundation.
Public/Private Ventures
2000 Market Street, Suite 600
Philadelphia, PA 19103
Tel: (215) 557-4400
Fax: (215) 557-4469
New York Office
The Chanin Building
122 East 42nd Street, 42nd Floor
New York, NY 10168
Tel: (212) 822-2400
Fax: (212) 949-0439
California Office
Lake Merritt Plaza, Suite 1550
1999 Harrison Street
Oakland, CA 94612
Tel: (510) 273-4600
Fax: (510) 273-4619
www.ppv.org