Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Handout #9 - QIAMay4


Published on

Evaluating Mentoring Programs, P/PV Brief

Published in: Education, Technology
  • Be the first to comment

  • Be the first to like this

Handout #9 - QIAMay4

  1. 1. Public/Private Ventures BriefEvaluating Mentoring Programs Jean Baldwin Grossman September 2009
  2. 2. Public/Private Ventures Board of Directors Research Advisory is a national leader in creating and strength- Committee ening programs that Matthew T. McGuire, Chair Jacquelynne S. Eccles, Chairimprove lives in low-income communities. We Principal University of Michigando this in three ways: Origami Capital Partners, LLC Robert Granger Yvonne Chan William T. Grant Foundationinnovation Principal Robinson HollisterWe work with leaders in the field to identify Vaughn Learning Center Swarthmore Collegepromising existing programs or develop new The Honorable Renée Reed Larsonones. Cardwell Hughes University of Illinois Judge, Court of Common Pleas Jean E. Rhodesresearch The First Judicial District, University of Massachusetts,We rigorously evaluate these programs to Philadelphia, PA Bostondetermine what is effective and what is not. Christine L. James-Brown Thomas Weisner President and CEO UCLAaction Child WelfareWe reproduce model programs in new League of Americalocations, provide technical assistance Robert J. LaLonde Professorwhere needed and inform policymakers and The University of Chicagopractitioners about what works. John A. Mayer, Jr.P/PV is a 501(c)(3) nonprofit, nonpartisan Retired, Chief Financial Officer J. P. Morgan & Co.organization with offices in Philadelphia, New Anne Hodges MorganYork City and Oakland. For more information, Consultant to Foundationsplease visit Siobhan Nicolau President Hispanic Policy Development Project Marion Pines Senior Fellow Institute for Policy Studies Johns Hopkins University Clayton S. Rose Senior Lecturer Harvard Business School Cay Stratton Special Adviser UK Commission for Employment and Skills Sudhir Venkatesh William B. Ransford Professor of Sociology Columbia University William Julius Wilson Lewis P. and Linda L. Geyser University Professor Harvard University2
  3. 3. AcknowledgmentsThis brief is a revised version of a chapter written for the Handbook of Youth Mentoring edited byDavid DuBois and Michael Karcher (2005). David and Michael provided many useful commentson the earlier version. Laura Johnson and Chelsea Farley of Public/Private Ventures helpedrevise the chapter to make it more accessible to non-mentoring specialists and provided greateditorial advice.Additional reference: DuBois, D. L. and M. J. Karcher, eds. 2005. Handbook of Youth Mentoring.Thousand Oaks, CA: Sage Publications, Inc.3
  4. 4. IntroductionQuestions about mentoring abound. This article presents discussions of many issuesMentoring programs around the country that arise in answering both implementationare being asked by their funders and boards, or process questions and impact questions.“Does this mentoring program work?” Process questions are important to addressPolicymakers ask, “Does this particular type even if a researcher is interested only inof mentoring—be it school-based or group or impacts, because one should not ask, “Does itemail—work?” These are questions about pro- work?” unless “it” actually occurred. The firstgram impacts. Researchers and operators also section covers how one chooses appropriatewant to know about the program’s processes: process and impact measures. The next sec-What about mentoring makes it work? How tion discusses several impact design issues,long should a match last to be effective? How including the inadequacies of simple pre/frequently should matches meet? Does the post designs, the importance of a good com-level of training, support or supervision of parison group and several ways to constructthe match matter? Does parental involvement comparison groups. The last section discussesor communication matter? What types of common mistakes made when analyzinginteractions between youth and mentors lead evaluation data and presents ways to avoidto positive changes in the child? Then there them. For a more complete discussion ofare questions about the populations served evaluation in general, readers are referred toand what practices are most effective: Are par- Rossi et al. (1999); Shadish et al. (2002); andticular types of youth more affected by men- Weiss (1998). Due to space constraints, issuestoring than others? Are mentors with specific entailed in answering mediational questionscharacteristics, such as being older or more are not addressed here.educated, more effective than other mentorsor more effective with particular subgroupsof youth? Finally, researchers in particular areinterested in the theoretical underpinningof mentoring. For example, to what degreedoes mentoring work by changing children’sbeliefs about themselves (such as boostingself-esteem or self-efficacy), by shaping theirvalues (such as their views about educationand the future) or by improving their socialand/or cognitive skills?4
  5. 5. Measurement IssuesA useful guide in deciding what to measure is mentoring programs have more detaileda program’s logic model or theory of change: ideas, such as wanting participants to experi-the set of hypothesized links between the pro- ence specific program elements (academicgram’s action, participants’ response and the support, for example, or peer interaction). Ifdesired outcomes. As Weiss states, with such these are critical components of the programa theory in hand, “The evaluation can trace theory, they also make good candidates forthe unfolding of the assumptions” (1998, 58). process measures.Rhodes et al. (2005) presents one possibletheory of change for mentoring: Process mea- A second level of process question concernssures describe the program’s actions; outcome the quality of the components: How goodmeasures describe what effects the program has. are the relationships? Are the training and supervision useful? These are more difficult dimensions to measure. Client satisfactionProcess Measures measures, such as how much youth like theirThe first question when examining a pro- mentors or how useful the mentors feel thegram is: What exactly is the program as training is, are one gauge of quality. However,experienced by participants? The effect the clients’ assessment of quality may not beprogram will have on participants depends accurate; as many teachers say, the most enjoy-on the realities of the program, not on its able class may not be the class that promotesofficial description. All too frequently in the most learning. Testing mentors beforementoring programs, relatively few strong and after training is an alternative qualityrelationships form and matched pairs stop measure. Assessing the quality of mentoringmeeting. Process questions can be answered, relationships is a relatively unexplored area.however, at several levels. Most basically, Grossman and Johnson (1999) and Rhodes etone wants to know: Did the program recruit al. (2005) propose some measures.appropriate youth and adults? Did adults andyouth meet as planned? Did all the compo- From a program operator’s or funder’s per-nents of the program happen? Were mentors spective, how much process informationtrained and supervised as expected? is “enough” depends on striking a balance between knowing exactly what is happeningTo address these questions, one examines in the program versus recognizing the servicethe characteristics and experiences of the the staff could have provided in lieu of collect-participants, mentors and the match, and ing data. Researchers should assess enoughcompares them with the program’s expecta- implementation data to be sure the programtions. For example, a mentoring program is actually delivering the services it purports totargeting youth involved in criminal or vio- offer at a level and quality consistent with hav-lent activity tracked the number of arrests of ing a detectable impact before spending thenew participants to determine whether they time and money to collect data on outcomes.were serving their desired target populations Even if no impact is expected, it is essential to(Branch 2002). A high school mentoring know exactly what did or did not happen toprogram for struggling students tracked the the participants to understand one’s findings.GPAs of enrolled youth (Grossman, Johnson Thus, researchers may want to collect more1999). Two match characteristics commonly process data than typically would be collectedexamined are the average completed length by operators to improve both the quality ofof the relationship and the average frequency their generalizations and their ability to linkof interaction. Like all good process mea- impacts to variation in participants’ experi-sures, they relate to the program’s theory. To ences of core elements of the affected, a participant must experience asufficient dosage of the intervention. Some5
  6. 6. Lesson: Tracking process measures is impor- the false impression that the program is a fail-tant to program managers but essential for ure, when in fact the impacts on the chosenevaluators. Before embarking on an evalua- variables may not yet have emerged.tion of impacts, be sure the program is deliv-ering its services at a quality and intensity that A good technique for selecting variables iswould lead one to expect impacts. to choose a range of proximal to more distal expected impacts based on the program’s the- ory of change, which also represents a set ofOutcome Measures impacts ranging from modestly to impressivelyAn early task for an impact evaluator is to effective (Weiss 1998). Unfortunately, one can-refine the “Does it work?” question into a not know a priori how long matches will lastset of testable evaluation questions. These or how often the individuals will meet. Thus,questions need to specify a set of outcome it is wise to include some outcomes that arevariables that will be examined during the likely to change even with rather limited expo-evaluation. There are two criteria for a good sure to the intervention, and some outcomesoutcome measure (Rossi et al. 1999). First, that would change with greater exposure,the outcome can be realistically expected to thus setting multiple “bars.” The most basicchange during the study period given the effectiveness goal is an outcome that everyoneintensity of the intervention. Second, the out- agrees should be achievable. From there, onecome is measurable and the chosen measure can identify more ambitious outcomes.sensitive enough to detect the likely change. Public/Private Ventures’ evaluation of BigEvaluation questions are not program goals. Brothers Big Sisters (BBBS) provides a goodMany programs rightly have lofty inspira- example of this process (Grossman andtional goals, such as enabling all participants Tierney 1998). Researchers conducted a thor-to excel academically or to become self- ough review of BBBS’s manual of standardssufficient, responsible citizens. However, a and practices to understand the program’sgood evaluation outcome must be concrete, logic model and then, by working closely withmeasurable and likely to change enough dur- staff from the national office and local agen-ing the study period to be detected. Thus, for cies, generated multiple outcome bars. Theexample, achieving a goal like “helping youth national manual lists four “common” goalsacademically excel” could be gauged by exam- for a Little Brother or Little Sister: providingining students’ grades or test scores. social, cultural and recreational enrichment; improving peer relationships; improving self-In addition, when choosing the specific set of concept; and improving motivation, attitudeoutcomes that will indicate a goal such as “aca- and achievement related to schoolwork.demically excelling,” one must consider which Conversations with BBBS staff also suggestedof the possible variables are likely to change that having a Big Brother or Big Sister couldgiven the program dosage participants will reduce the incidence of antisocial behav-probably receive during the evaluation period. iors such as drug and alcohol use and couldFor example, researchers often have found improve a Little Brother’s or Little Sister’sthat reading and math achievement test scores relationship with his or her parent(s). Usingchange less quickly than do reading or math previous research, the hypothesized impactsgrades, which, in turn, change less quickly were ordered from proximal to distal as fol-than school effort. Thus, if one is evaluating lows: increased opportunities for social andthe school-year (i.e., nine months) impact of cultural enrichment, improved self-concept,a school-based mentoring program, one is better relationships with family and friends,likely to want to examine effort and grades improved academic outcomes and reducedrather than test scores, or at least in addition antisocial test scores. Considerable care and thoughtneed to go into deciding what outcomes data At a minimum, the mentoring experience wasshould be collected and when. Examining expected to enrich the cultural and social lifeimpacts on outcomes that are unlikely to of youth, even though many more impactschange during the evaluation period can give were anticipated. Because motivational psy- chology research shows that attitudes often change before behaviors, the next set of6
  7. 7. outcomes reflected attitudinal changes toward Information from each source has advantagesthemselves and others. The “harder” academic and disadvantages. For example, for someand antisocial outcomes then were specified. variables, such as attitudes or beliefs, the youthWithin these outcomes, researchers also may be the only individual who can providehypothesized a range of impacts, from attitu- valid information. Youth, for example, arguablydinal variables, such as the child’s perceived are uniquely qualified to report on constructssense of academic efficacy and value placed such as their self-esteem (outcome measures)on education, to some intermediate behav- or considerations such as how much they likeioral changes, such as school attendance and their mentors or whether they think their men-being sent to the principal’s office, to changes tors support and care for them (process mea-in grades, drug and alcohol use, and fighting. sures). Theoretically, what may be important is not what support the mentor actually gives butOnce outcomes are identified, the next ques- how supportive the youth perceives the mentortion is how to measure them. Two of the most to be (DuBois et al. 2002).important criteria for choosing a measure arewhether the measure captures the exact facet On the other hand, youth-reported data mayof the outcome that the program is expected be biased. First, youth may be more likely toto affect and whether it is sensitive enough to give socially desirable answers—recountingpick up small changes. For example, an aca- higher grades or less antisocial behavior.demically focused mentoring program that If this bias is different for mentored versusclaims to increase the self-esteem of youth may nonmentored youth, impact estimates basedhelp youth feel more academically competent on these variables could be biased. Second,but not improve their general feelings of self- the feelings of youth toward their mentorsworth. Thus, one would want to use a scale may taint their reporting. For example, if thetargeting academic self-worth or competence youth does not like the mentor’s style, he orrather than a global self-worth scale—or select she may selectively report or overreport cer-a scale that can measure both. The second tain negative experiences, such as the mentorconsideration is the measure’s degree of sen- missing meetings, and underreport others ofsitivity. Some measures are extremely good at a more positive nature, such as the amount ofsorting a population or identifying a subgroup time the mentor spends providing help within need of help but poor in detecting the small schoolwork. Similarly, the youth may overstatechanges that typically result from programs. a mentor’s performance to make the mentorFor example, in this author’s experience, the look good. Last, the younger the child is, theRosenberg self-esteem scale (1979) is useful in less reliable or subtle the self-report. For thisdistinguishing adolescents with high and low reason, when participants are quite young (8self-esteem but often is not sensitive enough to or 9 years old), it is advisable to collect infor-detect the small changes in self-esteem induced mation from their parents and/or most youth programs. On the other hand,measures of academic or social competency The mentor often can be a good source ofbeliefs (Eccles et al. 1984) can detect relatively information about what the mentoring expe-small changes. rience is like, such as what the mentor and mentee do and talk about (process measures),Lesson: Choose outcomes that are integrally and as a reporter on the child’s behaviors atlinked to the program’s theory of change, that posttest (outcome measures). The main prob-establish multiple “effectiveness bars,” that are lem with mentor reporting is that mentorsgauged with sensitive measures and that can be have an incentive to report positively on theirachieved within the evaluation’s time frame and relationships with youth and to see effectsin the context of the program’s implementation. even if there are none, justifying why they are spending time with the child. Although there may be a positive bias, this does not precludeChoosing Informants mentors’ being accurate in reporting rela-Another issue to be resolved for either process tive impacts. This is because most mentors door outcome measures is from whom to collect not report that their mentees have improvedinformation. For mentoring programs, the equally in all areas. The pattern of differ-candidates are usually the youth, the mentor, ence in these reports, especially if consistenta parent, teachers and school records.7
  8. 8. with those obtained from other sources, suchas school records, may provide useful infor-mation about the true impacts.Parents also can be useful as reporters. Theymay notice that the child is trying harder inschool, for example, even though the childmight not notice the change. However, likethe mentor, parents may project changes thatthey wish were happening or be unaware ofcertain behaviors (e.g., substance use).Finally, teachers may be good reporters onthe behaviors of their students during theschool day. Teachers who are familiar withage-appropriate behavior, for example, mayspot a problem when a parent or mentor doesnot. However, teachers are extraordinarilybusy, and it can be difficult for them to findthe time to fill out evaluation forms on theparticipants. In addition, teachers too are notimmune to seeing what they want to see, andas with mentors and parents, the caveat aboutrelative impacts applies here.Information also can be collected fromrecords. Data about the occurrence of spe-cific events—fights, cut classes, principalvisits—are less susceptible to bias, unless thesources of these data (e.g., school adminis-trators making discipline decisions) differ-entially judge or report events for mentoredyouth versus other youth.Lesson: Each respondent has a unique pointof view, but all are susceptible to reportingwhat they wish had happened. Thus, if timeand money allow, it is advantageous to exam-ine multiple perspectives on an outcome andtriangulate on the impacts. What is importantis to see a consistent pattern of impacts (notuniform consistency among the respondents).The more consistency there is, the morecertain one can be that a particular impactoccurred. For example, if the youth, parentand teacher data all indicate school improve-ment and test scores also increase, this wouldbe particularly strong evidence of academicgains. Conversely, if only one of these mea-sures exhibits change (e.g., parent reports), itcould be just a spurious finding.8
  9. 9. Design IssuesAnswering the questions “Does mentoring in the absence of the program, one mustwork?” and “For whom?” may seem relatively identify another group of youth, namely astraightforward—achievable simply by observ- comparison group, whose behavior will rep-ing the changes in mentees’ outcomes. But resent what the participants’ behavior wouldthese ostensibly simple questions are harder have been without the program. Choosingto answer than one might assume. a group whose behavior accurately depicts this hypothetical no-treatment (or coun- terfactual) state is the crux of getting theThe Fallacy of Pre/Post Comparisons right answer to the effectiveness question,The changes we observe in the attitudes, because a program’s impacts are ascertainedbehaviors or skills of youth while they are by comparing the behavior of the treatmentbeing mentored are not equivalent to program or participant group with that of the selectedimpacts. How can that be? The answer has to comparison with what statisticians call internal valid-ity. Consider, again, the previously describedBBBS evaluation. If one looks only at changes Matched Comparison or Control Groupin outcomes for treatment youth (Grossman, ConstructionJohnson 1999), one finds that 18 months after There are two principal types of comparisonthey applied to the program, 7 percent had groups: control groups generated throughreported starting to use drugs. On the face of random assignment and matched comparisonit, it appears that the program was ineffective; groups selected judgmentally by the researcher.however, during the same period, 11 percent ofthe controls had reported starting to use drugs. Experimental Control GroupsThus, rather than being ineffective, this statisti-cally significant difference indicates that BBBS Random assignment is the best way to createwas able to stem some of the naturally occur- two groups that would change comparablyring increases in drug use. over time. In this type of evaluation, eligible individuals are assigned randomly, either toThe critical distinction here is the difference the control group and not allowed into thebetween outcomes and impacts. In evalua- program, or to the treatment group, whosetion, an outcome is the value of any variable members are offered the program. (Note: “Treatments” and “controls” refer to randomlymeasured after the intervention, such as selected groups of individuals. Not all treat-grades. An impact is the difference between ments may choose to participate. The termthe outcome observed and what it would “participants” is used to refer to individualshave been in the absence of the program who actually receive the program.)(Rossi et al. 1999); in other words, it is thechange in the outcome that was caused by The principal advantage of random assign-the program. Simple changes in outcomes ment is that given large enough groups,may in part reflect the program’s impact on average, the two groups are statisticallybut also might reflect other factors, such as equivalent with respect to all characteristics,changes due to maturation. observed and unobserved, at the time the two groups are formed. If nothing were doneLesson: A program’s impact can be gauged to either group, their behaviors, on average,accurately (i.e., be internally valid) only if would continue to be statistically equivalentone knows what would have happened to at any point in the future. Thus, if after thethe participants had they not been in the intervention the average behavior of the twoprogram. This hypothetical state is called groups differs, the difference can be confi-the “counterfactual.” Because one cannot dently and causally linked to the program,observe what the mentees would have done9
  10. 10. which was the only systematic difference Matched (or Quasi-Experimental)between the two groups. See Orr (1999) for a Comparison Groupsdiscussion of how large each group should be. Random assignment is not always possible. For example, programs may be too small or staffAlthough random assignment affords the may refuse to participate in such an evalu-most scientifically reliable way of creating two ation. When this is the case, researchers mustcomparable groups, there are many issues that identify a group of nonparticipant youth whoseshould be considered before using it. Two of outcomes credibly represent what would havethe most difficult are “Can random assign- happened to the participants in the absence ofment be inserted into the program’s normal the program. The weakness of the methodol-process without qualitatively changing the pro- ogy is that the outcomes of the two groups cangram?” and “Is it ethical to deny certain youth differ not only because one group got a men-a mentor?” However, it is worth noting that tor and the other did not but also because ofall programs ration their services, primarily other differences between the groups. To gen-by not advertising to more people than they erate internally valid estimates of the program’scan serve. Random assignment gives all needy impacts, one must control for the “other dif-children an equal probability of being served, ferences” either through statistical proceduresrather than denying children who need a such as regression analysis and/or throughmentor by not telling them about the pro- careful matching.gram. The reader is referred to Dennis (1994)for a detailed discussion of the ethical issues The researcher selects a comparison groupinvolved in random assignment. of youth who are as similar as possible to the participant group across all the importantWith respect to the first issue, consider first characteristics that may influence outcomeshow the insertion of random assignment into in the counterfactual state (the hypotheti-the intake process affects the program. One cal no-treatment state). Some key charac-of the misconceptions about random assign- teristics are relatively easy to identify andment among mentoring staff is that it means match for (e.g., age, race, gender or familyrandomly pairing youth with adults. This is not structure). However, to improve the cred-the case. Random pairing would fundamentally ibility of a matched comparison group, onechange the program, and any evaluation of needs to think deeply about other potentialthis altered program would not provide infor- differences that could affect the outcome dif-mation on the effect of the actual program. A ferential, such as whether one group of youthvalid use of random assignment would entail comes from families that care enough andrandomly dividing eligible applicants between are competent enough to search out servicesthe treatment and control groups, then pro- for their youth, or how comfortable the youthcessing the treatment group youth just as they are with adults. These critical yet hard-to-normally would be handled and matched. measure variables are factors that are likely toUnder this design, random assignment affects systematically differ between participant andonly which youth files come across the staff’s comparison group youth and to substantiallydesk for matching, not what happens to youth affect one or more of the outcomes beingonce they are there. Another valid test would examined. The more readers of an evaluationinvolve identifying two youth for every volun- can think of such variables that have not beenteer, then randomly assigning one child to the accounted for, the less they will believe thetreatment group and one to the control group. resulting program impact estimates.For the BBBS evaluation, we used the formermethod because it was significantly less bur- Consider, for example, an email mentor-densome and emotionally more acceptable for ing program. Not only would one want thethe staff. However, the chosen design meant comparison group to match the participantthat not all treatment youth actually received group on demographic characteristics—agea mentor. As will be discussed later, only (say, 12, 13, 14 or 15 years old), gender (male,about three quarters of the youth who were female), race (white, Hispanic, black) andrandomized into the treatment group and income (poor, nonpoor)—but one mightoffered the program actually received men- also want to match the two groups on theirtors. (See Orr 1999 for a rich discussion of preprogram use of the computer, such asall aspects of random assignment.) the average number of hours per week spent10
  11. 11. using email or playing computer games. To probabilities are calculated for both par-match on this variable, however, one would ticipants and all potential nonparticipants.have to collect computer use data on many Each participant then is matched with one ornonparticipant youth to find those most com- more nonparticipant youth based on theseparable to the participants. predicted propensity scores. For example, for each participant, the nonparticipant with theWhen one has more than a few matching vari- closest predicted participation probability canables, the number of cells becomes too numer- be selected into the comparison group. (Seeous. In the above example, we would have Shadish et al. 2002, 161–165, for further dis-4 age × 2 gender × 3 race × 2 income, or 48 cussion of PSM, and Dynarski et al. 2003 forcells, even before splitting by computer use. A an application in a school-based setting.)method that is used with increasing frequencyto address this issue is propensity score match- An implication of this technique is that oneing (PSM). A propensity score is the probability needs data for the propensity score logit fromof being a participant given a set of known fac- both the participant group and a large pool oftors. In simple random assignment evaluations, nonparticipant youth who will be consideredthe propensity score of every sample member is for inclusion in the comparison group. The50 percent, regardless of his or her characteris- larger the considered nonparticipant pool is,tics. In the real world, without random assign- the more likely it is that one can find a closement, the probability of being a participant propensity score match for each participant.depends on the individual’s characteristics, This data requirement often pushes research-such as his or her comfort with computers in ers to select matching factors that are readilythe example above. Thus, participants and available through records rather than incurnonparticipants naturally differ with regard to the expense of collecting new data.many characteristics. PSM can help researchersselect which nonparticipants best match the One weakness of this method is that althoughparticipant group with respect to a weighted the propensity to participate will be quite simi-average of all these characteristics (where the lar for the participant and comparison groups,weights reflect how important the factors are in the percentage with a particular characteris-making the individual a participant). tic (such as male) may not be, because PSM matches on a linear combination of character-To calculate these weights, the researcher istics, not each characteristic one by one. Toestimates, across both the participant and overcome this weakness, most studies matchnonparticipant samples, a logistic model of propensity scores within a few demographi-the probability of being a participant (Pi) cally defined cells (such as race/gender).as a function of the matching variables andall other factors that are hypothesized to be PSM also balances the two groups only on therelated to participation (Rosenbaum, Rubin factors that went into the propensity score1983; Rubin 1997). For example, if one were regression. For example, the PSM in Dynarskievaluating a school-based mentoring program, et al. (2003) was based on data gathered fromthe equation might include age, gender, race, 21,000 students to generate a comparisonhousehold status (HH) and reduced-price- group for their approximately 2,500 partici-lunch status (RL), as well as past academic pants. However, when data were collected later(GPA) and behavior (BEH) assessments, as is on parents, it turned out that comparisonshown in Equation 1 below. Obtaining teacher group students were from higher-income fami-ratings of the youth’s interpersonal skills lies. No matter how carefully a comparison(SOC) also would help match on the youth’s group is constructed, one can never know forability to form a relationship. sure how similar this group is to the participant group on unmeasured characteristics, such as(1) Pi = f(age, gender, race, HH, RL, GPA, BEH, SOC) their ability to respond to adult guidance.The next step of PSM is to compute for eachpotential member of the sample the prob-ability of participation based on the matchingcharacteristics in the regression. Predicted11
  12. 12. Lesson: How much a reader trusts the internalvalidity of an evaluation depends on how muchhe or she trusts that the comparison grouptruly is similar to the participant group on allimportant dimensions. This level of trust orconfidence is quantifiable in random assign-ment designs (e.g., one is 95 percent confidentthat the two groups are statistically equivalent),whereas with a quasi-experimental design, thislevel of trust is uncertain and unquantifiable.12
  13. 13. AnalysisThis section covers how impact estimates are (unmeasured factors). Another way to thinkderived, from the simplest techniques to more of b is that it is basically the difference in thestatistically sophisticated ones. Several com- mean Ys, adjusting for differences in Xs.monly committed errors and techniques usedto overcome these problems are presented. When data are from a quasi-experimental evaluation, it is always best to estimate impacts using regression or analysis of covariance;The Basics of Impact Estimates not only does one get more precise estimates,Impact estimates for both experimental and but one can control for any differences thatquasi-experimental evaluation are basically do arise between the participant and thedetermined by contrasting the outcomes of comparison groups. Regression simulatesthe participant or treatment group with those what outcomes youth who were exactly likeof the control or comparison group. If one participants on all the included characteris-has data from a random assignment design, tics (the Xs) would have had if they had notthe simplest unbiased impact estimate is the received a mentor, assuming that all factorsdifference in mean follow-up (or posttest) out- that jointly affect participation and outcomescomes for the treatment and control groups, are included in the regression. Regressionsas in Equation 2, are also useful in randomized experiments for estimating impacts more precisely.(2) b = Mean(Yfu,T) − Mean(Yfu,C)where b is the estimated impact of the pro- Suspicious Comparisonsgram, Yfu,T is the value of outcome Y at post- The coefficient b from Equation 3 is an unbi-test or follow-up for the treatment group ased estimate of the program’s impact (i.e.,youth, and Yfu,C is the value of outcome Y at the estimate differs from the true impactposttest or follow-up for the control group only by a random error with mean of zero)youth. One can increase the precision of the as long as the two groups are identical on allimpact estimate by calculating the change- characteristics (both included and excludedscore or difference-in-difference estimator as variables). The key to obtaining an unbiasedin Equation 3, estimate of the impact is to ensure that one compares groups of youth that are as similar(3) b = Mean(Yfu,T − Ybl,T) – Mean(Yfu,C − Ybl,C) as possible on all the important observable and unobservable characteristics that influ-where Ybl,T is the value of outcome Y at base- ence outcomes. Although many researchersline for the treatment group youth, and Ybl,C understand the need for comparability andis the value of outcome Y at baseline for the indeed think a lot about it when construct-control group youth. ing a matched comparison group, this pro- found insight is often forgotten in the analysisEven more precision can be gained if the phase, when the final comparisons are made.researcher controls for other covariate factors Most notably, if one omits youth from eitherthat affect the outcome through the use of group—the randomly selected treatment (orregression, as in Equation 4, self-selected participant) group or the ran- domly selected control (or matched compari-(4) Yfu = a + bT + cYbl + dX + u son) group—the resulting impact estimate is potentially biased. Following is a list of com-where b is the estimated impact of the pro- monly seen yet flawed comparisons related togram, T is a dummy variable equal to 1 for this concern.treatments and 0 for controls, and X is a vec-tor of baseline covariates that affect Y and u13
  14. 14. Suspect Comparison 1: Comparing groups of On the other hand, the estimate based on allyouth based on their match status, such as compar- treatments and all controls, called the “intent-ing those who received a mentor or youth whose to-treat effect,” is unaffected by this bias.matches lasted at least one month with the controlor comparison group. Suppose, as occurred Because the intent-to-treat estimate is basedin the Public/Private Ventures evaluation on the outcomes of all of the treatment youth,of BBBS’s community-based mentoring whether or not they received the program,program, only 75 percent of the treatment it may underestimate the “impact on thegroup actually received mentors (Grossman, treated” (i.e., the effect of actually receivingTierney 1998). Can one compare the out- the treatment). A common way to calculatecomes of the 75 percent who were mentees the “impact on the treated” is to divide thewith the controls to get an unbiased estimate intent-to-treat estimate by the proportionof the program’s impact? No. All the impact of youth actually receiving the programestimates must be based on comparisons (Bloom 1984). The intent-to-treat estimate is abetween the entire treatment group and the weighted average of the impact on the treatedentire control group to maintain the com- youth (ap) and the impact on the untreatedplete comparability of the two groups. (This youth (anp), as shown in Equation 5,estimate often is referred to as the impact ofthe “intent to treat.”) (5) Mean(T) − Mean(C) = a = p ap + (1 −p) anpThere are undoubtedly factors that are sys- where p = proportion treated.tematically different between youth who formmentoring relationships and those who do If the effect of group assignment on thenot. The latter youth may be more difficult untreated youth (anp) is zero (i.e., untreatedtemperamentally, or their families may have treatment individuals are neither hurtdecided they really did not want mentors nor helped), then a is to equal a/p. Let’sand withdrew from the program. If research- again take the example of the BBBS evalu-ers remove these unmatched youth from the ation. Recall that 18 months after randomtreatment group but do nothing with the assignment, 7 percent of the treatmentcontrol group, they could be comparing the group youth (the treated and untreated)“better” treatment youth with the “average” had started using drugs, compared withcontrol group child, biasing the impact esti- 11 percent of the control group youth, amates. Randomization ensures that the treat- 4-percentage-point reduction. Using thement and control groups are equivalent (i.e., knowledge that only 75 percent of the youththere are just as many “better” youth in the actually received mentors, the “impact oncontrol group as the treatment group). After the treated” of starting to use drugs wouldthe intervention, matched youth are read- increase from a 4-percentage-point reductionily identified. Researchers, however, cannot to a 5.3-percentage-point reduction (= 4/.75).identify the control group youth who wouldhave been matched successfully had they been Similar bias occurs if one removes con-given the opportunity. Thus, if one discarded trol group members from the comparison.the unmatched treatment youth, implicitly Reconsider the school-based mentoringone is comparing successfully matched youth example described above, where treatmentto a mixed group—those for whom a match youth are offered mentors and control youthwould have been found (had they been offered are denied mentors for one year. Supposeparticipation) and those for whom matches that although most youth participate for onlywould not be found (who are perhaps harder a year, some continue their matches into ato serve). An impact estimate based on such a second school year. To gauge the impact ofcomparison has the potential to bias the esti- this longer intervention, the evaluator mightmate in favor of the program’s effectiveness. (incorrectly) consider comparing youth who(The selection bias embedded in matching is had mentors for two years with control youththe reason researchers might choose to com- who were not matched after their one-yearpare the outcomes of a matched comparison denial period. This comparison has severalgroup with the outcomes of mentoring pro- problems. Youth who were able to sustaingram applicants, rather than participants.) their relationships into a second year, for example, would likely be better able to relate14
  15. 15. to adults and perhaps more malleable to a with negative outcomes disappear, whilementoring intervention than the “average” the indications of positive effects of longeroriginally matched comparison group mem- matches remained.ber. An unbiased way to examine these pro-gram impacts would be to compare groups A similar problem occurs when comparingthat were assigned randomly at the beginning youth with close relationships with those withof the evaluation: one group being offered the weaker relationships. For the straightforwardpossibility of a two-year match and the other comparison to be valid, one is implicitlybeing denied the program for two years. To assuming that youth who ended up with closeinvestigate both one- and two-year versions of relationships with their mentors would have,the program, applicants would need to be ran- in the absence of the program, fared equallydomized into one of three groups: one group well or poorly as youth who did not end upoffered the possibility of a two-year match, with close relationships. If those with closerone group offered the possibility of a one-year relationships would have, without the pro-match and one group denied the program for gram, been better able to secure adult atten-the full two years. tion than the other youth and done better because of it, for example, then a comparisonLesson: The only absolutely unbiased estimate of the close-relationship youth with eitherfrom a random assignment evaluation of a youth in less-close relationships or with thementoring program is based on the compari- control/matched comparison group couldson of all treatments and all controls, not just be flawed.the matched treatments or those matched forparticular lengths of time. Lesson: Any examination of groups defined by a program variable—such as having a mentor,Suspect Comparison 2: Comparing effects based on the length of the relationship, having a cross-relationship characteristics, such as short matches race match—is potentially plagued by selec-with longer matches or closer relationships with tion bias regardless of the evaluation designless close relationships. Grossman and Rhodes employed. Valid subgroup estimates can be(2002) examined the effects of different calculated only for subgroups defined on pre-lengths of matches using the BBBS evalua- program characteristics, such as gender or racetion data. In the first part of the paper, the or preprogram achievement levels or grades. Inresearchers reported the straightforward these cases, we can precisely identify and makecomparisons between outcomes of those comparisons to a comparable subgroup withinmatched less than 6 months, 6 to 12 months the control group (against which the treatmentand more than 12 months with the control subgroup may be compared).group’s outcomes. Although interesting,these simple comparisons ignore the poten- Suspect Comparison 3: Comparing the outcomes oftial differences among youth who are able to mentored youth with a control or matched compari-sustain their mentoring relationships for dif- son group when the sample attrition at the follow-upferent periods of time. If the different match assessment is substantial or, worse yet, when therelengths were induced randomly across pairs is differential attrition between the two groups.or the reasons for a breakup were unrelated Once again, unless those who were assessedto the outcomes being examined, then there at posttest were just like the youth for whomwould be no problem with the simple set of one does not have posttest data, the impactcomparisons. However, if, for example, youth estimates may be biased. Suppose youth fromwho cannot form relationships that last more the most mobile, unstable households are thethan five months are less able to get the adult ones who could not be located. Comparingattention and resources they need and conse- the “found” treatment and controls only pro-quently would do worse than longer-matched vides information about the impact of theyouth even without the intervention, then the program on youth from stable homes, not allfirst set of comparisons would produce biased youth. This is an issue of generalizability (i.e.,impact estimates. Indeed, when the research- external validity; see Shadish et al. 2002).ers statistically controlled for this potentialbias (using two-staged least squares regres- Differential attrition between the treat-sion, as discussed below), they saw evidence ment and the control (or participant andof the strong association of short matches comparison) groups is important because15
  16. 16. it also poses a threat to internal validity. Statistical Corrections for BiasesFrequently, researchers are able to reassess What if one wants to examine programa much higher fraction of program partici- impacts under these compromised situa-pants—many of whom may still be meeting tions—such as dealing with differential attri-with their mentors—than of the control or tion or examining the impact of mentoringcomparison group youth (whom no one has on youth whose matches have lasted morenecessarily tracked on a regular basis). For than a year? There are a variety of statisticalexample, if the control or comparison group methods to handle these biases. As long as theyouth demonstrate increased behavioral or assumptions underlying these methods hold,academic problems over the sample period, then the resulting adjusted impact estimatesparents may move their children to attend should be unbiased.other schools and thus make data collectionmore difficult. Alternatively, some treatment Let’s start by restating the basic hypothesizedfamilies may have decided not to move out of model:the area because the children had good men-tors. Under any of these scenarios, comparing (6) Yfu = a + bM + cYbl + dX + uthe reassessed comparison group youth withreassessed mentees could be a comparison of The value of outcome Yfu is determined by itsunlike individuals. value at baseline (Ybl), whether the child got mentoring (M), a vector of baseline covariatesTechnically, any amount of attrition—even that affect Y (X) and unmeasured factors (u).if it is equal across the two groups—puts Suppose one has information on a group ofthe accuracy of the impact estimates into mentees and a comparison group of youthquestion. The treatment group youth who matched on age, gender and school. Nowcannot be located may be fundamentally suppose, however, the youth who actually getdifferent from control group youth who can- mentors differ from the comparison youthnot be located. For example, the control in that they are more likely to be firstborn. Ifattriters might be the youth whose parents firstborn youth do better on outcome Y (evenenroll them in new schools because they are controlling for the baseline level of Y) andnot doing well, while the treatment attriters one fails to control for this difference, themight be the youth whose parents moved. estimated impact coefficient (b) will be biasedHowever, as long as one can show that the upward, picking up not only the effect ofbaseline characteristics of the two groups are mentoring on Y but also the “firstborn-ness”similar, most readers will accept the hypoth- of the mentees. The problem here is that Mesis that the two groups of follow-up respond- and u are correlated.ers are still similar. Similarly, if the baselinecharacteristics of the attriters are the same If one hypothesizes that the only way theas those of the responders, then we can be participating youth differ from the averagemore confident that the attrition was simply nonparticipating youth is on measurablerandom and that the impact on the respond- characteristics (Z)—for example, they areers is indicative of the impact on all youth. more likely to be firstborn or to be Hispanic— then including these characteristics in theLesson: Comparisons of treatment (or parti- impact regression model, Equation 7, willcipant) groups and control (or compari- fully remove the correlation between M andson) groups are completely valid only if the u, because M conditional on (i.e., controllingyouth not included in the comparison are for) Z is not correlated with u. Thus, Equationsimply a random sample of those included. 7 will produce an unbiased estimate of theThis assumption is easier to believe if the impact (b):nonincluded individuals represent a smallproportion of the total sample, the baseline (7) Yfu = a + bM + cYbl + dX +fZ + ucharacteristics of nonresponders are similarto those of responders and the proportions Including such extra covariates is a commonexcluded are the same for the treatment and technique. However if, as is usually the case,control groups. one suspects (or even could plausibly argue) that the mentored group is different in other ways that are correlated with outcomes and16
  17. 17. are unmeasured, such as being more socially “working” (i.e., having longer duration) butcompetent or from better-parented families, not related theoretically to the child’s gradesthen the estimate coefficient still will be or behaviors.potentially biased. Then one estimates the following regression of M:Instrumental Variables or Two-StagedLeast Squares (10) M = k + mZ + nX + cY + wUsing instrumental variables (IV), also called bltwo-staged least squares regression (TSLS), is where w is a random error. All of the covari-a statistical way to obtain unbiased (or consis- ates that will be included in the final impacttent) impact estimates in this more compli- Equation 7, X and Ybl are included in thecated position (see Stock and Watson 2003, first-stage regression along with the instru-Chapter 10). ments Z. A predicted value of M (M’ = k + mZ + nX + cYbl) is then computed for eachConsider the factors influencing M (whether sample member. The properties of regres-the child is a mentee): sion ensure that M’ will be uncorrelated with the part of Yfu not accounted for by Ybl , or X(8) M = k + mZ + nX + v (i.e., u). M’ then is used in Equation 7 rather than M. The second stage of TSLS estimateswhere Z represents variables related to M Equation 7 and the corrected standard errorsthat are unrelated to Y, X represents variables (see Stock and Watson 2003 for details). Thisrelated to M that are related to Y and v is the technique works only if one has good pre-random error. dictive instruments. As a rule of thumb, the F-test for the Stage 1 regression should haveSubstituting Equation 8 into Equation 6 a value of at least 10 if the instrument is toresults in: be considered valid.(9) Yfu = a + b(k + mZ + nX + v) + cYbl + u Baseline PredictionsThe problem is that v (the unmeasured ele- Suspect Comparison 2 illustrates how anyments related to participating in a mentoring examination of groups defined by a programprogram, such as having motivated parents) is variable, such as having a long relationshipcorrelated with u. This correlation will cause or a cross-race match, is potentially plaguedthe regression to estimate a biased value for b. by the type of selection bias we have beenHowever, using instrumental variables, we are discussing. Schochet et al. (2001) employedable to purge out v (the elements of M that a remarkably clever nonstatistical techniqueare correlated with u) to get an unbiased esti- for estimating the unbiased impact of amate of the impact. Intuitively, this technique program in such a case. The researchersconstructs a variable that is not M but is highly knew they wanted to compare the impactscorrelated with M and is not correlated with u of participants who would choose different(an “instrument”). versions of a program. However, because one could not know who among the controlThe first and most difficult step in using this group would have chosen each program ver-approach is to identify variables that 1) are sion, it appeared that one could not makerelated to why a child is in the group being a valid comparison. To get around thisexamined, such as being a mentee or a long- problem, they asked the intake workers whomatched child, and 2) are not related to the interviewed all applicants before randomoutcome Y. These are very hard to think of, assignment (both treatments and controls)must be measured for both treatment and to predict which version of the program eachcontrol youth, and need to be considered youth would end up in if all were offered thebefore data collection starts. Examples might program. The researchers then estimatedinclude the youth’s interests, such as sports or the impact of Version A (and similarly B)outdoor activities, or how difficult it is for the by comparing the outcomes of treatmentmentor to drive to the child’s home. These and control group members deemed to bevariables would be related to the match “A-likely” by the intake workers. Note that17
  18. 18. they were not comparing the treatmentyouth who actually did Version A to theA-likely control youth, but rather compar-ing the A-likely treatments to the A-likelycontrols. Because the intake workers werequite accurate in their predictions, thistechnique is convincing. For mentoring pro-grams, staff could similarly predict whichyouth would likely end up receiving mentorsor which would probably experience long-term matches based on the information theygathered during the intake process and theirknowledge of the program. This baseline(preprogram) characteristic then could beused to identify a valid comparison.18
  19. 19. Future DirectionsSynthesis by making comparisons that undermine theGood evaluations gauge a program’s impacts balanced nature of treatment and controlon a range of more to less ambitious out- groups. Numerous statistical techniques, suchcomes that could realistically change over as the use of instrumental variables, havethe period of observation given the likely been developed to help researchers estimateprogram dosage; they assess outcomes using unbiased program impacts. However, their usemeasures that are sensitive enough to detect requires forethought at the data collectionthe expected or policy-relevant change; and stage to ensure that one has the data neededthey use multiple measures and perspectives to make the required statistical assess an impact. Recommendations for ResearchThe crux of obtaining internally valid impact Given the aforementioned issues, researchersestimates is knowing what would have happened evaluating mentoring programs should con-to the members of the treatment group had sider the following suggestions:they not received mentors. Simple pre/postdesigns assume the participant would not have 1. Design for disaster. Assume things will gochanged—that the postprogram behavior would wrong. Random assignment will be under-have been exactly what the preprogram behav- mined. There will be differential attri-ior was without the program. This is a particu- tion. The comparison group will not belarly poor assumption for youth. Experimental perfectly matched. To guard against theseand quasi-experimental evaluations are more problems, researchers should think deeplyvalid because they use the behavior of the com- about how the two groups might differ ifparison group to represent what would have any of these problems were to arise, thenhappened (the counterfactual state). collect data at baseline that could be used for matching or making statistical adjust-The internal validity of an evaluation depends ments. It is also useful to give forethoughtcritically on the comparability of the treat- to which program subgroups will be exam-ment (or participant) and control (or compar- ined and to collect variables that couldison) groups. If one can make a plausible case help predict these program statuses, suchthat the two groups differ on a factor that also as the length of a match.affects the outcomes, the estimated impactmay be biased by this factor. Because random 2. Gather implementation or process information.assignment (with sufficiently large samples) This information is necessary to understandcreates two groups that are statistically equiva- one’s impact results—why the program hadlent in all observable and unobservable char- no effect or what type of program had theacteristics, evaluations with this design are, in effects that were estimated. These data andprinciple, superior to matched comparison data on program quality also can enablegroup designs; matched comparison groups one to explore what about the program ledcan, at best, assure comparability only on the to the change.important observable characteristics. 3. Use random assignment or match on motiva-Evaluators using matched comparison groups tional factors. Random assignment shouldmust always worry about potential selection- be a researcher’s first choice, but if quasi-bias problems; in practice, researchers con- experimental methods must be used,ducting random assignment evaluations researchers should try to match participantoften run into selection-bias problems too and comparison youth on some of the less19
  20. 20. obvious factors. The more one can con- 2. Collaborate with local researchers to conduct vince readers that the groups are equiva- impact studies periodically. When program lent on all the relevant variables, including staff feel it is time to conduct a more some of the hard-to-measure factors, such rigorous impact study, they should con- as motivation or comfort with adults, the sider collaborating with local research- more credible the impact estimates will be. ers. Given the time, skills and complexity entailed in conducting impact research, trained researchers can complete the taskRecommendations for Practice much more efficiently. An outside evalu-Given the complexities of computing valid ation also may be believed more readily.impact estimates, what should a program do Researchers, furthermore, can becometo measure effectiveness? a resource for improving the program’s ongoing monitoring system.1. Monitor key process variables or benchmarks. Walker and Grossman (1999) argued that not every program should conduct a rigorous impact study: It is a poor use of resources, given the cost of research and the relative skills of staff. However, programs should use data to improve their programming (see United Way of America’s Measuring Program Outcomes 1996 or the W. K. Kellogg Foundation Evaluation Handbook 2000). Grossman and Johnson (1999) recommended that mentoring pro- grams track three key dimensions: youth and volunteer characteristics, match length, and quality benchmarks. More specifically, programs could track basic information about youth and volunteers: what types and numbers apply, and what types and numbers are matched. They could also track information about how long matches last—for example, the proportion making it to various benchmarks. Last, they could measure and track benchmarks, such as the quality of the relationship (Rhodes et al. 2005). This approach allows programs to measure factors that (a) can be tracked eas- ily and (b) can provide insight about their possible impacts without collecting data on the counterfactual state. Pre/post changes can be a benchmark (but not an impact estimate), and one must be careful that the types of youth served and the general envi- ronment are stable. If the pre/post changes for cohorts of youth improve over time, for example, but the program now is serving less needy youth, the change in this bench- mark tells little about the effectiveness of the program (the counterfactual states for the early and later cohorts differ).20
  21. 21. ReferencesBloom, H. S. Grossman, J. B. and J. E. Rhodes1984 “Accounting for No-Shows in 2002 “The Test of Time: Predictors and Experimental Evaluation Designs.” Effects of Duration in Youth Mentoring Evaluation Review, 8, 225–246. Programs.” American Journal of Community Psychology, 30, 199–206.Branch, A. Y.2002 Faith and Action: Implementation of the Grossman, J. B. and J. P. Tierney National Faith-Based Initiative for High-Risk 1998 “Does Mentoring Work? An Impact Study Youth. Philadelphia: Branch Associates and of the Big Brothers Big Sisters Program.” Public/Private Ventures. Evaluation Review, 22, 403–426.Dennis, M. L. Orr, L. L.1994 “Ethical and Practical Randomized Field 1999 Social Experiments: Evaluating Public Experiments.” In J. S. Wholey, H. P. Hatry Programs with Experimental Methods. and K. E. Newcomer, eds., Handbook of Thousand Oaks, CA: Sage. Practical Program Evaluation. San Francisco: Jossey-Bass, 155–197. Rhodes, J., R. Reddy, J. Roffman and J. Grossman 2005 “Promoting Successful Youth MentoringDuBois, D. L., B. E. Holloway, J. C. Valentine and Relationships: A Preliminary ScreeningH. Cooper Questionnaire.” Journal of Primary2002 “Effectiveness of Mentoring Programs for Prevention, 147-167. Youth: A Meta-Analytic Review.” American Journal of Community Psychology, 30, 157– Rosenbaum, P. R. and D. B. Rubin 197. 1983 “The Central Role of the Propensity Score in Observational Studies for CausalDuBois, D. L., H. A. Neville, G. R. Parra and Effects.” Biometrika, 70, 41–55.A. O. Pugh-Lilly2002 “Testing a New Model of Mentoring.” Rosenberg, M. In G. G. Noam, ed. in chief, and J. E. 1979 “Rosenberg Self-Esteem Scale.” In K. Rhodes, ed., A Critical View of Youth Corcoran and J. Fischer (2000). Measures Mentoring (New Directions for Youth for Clinical Practice: A Sourcebook (3rd ed.). Development: Theory, Research, and Practice, New York: Free Press, 610–611. No. 93, 21–57). San Francisco: Jossey-Bass. Rossi, P. H., H. E. Freeman and M. W. LipseyDuBois, D. L. and M. J. Karcher, eds. 1999 Evaluation: A Systematic Approach (6th2005 Handbook of Youth Mentoring. Thousand edition). Thousand Oaks, CA: Sage. Oaks, CA: Sage Publications, Inc. Rubin, D. B.Dynarski, M., C. Pistorino, M. Moore, 1997 “Estimating Causal Effects from LargeT. Silva, J. Mullens, J. Deke et al. Data Sets Using Propensity Scores.” Annals2003 When Schools Stay Open Late: The National of Internal Medicine, 127, 757–763. Evaluation of the 21st Century Community Learning Centers Program. Washington, DC: Schochet, P., J. Burghardt and S. Glazerman US Department of Education. 2001 National Job Corps Study: The Impacts of Job Corps on Participants’ EmploymentEccles, J. S., C. Midgley and T. F. Adler and Related Outcomes. Princeton, NJ:1984 “Grade-Related Changes in School Mathematica Policy Research. Environment: Effects on Achievement Motivation.” In J. G. Nicholls, ed., The Development of Achievement Motivation. Greenwich, CT: JAI Press, 285–331.Grossman, J. B. and A. Johnson1999 “Judging the Effectiveness of Mentoring Programs.” In J. B. Grossman, ed., Contemporary Issues in Mentoring. Philadelphia: Public/Private Ventures, 24–47.21
  22. 22. Shadish, W. R., T. D. Cook and D. T. Campbell2002 Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.Stock, J. H. and M. W. Watson2003 Introduction to Econometrics. Boston: Addison-Wesley.Tierney, J. P., J. B. Grossman and N. L. Resch1995 Making a Difference: An Impact Study of Big Brothers/Big Sisters. Philadelphia: Public/ Private Ventures.United Way of America1996 Measuring Program Outcomes. Arlington, VA: United Way of America.Walker, G. and J. B. Grossman1999 “Philanthropy and Outcomes: Dilemmas in the Quest for Accountability.” In C. T. Clotfelter and T. Ehrlich, eds., Philanthropy and the Nonprofit Sector in a Changing America. Bloomington: Indiana University Press, 449–460.Weiss, C. H.1998 Evaluation. Upper Saddle River, NJ: Prentice Hall.W. K. Kellogg Foundation2000 W.K. Kellogg Foundation Evaluation Handbook. Battle Creek, MI: W. K. Kellogg Foundation.22
  23. 23. Public/Private Ventures2000 Market Street, Suite 600Philadelphia, PA 19103Tel: (215) 557-4400Fax: (215) 557-4469New York OfficeThe Chanin Building122 East 42nd Street, 42nd FloorNew York, NY 10168Tel: (212) 822-2400Fax: (212) 949-0439California OfficeLake Merritt Plaza, Suite 15501999 Harrison StreetOakland, CA 94612Tel: (510) 273-4600Fax: (510)