Gryffindor or Slytherin? The effect of an Oxford College
David Lawrence∗
Supervisor: Dr Johannes Abeler
Submitted in partial fulfilment of the requirements for the degree of
Master of Philosophy in Economics
Department of Economics
University of Oxford
Trinity Term 2016
∗I would like to thank my supervisor, Johannes Abeler, for the patient guidance, encouragement and advice he
has provided throughout my time as his student. I have been extremely lucky to have a supervisor who cared so
much about my work, and who responded to my questions and queries so enthusiastically and promptly. I am very
grateful to Dr Gosia Turner in Student Data Management and Analysis at Oxford University for providing the data and
answering my many questions about it. Valuable comments were received from Theres Lessing, Jonas Mueller-Gastell,
Leon Musolff and Matthew Ridley. This work was supported by the Economic and Social Research Council. Word
count: 29,904 (356 words on page 2, including footnotes, multiplied by 84 pages, including the title page)
Abstract
Students at Oxford University attend different colleges. Does the college a student
attends matter for their examination results? To answer this question, I use data on all
Oxford applicants and entrants between 2009 and 2013, focusing primarily on Preliminary
Examination (Prelims) results for 3 courses: Philosophy, Politics and Economics (PPE),
Economics and Management (E&M) and Law. I use two methods to account for the
possibility that student ability differs systematically between colleges. First, I control for
“selection on observables” by running an OLS regression on college dummy variables
and variables capturing almost all information available to admissions tutors. Results
show that colleges matter statistically and practically. Colleges have a modest impact on
average Prelims scores, similar to the impact secondary schools have on GCSE results. A
one standard deviation increase in college effectiveness leads to a 0.11 standard deviation
increase in PPE average Prelims score. The equivalent figures are 0.15 for E&M, 0.14 for
Law and 0.09 for all courses combined. Second, I take advantage of a special feature of the
Oxford admissions process – that “open applicants” are randomly assigned to colleges –
to control for “selection on observables and unobservables”. Results suggest differences in
college effectiveness are large and accounting for unobservable ability can change college
effectiveness estimates considerably. However, the results are very imprecise so it is
difficult to draw strong conclusions. I also test whether my college effectiveness estimates
can be explained by college characteristics and find college endowment and peer effects,
operating through the number of students per course within a college, are related to college
effectiveness.
Keywords: Oxford, college effectiveness, selection bias, selection on observables and
unobservables, examination results
Contents
1 Introduction 1
1.1 Prior Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Institutional Background 8
3 Theoretical Model 9
3.1 Defining College Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 College Admissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Applications and Applicant Ability . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 Application Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.3 Enrolment Probabilities and Expected Exam Results . . . . . . . . . . . . . . . 12
3.2.4 The College Admissions Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Econometric Models 16
4.1 Model 1 – Norrington Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Model 2 – Selection on Observables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Model 3 – Selection on Observables and Unobservables . . . . . . . . . . . . . . . . . . 25
5 Data 29
5.1 Why use Four Datasets? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Choice of Outcome Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3 Choice of Control Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.4 Sample Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.5 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.5.1 Testing Assumptions for Selection on Observables and Unobservables . . . . . 43
6 Results 45
6.1 Results for Norrington Table Plus and Selection on Observables . . . . . . . . . . . . 45
6.2 Robustness Checks for Norrington Table and Selection on Observables . . . . . . . . . 54
6.2.1 Alternative Outcome Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2.2 Interval Scale Metric Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2.3 Heterogeneity in College Effectiveness across Students of Different Types . . . 59
6.3 Results for Selection on Observables and Unobservables . . . . . . . . . . . . . . . . . 60
7 Characteristics of Effective Colleges 65
8 Discussion and Limitations 70
9 Conclusion and Future Work 72
A Proof of Proposition 1 79
List of Tables
1 Information Available in each Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2 Description of Control Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3 Sample Selection: PPE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Sample Selection: All Subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Sample Selection: E&M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6 Sample Selection: Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7 Application, Offer and Enrolment Statistics: PPE and E&M . . . . . . . . . . . . . . . 37
8 Application, Offer and Enrolment Statistics: Law and All Subjects . . . . . . . . . . . 38
9 Mean Applicant and Exam Taker Characteristics: PPE . . . . . . . . . . . . . . . . . 39
10 Mean Applicant and Exam Taker Characteristics: E&M . . . . . . . . . . . . . . . . . 40
11 Mean Applicant and Exam Taker Characteristics: Law . . . . . . . . . . . . . . . . . . 41
12 Mean Applicant and Exam Taker Characteristics: All Subjects . . . . . . . . . . . . . 42
13 Tests for Differences in Mean and Variance of Applicant Ability across Colleges . . . . 42
14 P-values from Balance Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
15 Regressions: PPE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
16 Regressions: E&M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
17 Regressions: Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
18 Regressions: All Subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
19 Correlation in College Effects across Courses . . . . . . . . . . . . . . . . . . . . . . . 54
20 Alternative Dependent Variable Regressions: PPE . . . . . . . . . . . . . . . . . . . . 56
21 Alternative Dependent Variable Regressions: E&M . . . . . . . . . . . . . . . . . . . . 57
22 Alternative Dependent Variable Regressions: Law . . . . . . . . . . . . . . . . . . . . . 58
23 P-values from Tests for Heterogeneity in College Effects across Students . . . . . . . . 60
24 Selection on Observables and Unobservables Results for various λ1: PPE, E&M and
Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
25 Selection on Observables and Unobservables Results: All Subjects, English, Maths
and History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
26 Second Stage Regression Results: Impact of Endowment . . . . . . . . . . . . . . . . . 67
27 Second Stage Regression Results: Evidence of Peer Effects . . . . . . . . . . . . . . . . 68
List of Figures
1 Applicant Ability and College Admissions Decisions . . . . . . . . . . . . . . . . . . . 15
2 College Ranking by Course: Norrington Table Plus vs Selection on Observables . . . . 48
3 Comparison of Selection on Observables College Ranking across Courses . . . . . . . . 54
4 Comparison of College Rankings across Models: All Subjects . . . . . . . . . . . . . . 64
1 Introduction
The popular Harry Potter novels of J.K. Rowling are set in the fictional Hogwarts School of Witchcraft
and Wizardry where all the students are magically assigned by a “sorting hat” to one of four houses:
Gryffindor, Slytherin, Hufflepuff, and Ravenclaw. Oxford University is organised in a similar way to
Hogwarts. Oxford divides students into colleges, just as Hogwarts divides students into houses. The
college a student attends can influence not only the facilities available to them (like catering services
and libraries), their accommodation and their peers but also the teaching they receive.
In this paper I address two basic questions that arise in the context of Oxford colleges. First,
to what extent do colleges “make a difference” to student outcomes? Second, are any differences in
college effectiveness1
captured by college characteristics such as endowment, age and size? To answer
these questions I use admissions and examination (exam) data on all Oxford applicants and entrants
between 2009 and 2013, focusing on how exam results (specifically first year “Prelims” results) vary
across colleges in three particular courses: Philosophy, Politics and Economics (PPE), Economics
and Management (E&M) and Law as well as across all courses (“All Subjects”).
The key complication in answering these questions is selection bias. Selection into colleges is
non-random and thus student ability may differ systematically between colleges. Selection occurs:
(i) at the application stage (students choose to apply to one college and not to others); (ii) at the
admissions stage (admission tutors take decisions to make offers to some students and not others);
and (iii) at the enrolment stage (students with offers decide whether they want to accept the offer).
Non-random selection into colleges can be based on observable characteristics (e.g. prior attainment)
and unobservable characteristics (e.g. motivation) which may themselves be correlated with exam
results. Failure to adequately control for such selection would lead to biased estimates of college
effectiveness, favouring colleges with higher ability students.
To overcome the problem of selection bias I employ two empirical methods. First, I estimate
an OLS regression which identifies college effects only under a “selection on observables” assump-
tion. Detailed data on almost all variables used by admissions tutors provides some support for
1I use the term “college effectiveness” to mean the contribution of colleges to student examination results. I use
“college effectiveness”, “college effect” and “college quality” interchangeably.
this assumption. Nevertheless, concern remains that “selection on unobservables” may bias college
effectiveness estimates.
Second, I take advantage of a special feature of the Oxford admissions process: some applicants
choose to make an “open application”. These applicants do not apply directly to a college; instead
their application profiles are randomly allocated between those colleges that receive relatively few
direct applicants. Intuitively, random assignment implies all colleges receive open applicants with
equal ability on average. Hence, in relative terms, colleges accepting a large proportion of open
applicants allocated to them must have received weak direct applications and have low admissions
standards, while colleges that accept a low proportion of open applicants must have received strong
direct applications and have high admissions standards. I formalise this intuition in a theoretical
model. Given additional assumptions concerning the distribution of applicant ability, this method
can account for “selection on observables and unobservables”. Exam results differences across colleges
remaining after controlling for both observables and unobservables can be considered a measure of
college effectiveness or alternatively college “value-added”.2
My results reveal colleges matter. A simple comparison of average exam results suggests large
differences between colleges. When I account for observable student characteristics, exam result
differences shrink because high ability students tend to attend more effective colleges. The vast
majority of variation in exam results is due to between-student differences. However, even after
controlling for observables there remains strong evidence that colleges differ in their effectiveness
in boosting student exam results – college effectiveness differences are statistically and practically
significant in all courses I consider. A one standard deviation increase in college effectiveness leads
to a 0.11 standard deviation increase in Prelims average score in PPE (a 0.65 mark increase). This
would be enough to move a 50th percentile student up to the 55th percentile. The estimated standard
deviation of college effectiveness is 0.15 for E&M, 0.14 for Law and 0.09 across All Subjects. College
effectiveness differences are comparable to school effectiveness differences and slightly lower than
teacher effectiveness differences.
2Although widely used, the “value-added” term is questionable because inputs and outputs are measured in different
units (Goldstein and Spiegelhalter, 1996).
I also produce course-specific college rankings that improve on the Norrington table3
as they
account for observable student characteristics. College rankings at an aggregate level are of limited
use because college effectiveness differs across courses – hence I focus attention on courses within
colleges. Course-specific college rankings are subject to large confidence intervals because of the low
number of students per course at each college.
Accounting for selection on unobservable student characteristics would likely further change the
results. Unfortunately for PPE, E&M and Law, estimation error prevents me from obtaining point
estimates for the effectiveness of each college (as only a small number of open applicants enrol
at Oxford). Instead I present college effectiveness estimates for different parameterisations of the
relationship between prior ability and exam results. I do obtain college effectiveness estimates for
some other courses (English, Maths and History) and for All Subjects combined.4
The results suggest
variation in college effectiveness remains large and that unobservable ability can dramatically change
college effectiveness estimates. However, the estimates are imprecise so it is difficult to reach strong
conclusions.
Having established that college effects exist, I use a second stage regression to examine whether
they can be explained by college characteristics. The most interesting finding is evidence that peer
effects, operating through the number of students per college studying the same course, contribute
to college effectiveness. Reverse causality is also possible – if a college happens to be strong in one
subject for whatever reason, it is likely to hire more fellows and thus increase the size of
the cohort at that college. If there are benefits to clustering together students studying the same
subject then a potential policy implication would be to close small, under-performing courses within
a college. There is also evidence that richer colleges are more effective than poorer colleges. However,
given that college effectiveness is imperfectly correlated across courses, it seems likely that college
effectiveness is primarily determined by course-specific variables related to teaching and peer effects.
Overall, much of the variation in college effectiveness remains unexplained.
The results of this study may be of interest to a number of different audiences. First, it may
3The Norrington table, published each year, documents the degree outcomes of students at each Oxford college.
It ranks colleges using the Norrington score, devised in the 1960s by Sir Arthur Norrington, which attaches a score to
degree classifications and expresses the overall calculation for each college as a percentage.
4Though aggregating across courses makes the random assignment of open applicants far less credible.
interest economists studying the educational production function. At a school level, economists have
struggled to identify a systematic relationship between school resources and academic performance.
This study informs us about the relationship between college resources and academic performance.
Second, this study can help prospective students decide which college to apply to. An Oxford
college education is an experience good, with quality difficult to observe in advance and only really
ascertained upon consumption. Thus the application decisions of prospective students are likely to
be based on imperfect information. This paper shows attending a high quality college can boost
students’ exam results which is important given the substantial economic return to better university
exam performance. Better exam performance at UK universities is closely related to entering further
study (Smith et al., 2000), employment (Smith et al., 2000), industry choice (Feng and Graetz,
2015), short-run earnings (Feng and Graetz, 2015; Naylor et al., 2015) and lifecycle earnings (Walker
and Zhu, 2013). For example, Feng and Graetz (2015) study students from the London School of
Economics and find that, 12 months after graduating, a First causes a 3% higher expected wage than an
Upper Second, and an Upper Second a 7% higher expected wage than a Lower Second. Thus there should
be demand from applicants for third-party evaluations
of college quality just as there is demand for league tables of university quality (Chevalier and Jia,
2015). My college effectiveness estimates help to fill this gap in the market – they improve on the
unadjusted college rankings currently available to prospective students in the Norrington table.5,6
Third, my analysis may be of interest to Oxford colleges themselves. Colleges need to measure past
effectiveness relative to other colleges for a number of reasons. It allows them to learn best practices
from, and share problems with, other colleges, evaluate their own practices, allocate resources more
efficiently and plan and set targets for the future. Yet currently colleges receive scant feedback on their
past performance in raising exam results and the information they do receive from the Norrington
table can be misleading or demoralising due to selection bias – Norrington table rank may be more
5Of course, exam based rankings are only a starting point for application decisions and should complement other
information about colleges’ quality (such as cost, location, accommodation and facilities) from publications, older
siblings, friends at Oxford and personal visits to colleges.
6More informed students may create dynamic effects as they would then be able to “vote with their feet” like
consumers in a Tiebout model. On the one hand, this may drive up college quality by increasing competition between
colleges. On the other hand, as pointed out by Lucas (1980), when criticising the Norrington table, it may increase
inequality in raw exam results between colleges because lower ranked colleges would find it difficult to recruit high
ability students. Increased competition may also discourage colleges from cooperating with each other.
informative about who their students are than how they were taught. My estimates provide a better
picture of a college’s performance. Furthermore, my analysis suggests college effectiveness may be
increased by admitting a larger number of students per course; perhaps colleges should concentrate on
a narrower range of courses. Even small improvements in college effectiveness are important, because
they might be cumulative and because they refer to a large number of students.7
1.1 Prior Literature
This is the first study of differences between Oxford colleges. However my paper is related to various
literatures interested in measuring differences in effectiveness across teachers, schools and universities.
First, there is a large and active literature (much done by economists) on the value-added of
teachers in schools (Hanushek, 1971; Chetty et al., 2013a,b; Koedel et al., 2015) and universities
(Carrell and West, 2008; Waldinger, 2010; Illanes et al., 2012; Braga et al., 2014). Empirical evid-
ence shows students are not randomly assigned to teachers, even within schools or universities (e.g.
Rothstein (2009)). To account for non-random assignment, teacher value-added models use similar
methods to those in this paper – either “selection on observables” where observables include student
and family input measures and a lagged standardised test score or random assignment of students
to teachers (Nye et al., 2004; Carrell and West, 2008). The main conclusions of teacher value-added
studies also mirror my findings. Teachers, like colleges, vary in their effectiveness (Nye et al., 2004;
Ladd, 2008; Hanushek and Rivkin, 2010; Braga et al., 2014). Within schools, Nye et al. (2004)
review 18 early studies of teacher value-added. Using the same method I use (though I correct for
measurement error), they find a median standard deviation of teacher effectiveness of 0.34. Hanushek
and Rivkin (2010) review more recent studies and report estimates, adjusted for measurement er-
ror, that range from 0.08 to 0.26 (average 0.11) using reading tests and 0.11 to 0.36 (average 0.15)
in maths. They conclude the literature leaves “little doubt that there are significant differences in
teacher effectiveness” (p. 269). Within universities, Braga et al. (2014) find a one standard deviation
increase in teacher quality leads to a 0.14 standard deviation increase in Economics test scores and a
7Estimates of effectiveness similar to mine are often used for teacher and school accountability purposes. However,
for reasons detailed in section 8, I do not believe my college effect estimates should be used to hold colleges to account.
0.22 standard deviation increase in Law and Management test scores. Overall, teacher effects appear
slightly larger than the college effects I find (0.09 - 0.15). However, there is no consistent relationship
between teacher effectiveness and observable teacher characteristics such as education, experience or
salary (Burgess, 2015).
Second, there is a literature on the value-added of schools (though only some by economists)
(Aitkin and Longford, 1986; Goldhaber and Brewer, 1997; Ladd and Walsh, 2002; Rubin et al., 2004;
Reardon and Raudenbush, 2009). Again similar empirical strategies are used, though non-economists
tend to use random effect models whereas economists favour fixed effect models. Although school
effectiveness is found to impact test scores, there is a consistent finding that schools, like colleges,
have less impact on test scores than teachers with most estimates in the range 0.05-0.20 (Nye et al.,
2004; Konstantopoulos, 2005; Deutsch, 2012; Deming, 2014).8
In one of the most credible studies,
Deutsch (2012) takes advantage of a school choice lottery to estimate a school effect size, adjusted
for measurement error, of 0.12. School effect sizes seem similar to college effect sizes. Thomas et al.
(1997), for example, find the standard deviation in total GCSE performance between schools is 0.10
when pooled across all subjects and is higher in individual subjects ranging from 0.13 in English to
0.28 in History. This closely mirrors my results in terms of the size of school (college) effects,
the variation across subjects (courses) and the fact there is less variation in effectiveness once subjects
(courses) are pooled together. Therefore the impact of colleges on exam results appears similar to
the impact of schools on GCSE results. This literature also finds school resources have only a weak
relationship with test scores, leaving much variation in school effectiveness unexplained (Hanushek,
2006; Burgess, 2015).
Third, a small number of studies have attempted to measure university effects on degree out-
comes (Bratti, 2002), student satisfaction (Cheng and Marsh, 2010), standardised test scores (Klein
et al., 2005) and earnings (Miller III, 2009; Cunha and Miller, 2014). In the attempt to account
for selection bias, “selection on observables” methods have been used exclusively. Results suggest
large unconditional differences in outcomes across universities with observable student covariates
8School effect sizes differ depending on the age of the students – they are highest in kindergarten, fall as students
become older until bottoming out around GCSE age and rise again in the sixth form (e.g. Goldstein and Sammons
(1997) and Fitz-Gibbon (1991)).
accounting for a substantial portion, but not all, of these differences (Miller III, 2009; Cunha and
Miller, 2014). Observable university characteristics explained only a small proportion of variation in
university value-added (Bratti, 2002).
Beyond “value-added”, this paper is related to the research done by economists on the effect on
earnings from attending a higher “quality” university, where “quality” is usually defined in terms of
mean entry grade, expenditure per student, student/staff ratio and/or ranking in popular league
tables (Dale and Krueger, 1999; Black and Smith, 2004, 2006). Conceptually measuring the return
to institution quality is quite different to my analysis focusing on institution effectiveness. Whereas I
attempt to estimate quality directly, this literature takes quality as given and attempts to estimate the
labour market return to a higher quality. Nevertheless, the university quality literature is worth
considering because it has found interesting ways to tackle the non-random selection of students into
universities (better students sort into higher quality colleges). Studies tend to aggregate universities
into a small number of quality groups, thereby reducing the dimensionality of the selection problem.
This facilitates the use of selection on observables based on OLS (James et al., 1989; Black et al.,
2005), selection on observables based on matching (Black and Smith, 2004; Chevalier, 2014) and
methods to account for selection on unobservables including regression discontinuity (Saavedra, 2009;
Hoekstra, 2009), instrumental variables (Long, 2008) and applicant group fixed effects (Dale and
Krueger, 1999, 2014; Broecke, 2012).9
However, no study in this literature has had the opportunity
to exploit random assignment, as I am able to do.
The rest of the paper is organised as follows: Section 2 briefly explains the institutional back-
ground. Section 3 lays out a theoretical model of Oxford admissions that defines college effects.
Section 4 explains the problem of selection bias and outlines econometric models that account for
“selection on observables” and “selection on observables and unobservables” respectively. Section 5
describes the data. Section 6 presents the results. Section 7 considers whether college characteristics
9I considered, but ultimately rejected, using these methods to account for selection on unobservables. For instance,
matching could be applied to Oxford colleges with only minimal complications, such as in Davison (2012), but would
do nothing to help account for unobservables. Instrumental variables requires finding over 30 valid instruments, one
for each college, which is a formidable challenge. Applicant group fixed effects work better in a university context
than a college context because they face a multicollinearity problem when students apply to only one college (see
discussion in Miller III (2009)). In addition, applicant group fixed effects make the strong assumption that students
apply to colleges in a rational way. I did estimate regressions with applicant group fixed effects but the results were
unconvincing and are not reported.
can explain differences in college effectiveness. Section 8 discusses limitations and section 9 concludes.
Proofs are collected in the appendix.
2 Institutional Background
The college model is one of the oldest forms of academic organisation in existence. It originated 700
years ago in the UK and was long confined to the universities of Oxford, Cambridge, and Durham.
Today however, college systems have spread worldwide. College systems now operate at several other
British universities including Bristol, Kent and Lancaster. In the US, Harvard, Yale and others have
established similar college systems. College systems are also common in Canada, Australia, and New
Zealand and are present in numerous other countries, from Mexico to China (O’Hara, 2016).
Oxford University can be thought of as consisting of two parts – (1) a Central Administration
and (2) the 32 colleges.10
The Central Administration is composed of academic departments, re-
search centres, administrative departments, libraries and museums. The Central Administration (i)
determines the content of the courses within which college teaching takes place, (ii) organises lectures,
seminars and lab work, (iii) provides resources for teaching and learning such as libraries, laborator-
ies, museums and computing facilities, (iv) provides administrative services and centrally managed
student services such as counselling and careers and (v) sets and marks exams, and awards degrees.
The colleges are self-governing, financially independent and are related to the Central Administra-
tion in a federal system not unlike the relationship between the 50 US states and
the US Federal Government. The colleges (i) select and admit undergraduate students, (ii) provide
accommodation, meals, common rooms, libraries, sports and social facilities, and pastoral care for
their students and (iii) are responsible for tutorial teaching for undergraduates. Thus Oxford colleges
play a significant role in university life, making Oxford an ideal place to study college effects.
10There are also five Permanent Private Halls at Oxford admitting undergraduates. They tend to be smaller than
colleges, and offer fewer subjects but are otherwise similar. From now on I include them when I refer to “colleges”.
3 Theoretical Model
In this section I develop a theoretical model of college admissions. The model serves two main
purposes. First, it allows me to formally define the “effect” of attending an Oxford college. A failure
to clearly define the causal effect of interest has been a criticism of much of the school effect literature
(Rubin et al., 2004; Reardon and Raudenbush, 2009). Second, the model motivates the empirical
strategies I employ to identify college effects in section 4.
3.1 Defining College Effects
There are a total of $N$ applicants to Oxford indexed $i = 1, 2, \ldots, N$ and $J$ colleges indexed $j = 1, 2, \ldots, J$. For each student $i$ there exist $J$ potential exam results $Y_i^1, Y_i^2, \ldots, Y_i^J$, where $Y_i^j$ denotes the exam result at some specified time (such as the end of year 1) that would be realised by individual $i$ if he or she attended college $j$. Let each potential exam result depend on pre-admission ability $A_i$, a $1 \times K$ row vector. $A_i$ permits multiple sources of ability which may be observable or unobservable. It should be interpreted broadly to include not only cognitive ability but also motivation. Potential exam results also depend on college effects $c_{ij}$, which are allowed to vary across students, and a possibly heteroskedastic random shock $e_{ij}$, uncorrelated with ability and representing measurement error in exam results such as illness on the day of the exam and subjective marking of exams. The potential exam result obtained by an individual $i$ who attends college $j$ is:
$$Y_i^j = Y_i^j(A_i, c_{ij}, e_{ij}). \tag{1}$$
For student $i$ the causal effect of attending college $j$ as opposed to college $k$ is the difference in potential outcomes $Y_i^j - Y_i^k$. The main focus of this paper is on estimating the average causal effect of college $j$ relative to a reference college $k$ for the subpopulation of $n \le N$ students who actually enrol at Oxford (denoted by the set $E$). This average causal effect of college $j$ relative to college $k$ is:
$$\bar\beta_j = c_j - c_k = \frac{1}{n}\sum_{i \in E} c_{ij} - \frac{1}{n}\sum_{i \in E} c_{ik}. \tag{2}$$
Focusing on the subpopulation of students who attend Oxford, rather than the full population
of applicants, makes sense because many applicants (perhaps due to weak prior achievement at
school) may have only a low chance of attending Oxford. The definition of college effects relies on two
assumptions.
Assumption 1. “Manipulability”: $Y_i^j$ exists for all $i$ and $j$.
Assumption 1 is the assumption of manipulable college assignment (Rosenbaum and Rubin, 1983;
Reardon and Raudenbush, 2009). It says each student has at least one potential outcome per college.
Intuitively to talk about the effect of college j one needs to be able to imagine student i attending
college j, without changing the student’s prior characteristics Ai. “Manipulability” would be violated,
for instance, if a college only accepted women implying the potential outcome of a male student at that
college may not exist. This assumption is relatively unproblematic at Oxford (certainly compared
to schools or universities). Oxford colleges are not generally segregated by student characteristics11
so it is not difficult to imagine Oxford applicants attending different colleges. Randomness in the
admissions process also makes it possible that all applicants have at least some chance, however
small, of being offered a place at an Oxford college.
Assumption 2. “No interference between units”: $Y_i^j$ is unique for all $i$ and $j$.
Assumption 2 says each student possesses a maximum of one potential exam result in each college,
regardless of the colleges attended by other students (Reardon and Raudenbush, 2009). The “no
interference between units” assumption of Cox (1958) is one part of the “Stable Unit Treatment Value
Assumption” (or SUTVA; Rubin, 1978). Strictly speaking, this means that a given student’s exam
result in a particular college does not depend on who his college peers are (or even how many of them
there are). Evidence of peer effects in education makes this assumption questionable (e.g. Feld and
Zölitz, 2015). Without it, however, we must treat each student as having $J^N$ potential outcomes,
one for each possible permutation of students across colleges. Thus adopting the no interference
assumption makes the problem of causal inference tractable (at the cost of some plausibility). The
consequences of violations of this assumption on the estimates of college effects are unclear, since
without it the causal effects of interest are not well-defined.
11St Hilda’s, the last all-women’s college, started accepting men in 2008. An exception is colleges that accept only
mature students, such as Harris Manchester.
3.2 College Admissions
3.2.1 Applications and Applicant Ability
Responsibility for admissions is devolved at the college level, then again at the course level. To save
notation, let all applicants apply for the same course. College j is allocated (receives the application
profiles of) Dj direct applicants and Oj open applicants to consider for admission.
The direct applicants received by college j are the students who expressed a preference for college j
on their application forms - they applied directly to college j. In total there are D1+D2+. . .+DJ = D
direct applicants to Oxford. Let the ability of direct applicants to each college be normally distributed, with the mean ability of direct applicants allowed to differ between colleges but with the variance constrained to be the same for all colleges. In particular, let the ability of direct applicants to college $j$ be distributed $A^D_j \sim N(\mu^D_j, 1)$, where $A^D_j$ is the ability of a direct applicant to college $j$ and $\mu^D_j$ is
the mean ability of direct applicants to college $j$.
Colleges also receive open applicants. In total there are $O_1 + O_2 + \ldots + O_J = O$ open applicants to Oxford and their ability follows a standard normal distribution: $A^O \sim N(0, 1)$. Oxford admissions procedures require that all open applicants are pooled together by the Undergraduate Admissions Office. Open applicants are then randomly drawn out, one at a time, and are allocated to the college with the lowest direct-applicant-to-place ratio. This random assignment to colleges is the key to my selection on unobservables identification procedure. I present evidence in section 5.5.1 that supports random assignment. Since each college receives a random sample (of size $O_j$) of open applicants, the ability of open applicants sent to college $j$, denoted $A^O_j$,
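The allocation rule for open applicants described above can be sketched in a few lines. This is a minimal illustration only: the data structures, college names and numbers are hypothetical, and the real administrative process is more involved.

```python
import random

def allocate_open_applicants(open_applicants, colleges, seed=None):
    """Allocate pooled open applicants, one at a time, to the college with the
    lowest applicant-to-place ratio (hypothetical data structures)."""
    rng = random.Random(seed)
    pool = list(open_applicants)
    rng.shuffle(pool)  # open applicants are drawn out of the central pool in random order
    allocation = {name: [] for name in colleges}
    # Applications each college must consider so far: direct applicants plus open applicants already allocated.
    considered = {name: c["direct_applicants"] for name, c in colleges.items()}
    for applicant in pool:
        # The next open applicant goes to the college with the lowest applicants-per-place ratio.
        target = min(colleges, key=lambda n: considered[n] / colleges[n]["places"])
        allocation[target].append(applicant)
        considered[target] += 1
    return allocation

# Example with made-up numbers: College B starts with far fewer direct applicants per place.
colleges = {"College A": {"direct_applicants": 40, "places": 8},
            "College B": {"direct_applicants": 12, "places": 6}}
print(allocate_open_applicants([f"open_{i}" for i in range(10)], colleges, seed=0))
```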
3.2.2 Application Profiles
Admissions at Oxford colleges are conducted by faculty, who are also researchers and teachers, in the
subject a student applies for (referred to as “admissions tutors”). Applicant ability Ai and college
effects cij are not perfectly observable to admissions tutors. Instead colleges observe an applicant’s
application profile (“UCAS form”) which includes both “hard characteristics” such as GCSE results, A-
level results and the results of Oxford-specific admission tests and “soft” characteristics such as school
reference letters and evidence of enthusiasm in the personal statement.12
The application profile does
not include whether an applicant was a direct applicant or an open applicant. Application profiles
can be thought of as a noisy signal of the ability of each applicant. Denote the characteristics of
applicant i seen by admission tutors as a 1 x K row vector xi = Ai −ri where ri is a 1 x K row vector.
Each of the K elements in xi provides a signal about a component of ability Ai. For example, maths
GCSE result provides a signal of maths ability. Assume that each element of xi is an unbiased signal
for its equivalent element in Ai such that E(Ai|xi) = Ai. Also assume xi and cij are independent,
that is, application profile xi provides admissions tutors with no information about college effects cij
(This assumption is relaxed in some of the empirical work). Let X denote the support of x and let
Xj
denote the support of the application profiles for students allocated to college j. Let ηj(x) be the
number of students allocated to college j with application profile x.
3.2.3 Enrolment Probabilities and Expected Exam Results
Let αj(x) denote the probability that a student with application profile x, upon being offered admission
at college j, eventually enrols. Let Yj(x) denote the expected exam result of an applicant with
application profile x who enrols at college j. This allows acceptance or rejection of an offer from
college j to provide extra information about the ability (and expected exam result) of an applicant.
Colleges need to condition on acceptance when making admissions decisions in order to make a
correct inference about the student’s ability because of an “acceptance curse”: the student might
accept college j’s admission offer because she is of low ability and has been rejected by other universities (either
UK or foreign).
3.2.4 The College Admissions Problem
Define an admission protocol for college $j$ as a probability $p_j : X^j \to [0, 1]$ such that an applicant allocated to college $j$ with application profile $x$ is offered admission at college $j$ with probability $p_j(x)$. Each college has a capacity constraint, $K_j$ (the maximum number of students college $j$ can
12Information on ethnicity and parental social class is also collected on the UCAS form but this information is not
available to admissions tutors when they decide on admissions.
admit). College $j$ thus chooses the set of $p_j(x) \in [0, 1]$ to maximise their objective function:
$$\max_{p_j(x)} \; \sum_{x \in X^j} p_j(x)\, \alpha_j(x)\, \eta_j(x)\, Y_j(x) \tag{3}$$
subject to their capacity constraint:
$$\sum_{x \in X^j} p_j(x)\, \alpha_j(x)\, \eta_j(x) \le K_j. \tag{4}$$
This is almost identical to the university admissions decision problem studied by Bhattacharya
et al. (2014) (see also Fu (2014)). The college objective is to maximise total expected exam results
among the admitted applicants. It implicitly assumes “Fair Admissions” (Bhattacharya et al., 2014),
in the sense that it gives equal weight to the exam results of all applicants, regardless of pre-admission
characteristics. This assumption is plausible at Oxford because Oxford emphasises that applicants
are admitted strictly based on academic potential. Extra-curricular activities, such as sport and
charity work, are given no weight unless they are related to academic potential. “Fair Admissions”
is consistent with the “Common Framework” which guides undergraduate admissions at Oxford:
“Admissions procedures in all subjects and in all colleges should [. . . ] ensure applicants are selected
for admission on the basis that they are well qualified and have the most potential to excel in their
chosen course of study” (Lankester et al., 2005).
The solution to college j’s admissions problem takes the form described below in Proposition 1,
which holds under Condition 1: admitting everyone with an expected exam result Yj(x) ≥ 0 will
exceed capacity in expectation (Bhattacharya et al., 2014).
Condition 1. $\alpha_j(x) > 0$ for any $x \in X^j$ and for some $\delta > 0$ we have
$$\sum_{x \in X^j} \alpha_j(x)\, \eta_j(x)\, \mathbf{1}\{Y_j(x) \ge 0\} \ge K_j + \delta.$$
Proposition 1. Under Condition 1 the solution to college $j$'s admissions problem is:
$$p_j^{OPT} = \begin{cases} 1 & \text{if } Y_j(x) \ge z_j \\ 0 & \text{if } Y_j(x) < z_j \end{cases}$$
where
$$z_j = \min\Big\{ r : \sum_{x \in X^j} \alpha_j(x)\, \eta_j(x)\, \mathbf{1}\{Y_j(x) \ge r\} \le K_j \Big\}.$$
Proof in Appendix.
The model shows that college j uses a cut-off rule (admission threshold). The result is intuitive.
Colleges first rank applicants by their expected exam results (conditional on acceptance). Colleges
then admit applicants whose expected exam results are the largest, followed by those for whom it is
the next largest and so on till all places are filled. An admissions policy for the ranked groups {pj(x)}
takes the form {1, . . . , 1, 0, . . . , 0}. Since ability is continuously distributed and x is an unbiased signal,
x is also continuously distributed. Hence there are no point masses in the distribution of Yj(x) and
there is no need to account for ties.
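The cut-off rule can be made concrete with a minimal sketch for a finite pool of application profiles. The inputs are hypothetical, and the finite pool is only an approximation of the proposition, which is stated in terms of the distribution of profiles allocated to the college.

```python
def admission_cutoff(applicants, capacity):
    """Cut-off rule in the spirit of Proposition 1 for a finite pool of profiles.

    `applicants` is a list of (expected_result, enrol_prob) pairs, i.e. Y_j(x) and alpha_j(x)
    for each profile allocated to the college (hypothetical inputs). The college admits every
    profile whose expected result is at least the cut-off, choosing the lowest cut-off at which
    expected enrolment stays within `capacity`."""
    ranked = sorted(applicants, key=lambda a: a[0], reverse=True)  # best expected results first
    expected_enrolment = 0.0
    cutoff = float("inf")  # admit nobody if even the best profile would break capacity
    for expected_result, enrol_prob in ranked:
        if expected_enrolment + enrol_prob > capacity:
            break
        expected_enrolment += enrol_prob
        cutoff = expected_result  # lowest expected result admitted so far
    return cutoff

# Example: a capacity of 2 expected enrolments admits the two strongest profiles.
pool = [(68.0, 0.9), (66.5, 0.8), (64.0, 0.7), (61.0, 0.9)]
print(admission_cutoff(pool, capacity=2.0))  # -> 66.5
```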
As noted by Bhattacharya et al. (2014), the probability of a student enrolling having received an
offer from college j affects the admission rule only through its impact on the cut-off; the intuition is
that individuals who do not accept an offer of admission do not take up any capacity and this is taken
into account in the admission process. Also note that the assumptions imply, perhaps unrealistically,
no role for risk in admissions decisions.
The Fair Admissions assumption implies student characteristics influence the admission process
only through their effect on expected exam results. The same cut-off zj is used for open and direct
applicants - there is no discrimination against open/direct applicants (or any demographic group).
Discrimination would occur if colleges had a higher cut-off for open applicants than direct applicants
as this would imply that a direct applicant with the same expected exam result as an open applicant
is more likely to be admitted. Equal cut-offs for open and direct applicants are plausible because,
as noted above, colleges are not provided with any information about whether an applicant applied
directly or was an open applicant.
The solution is illustrated in Figure 1 for the case where applicant ability is fully observed by
admissions tutors: xi = Ai (ri = 0 for all i).13
13This model is a highly stylised model of admissions. For simplicity, it ignores a number of features of the admissions
process. Oxford admissions actually involve multiple stages. In the first stage colleges choose which applicants to
“short-list” and “deselect” and which applicants to “reserve”. Deselected applicants are rejected. Short-listed and
reserved applicants are given interviews at the college they were allocated. Shortlisted but unreserved applicants may
be reallocated to another college for interview. After first interviews colleges make some admissions decisions about
which applicants to accept. However, a small number of applicants are given second interviews. Second interviews
provide applicants not selected by their first college the chance to be accepted by another college (known as “pooling”).
It should also be noted that application procedures vary slightly between courses. Capturing all these points would
involve a more complex dynamic game played between colleges. Nevertheless, my empirical work relies only on the
Figure 1: Applicant Ability and College Admissions Decisions

[Figure: densities of direct applicant ability, $A^D_j \sim N(\mu^D_j, 1)$, and open applicant ability, $A^O_j \sim N(0, 1)$, plotted against ability, with the admission cut-off $z_j$ marked.]

Figure 1 shows how colleges would make admissions decisions if ability was fully observable (i.e. $x_i = A_i$). Direct applicant ability to college $j$ is distributed $A^D_j \sim N(\mu^D_j, 1)$. The graph is drawn such that $\mu^D_j = 0.5$. Open applicant ability to college $j$ is distributed $A^O_j \sim N(0, 1)$. $z_j$ is the cut-off (admissions threshold). All students with ability above the cut-off (the shaded area) are admitted. The distribution of ability for successful open applicants to college $j$ follows a truncated normal distribution, and similarly for successful direct applicants. A proportion $p^D_j$ of direct applicants and a proportion $p^O_j$ of open applicants are accepted.
With this admissions model in mind, the goal is to estimate the college effects cij. I consider three
different empirical models. First, as a simple baseline, I consider differences in mean exam results
between colleges in the spirit of the Norrington table. Second, I use a “selection on observables”
strategy that attempts to estimate college effects by conditioning on almost all the information
available to admissions tutors in the student’s application profile. Third, I take advantage of the
random assignment of open applicants and estimate the thresholds zj for each college. I then use
these threshold estimates together with the assumptions of the theoretical model to obtain estimates
of college effects. The next section explains these strategies in detail.
result that colleges use a cut-off rule and that the cut-off is equal across all applicants. This result would continue
to hold if, for example, (i) no new information about applicant ability was revealed at interview, (ii) colleges could
correctly predict the admissions decisions of other colleges and (iii) the reallocation of rejected applicants was known
in advance by the colleges.
4 Econometric Models
The econometric models in this section must acknowledge that some objects in the theory model are unobservable. First, exam results for applicants who do not attend Oxford are not observed. Second, even for the applicants who enrol at Oxford, at most one potential exam result per student is observable (the potential exam result from the college they actually attend). This is the “fundamental problem of causal inference” (Holland, 1986). With a slight abuse of notation I denote observed exam results of student $i$ at college $j$ as $Y_{ij}$ for $i = 1, \ldots, n$. Third, not all the information in an applicant’s application profile is observable. Decompose the information in application profiles into two parts: $x = (x_1, x_2)$, where $x_1$ and $x_2$ are $1 \times G$ and $1 \times (K - G)$ row vectors (with $K > G$ and remembering $x$ is $1 \times K$). “Hard” information $x_1$ is assumed observable to admissions tutors and researchers. “Soft” information $x_2$ is assumed observable to admissions tutors but not researchers.
The aim is to identify college effects given the available data. All three empirical strategies take the potential exam results function (1) specified in section 3 and assume observed exam results take the linear form:
$$Y_{ij} = \lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij} \tag{5}$$
where $\lambda_0$ is a constant, $\lambda_1$ is a $K \times 1$ column vector that maps ability onto potential exam results, and all elements of $\lambda_1$ are strictly positive. I can now decompose $A_i$ into $x_{1i}$, $x_{2i}$ and $r_i$ and rewrite (5) as:
$$Y_{ij} = \lambda_0 + \lambda_{11} x_{1i} + \lambda_{12} x_{2i} + c_{ij} + \lambda_1 r_i + e_{ij} \tag{6}$$
where $\lambda_{11}$ is a $G \times 1$ column vector of the first $G$ elements of $\lambda_1$ and $\lambda_{12}$ is a $(K - G) \times 1$ column vector of the last $K - G$ elements of $\lambda_1$. Student ability unobserved even by admissions tutors is captured by $r_i$.
4.1 Model 1 – Norrington Table
The first empirical strategy is to estimate college effects using a student-level fixed effects regression
with no control variables for observable or unobservable ability. That is, Model 1 estimates for
enrolled students:
$$Y_{ij} = \lambda_0 + \sum_{j=1}^{J-1} \bar\beta_j C_j + v_{ij} \qquad \forall\, i = 1, \ldots, n \tag{7}$$
where $v_{ij} = \sum_{j=1}^{J-1} (\beta_{ij} - \bar\beta_j) C_j + \lambda_1 A_i + e_{ij}$, $C_j$ is a dummy variable denoting enrolment at college $j$, $\beta_{ij}$ is a college fixed effect coefficient which may differ across $i$, and $\bar\beta_j = \frac{1}{n}\sum_{i=1}^{n} \beta_{ij}$ is the average over students of the college fixed effects. College $J$ is the reference college. Model 1 can be estimated by regressing exam results on a set of college dummy variables. The fixed effect coefficients $\bar\beta_j$ are the objects of interest; they give mean differences in exam results relative to the baseline college.
Model 1 is thus similar in spirit to the Norrington table.14
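As a concrete illustration, Model 1 amounts to an OLS regression of exam results on college dummies. The sketch below uses a hypothetical dataframe; in the actual analysis the outcome is an exam result standardised by course and year, as explained in section 5, and the college names here are illustrative.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level data: one row per exam taker, with the college attended and the
# average Prelims score already standardised within course and year.
df = pd.DataFrame({
    "prelims_std": [0.3, -1.1, 0.7, 0.2, -0.4, 1.5],
    "college":     ["Balliol", "Balliol", "Merton", "Merton", "Oriel", "Oriel"],
})

# Model 1: regress standardised exam results on college dummies only (one college omitted as reference).
model1 = smf.ols("prelims_std ~ C(college)", data=df).fit()
print(model1.params)  # fixed-effect coefficients = mean differences relative to the reference college
```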
The most important problem with Model 1 (and the Norrington table) is selection bias. Selection
bias prevents us from interpreting the fixed effect coefficient estimates as causal effects. Randomised
experiments are the gold standard for estimating causal effects and imagining a hypothetical random-
ised experiment helps to conceptualise the selection bias problem. Consider a two stage admissions
process. In stage 1 it is decided which students will attend Oxford. In stage 2 admitted students are
randomly assigned to colleges. In this ideal scenario, college assignment is independent of student
ability among the population of enrolled students, so the simple mean difference in observed exam
results gives an unbiased estimate of differences in college effects for students attending Oxford.
Unfortunately for researchers selection into colleges is non-random in ways that are correlated
with exam results. Students and admission tutors deliberately and systematically select who enrols.
At the application stage, students choose where to apply to. At the admissions stage, admission
tutors take decisions to accept some students and not others. There could also be selection at the
enrolment stage (in practice, very few students reject offers from Oxford colleges). The selection
bias problem makes it difficult to attribute student exam results to the effect of the college attended
separately from the effect of preexisting student ability.
Formally, since we have assumed λ11 = 0 and λ12 = 0, selection bias occurs if:
14Model 1 does differ from the Norrington table in some ways. For instance, the Norrington table does not take
account of differences across courses (getting a First in E&M may be easier or more difficult than getting a First in
Law). As I explain in section 5 below, I standardise exam results by course and year which mitigates this problem.
$$E\left[\sum_{j=1}^{J-1} (\beta_{ij} - \bar\beta_j) C_j + \lambda_1 A_i + e_{ij} \;\Big|\; C_j\right] \neq 0.$$
Model 1 embodies two types of non-random selection into colleges. First, selection on the het-
erogeneous college effect βij. This occurs if individuals differ in their potential exam results, holding
ability Ai constant, and if they choose a college (or colleges chooses them) in part on that basis.15
Selection on heterogeneous college effects captures the intuition that students and colleges are looking
for a good “match”. The economics of the problem suggest students will tend to apply to colleges that
are relatively good at boosting their exam results - a form of selection bias that bears similarities to
Roy’s model of occupational choice (Roy, 1951). Similarly colleges will tend to make offers to students
who tend to benefit more than average from the college’s teaching. Students enrolled at college j
may thus have higher expected exam results from attending college j than the average student. This
biases college fixed effect coefficients and it would not be appropriate to interpret such estimates
as causal effects for the average student enrolled at Oxford (though college effect estimates biased in
this way may still be of interest).
Second, selection on ability Ai. Determinants of exam results may be correlated with college
enrolment even if college effects are constant across students (βij = βj for all i). This occurs if
individuals choose colleges or colleges choose students in ways correlated with prior ability. Rational
applicants will choose to apply to the college that maximises their expected utility. Expected utility
is likely to depend on a number of factors including the perceived probability of receiving an offer
from each college, risk aversion, the value of their outside option if they did not attend Oxford
and preferences over college characteristics (including college effectiveness and other characteristics
contributing towards consumption benefits). Observable and unobservable ability are likely to impact
the college a student applies to. Furthermore college admissions decisions are based on student ability.
Positive selection seems likely, though not inevitable, with students of higher ability tending to go
to more effective colleges. In the presence of such selection, estimates of the college fixed effect
coefficients will be biased in favour of colleges with higher ability students.
15This assumes students and tutors have an idea of their own student/college-specific coefficient.
Selection bias causes three problems. First, as discussed, college effectiveness estimates are biased.
Second, the importance of variation in college effectiveness in determining exam results could be
exaggerated. The total effect of colleges on student exam results could be overstated because some of
the omitted ability will be included in the portion of the variance in student exam results explained by
college effects.16
Third, bias would lead to errors in supplementary analyses that aim to identify the
characteristics of effective colleges. Selection bias implies Model 1 is best used as a basis for comparison
with other models that control for observables and unobservables.
4.2 Model 2 – Selection on Observables
The second empirical strategy is to estimate college effects using a conditional OLS regression. Model
2 estimates for enrolled students:
$$Y_{ij} = \lambda_0 + \lambda_{11} x_{1i} + \sum_{j=1}^{J-1} \bar\beta_j C_j + v_{ij} \qquad \forall\, i = 1, \ldots, n \tag{8}$$
where now $v_{ij} = \sum_{j=1}^{J-1} (\beta_{ij} - \bar\beta_j) C_j + \lambda_{12} x_{2i} + \lambda_1 r_i + e_{ij}$. The difference between Model 1 and Model 2 is that now the observable parts of application profiles $x_{1i}$ are included in the regression. The objects of interest are the college fixed effect coefficients $\bar\beta_j$. In an ideal scenario, we could interpret estimated
coefficients as estimates of the average causal effect relative to the reference college for students
attending Oxford. However, such a causal interpretation requires three further assumptions. I start
with two that are relatively unproblematic.
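Before turning to these assumptions, a minimal sketch of how Model 2 could be estimated: the same college-dummy regression as in the Model 1 sketch, with observable application-profile variables added as controls. The column names below are purely illustrative; the actual control set is described in section 5.3.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: the Model 1 variables plus some observable controls standing in for x1.
df = pd.DataFrame({
    "prelims_std":     [0.3, -1.1, 0.7, 0.2, -0.4, 1.5, 0.0, -0.6],
    "college":         ["Balliol", "Balliol", "Merton", "Merton", "Oriel", "Oriel", "Oriel", "Merton"],
    "gcse_points":     [58, 52, 60, 55, 50, 61, 54, 57],
    "admissions_test": [71.0, 60.5, 75.0, 66.0, 58.0, 78.5, 63.0, 69.0],
    "school_type":     ["state", "state", "independent", "state", "state", "independent", "state", "state"],
})

# Model 2: college dummies plus the observable part of the application profile ("selection on observables").
model2 = smf.ols("prelims_std ~ C(college) + gcse_points + admissions_test + C(school_type)",
                 data=df).fit()
print(model2.params.filter(like="C(college)"))  # college effects conditional on the observables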
Assumption 3. “Interval scale metric”. The metric of Yij is interval scaled.
Assumption 3 says that the units of the exam result distribution are on an interval scale (Ballou,
2009; Reardon and Raudenbush, 2009). Interval scales are numeric scales in which we know not only
the order, but also the exact differences between the values. Here the assumption says equal sized
gains at all points on the exam result scale are valued equally. A college that produces two students
with scores of 65 is considered equally as effective as a college producing one with a 50 and another
16The effect of the bias on variation in college quality would depend on the direction of the bias. The text here
presumes the likely scenario with positive selection bias – i.e., where more effective colleges are assigned students with
higher expected exam results.
with 80. In comparing mean values of exam results, I implicitly treat exam results as interval-scaled
(the mean has no meaning in a non-interval-scaled metric). If exam results are not interval scaled
then the college effect results will depend on arbitrary scaling decisions.17
However, it is unclear
how to determine whether exam results are interval scaled because there is often no clear reference
metric for cognitive skill (Reardon and Raudenbush, 2009). At a practical level, the importance of
this assumption comes down to the sensitivity of college effects estimates and college rankings to
different transformations of exam results. Prior evidence on this point is reassuring: Papay (2011)
finds test scaling affects teacher rankings only minimally, with correlations between teacher effects
using raw and scaled scores exceeding 0.98.18
I proceed as if exam results are interval scaled and in
section 6 test the robustness of my results to various monotonic transformations of the exam results
distribution.
Assumption 4. “Common Support or Functional form”. Either (i) there is adequate observed data in
each college to estimate the distribution of potential exam results for students of all types (“Common
Support”) or (ii) the functional form of Model 2 correctly specifies potential exam results even for
types of students who are not present in a given college (“Functional Form”).
Either “Common Support” or “Functional Form” must hold for college effects to be identified.
The common support assumption is violated if not all colleges contain students with any given set
of characteristics. For instance, if not all colleges have students at all ability levels (or not sufficient
numbers at all levels to provide precise estimates of mean exam results at each ability level), then
the common support assumption will fail. In this case we have identification via functional form -
the model extrapolates from regions with data into regions without data by relying on the estimated
parameters of the specified functional form. If the functional form is also wrong, then regression
estimators will be sensitive to differences in the ability distributions for different colleges. However,
if the distributions of ability are similar across colleges, the precise functional form used will not
matter much for estimation (Imbens, 2004). The common support assumption has been questioned
17This assumption could be relaxed by adopting a non-parametric approach (and comparing, for example, quantiles
rather than means) but this would require a very large sample size for accurate estimation.
18If two colleges have similar students initially, but one produces students with better exam results, it will have a
higher measured college effect regardless of the scale chosen. Similarly, if they produce the same exam results, but one
began with weaker students, the ranking of the colleges will not depend on the scale.
for schools because student covariates differ significantly across schools. However, the distribution of
ability is likely to be much more similar across Oxford colleges, partly because of student reallocation
across colleges during the admission process.
We now come to the most significant problem in estimating college effects: how to deal with
selection bias. I make the following two-part “selection on observables” assumption, which allows
consistent estimation of college effects:
Assumption 5. “Selection on Observables”
(i) $E\left[\sum_{j=1}^{J-1} (\beta_{ij} - \bar{\beta}_j) C_j \,\middle|\, C_j, x_{1i}\right] = 0 \quad \forall\, i = 1, \dots, n$
(ii) $E\left[\lambda_{12} x_{2i} + \lambda r_i + e_{ij} \,\middle|\, C_j, x_{1i}\right] = 0 \quad \forall\, i = 1, \dots, n$
The selection on observables assumption follows work by Barnow et al. (1981), who observed in a
regression setting that unbiasedness is attainable only when the variables driving selection are
known, quantified and included in x1.19
Together parts (i) and (ii) imply that potential exam results
are independent of college assignment, given x1.
Part (i) requires the heterogeneous part of college effects to be mean independent of college
enrolment conditional on x1i and Cj. This assumption is similar to, but slightly weaker than, college
effects being the same for every student. It implies there is no interaction of college effects with
student characteristics in x1i. As noted above, if individuals differ in their college effects, and they
know this, they ought to act on it, even conditional on ability. Thus this assumption relies on
students and tutors being unaware of college effects.20
In the empirical work, I test this assumption
by allowing the college effect coefficients to vary with some elements of x1i.
Part (ii) says the observable control variables x1i are sufficiently rich that the remaining variation
in college enrolment that serves to identify college effects is uncorrelated with the error term in
equation (8). This requires two things. First, the observable control variables in x1i must capture,
either directly or as proxies, all the factors that affect both the college enrolment and exam results.
Second, there must exist variables not included in the model that vary college enrolment in ways
unrelated to the unobserved component of exam results (i.e. instrumental variables must exist, even
19Non-parametric versions of this assumption are variously known as the “conditional independence assumption”
(Lechner, 2001) and “unconfoundedness” (Rosenbaum and Rubin, 1983). These are also closely related to “strongly
ignorable assignment” (Rosenbaum and Rubin, 1983).
20If college effects were obvious to everyone then there would be no need for this thesis!
though we do not observe them, as they produce the conditional variation in college enrolment used
implicitly in the estimation). Intuitively, the aim is to compare two otherwise identical students but
who went to different colleges for a reason completely unrelated to their exam results. Practically, I
would like to measure and condition on any characteristic whose influence on exam results might be
confounded with that of college enrolment due to non-random sorting into different colleges.
I am aware that the selection on observables assumption is somewhat heroic. Unobservable ability
could cause it to be violated. For instance, students with very high unobservable ability x2i (including
excellent school references and personal statements) may be close to certain of receiving an offer from
whichever college they apply to and thus may tend to apply to colleges with larger college effects.
Alternatively, more “academically motivated” students may be more likely to apply to colleges
that improve exam results than to colleges that provide large consumption benefits. If students do select
into colleges based on unobservable ability correlated with exam results conditional on observed
characteristics then selection bias results.
Nevertheless, the selection on observables assumption can be justified in a number of ways. First,
the extensive dataset allows me to condition on almost all information available to college admission
tutors when they are selecting students as well as some information not seen by admissions tutors.
Furthermore, there is evidence that the information available to admissions tutors but unavailable
to researchers (the personal statement and school reference) is relatively unimportant in admission
decisions. In the personal statement, students describe the ambitions, skills and experience that
make them suitable for the course (e.g. previous work experience, books students have read and
essay competitions they have entered). However, Oxford admissions are strictly academic so this
only impacts admissions decisions if it is linked to academic potential. The absence of the school
reference is also perhaps of limited significance because, as noted by Bhattacharya et al. (2014),
school references tend to be somewhat generic and within-school ranks are typically unavailable
to admission tutors. This is supported by survey evidence. Bhattacharya et al. (2014) conduct an
anonymised online survey of PPE admissions tutors in Oxford asking how much weight they attach during
admissions to covariates, with “1” representing no weight and “5” denoting maximum weight. The
results, based on 52 responses, show that the personal statement and school reference were given
the lowest weights.21
Second, two students with the same values for observed characteristics may go to different colleges
without invalidating the selection on observables assumption if the difference in their colleges is driven
by differences in unobserved characteristics that are themselves unrelated to exam results. There are
plenty of potential sources of exogenous variation in college allocations conditional on observables.
For instance, students might care about factors other than the ability of colleges to boost exam
results. Observation indicates that many applicants explicitly choose among colleges, at least at
the margin, for reasons unlikely to be strongly related to exam results. Application decisions may
reflect preferences over college location, architecture, accommodation, facilities and size. These
preferences may not be strongly linked to ability to perform well in exams. Indeed selection based on
preferences over college characteristics is actively encouraged by the University - the Oxford website
recommends students choose colleges based on these non-academic considerations. Alternatively
applicants might be incapable of discerning the size of college effects. While this would not normally
be a comforting thought, it aids the selection on observables assumption. Evidence from university
admissions supports this point. Scott-Clayton (2012) reviews the literature on university admissions
and concludes applicants and parents often know very little about the likely costs and benefits of
university. For instance, small behavioural economics tricks such as whether or not a scholarship has
a formal name and a tiny change in the cost of sending standardised test scores to universities have
been shown to have non-trivial effects on university applications inconsistent with rational choice
(Avery and Hoxby, 2004; Pallais, 2013). The school choice literature also provides evidence that
students and parents do not select schools according to expectations about future test scores - the
typical voucher program does nothing to improve test scores (Epple et al., 2015). Such exogenous
variation is perhaps even more likely in the context of Oxford colleges because Oxford deemphasises
the importance of college choice, stressing that all colleges are similar academically and that the primary
factor when choosing a college should be consumption benefits, not exam results.
Several final points about Model 2 should be noted. First, since I have multiple cohorts
of students, I pool students across cohorts for each college. Evaluating colleges over multiple years
21A-levels appeared to be the most important criterion, followed by the admissions tests and interview scores and
then GCSE performance. The choice of subjects at A-level was given a medium weight.
reduces the selection bias problem (Koedel and Betts, 2011), increases students per college thus
reducing average standard errors (McCaffrey et al., 2009) and increases the predictive value of past
college effects over future college effects (Goldhaber and Hansen, 2013). In pooling across cohorts,
I assume that college effects are fixed over time and thus place equal weight on exam results in all
years.22
Second, I allow for heteroskedastic measurement error in exam results by estimating
heteroskedasticity-robust standard errors. Exam results measure latent achievement with error because of (i)
the limited number of questions on exams, (ii) the imperfect information provided by each question,
(iii) maximum and minimum marks, (iv) subjective marking of exams and (v) individual issues such
as exam anxiety or on-the-day illness (Boyd et al., 2013). Numerous studies find test score measurement
error is larger at the extremes of the distribution (Koedel et al., 2012). The intuition is that
exams are well-designed to assess student learning for “targeted” students (near the centre of the
distribution), but not for students whose level of knowledge is not well-aligned with the content
of the exam (in the tails of the distribution). Ignoring heteroskedastic measurement error in the
dependent variable would lead to biased inference. In addition, ignoring measurement error in the
control variables would bias college effect estimates. However, I control for multiple prior test scores
(A-levels, GCSEs, IB, multiple admissions tests and interview scores) which has been shown to help
mitigate the problem (Lockwood and McCaffrey, 2014).
Third, I treat college effects as fixed effects rather than random effects. Whilst random effects
models are more efficient than fixed effects models, economists have conventionally avoided random
effect approaches (Clarke et al., 2010). This is because their use comes at the cost of an important
additional assumption - that college effectiveness is uncorrelated with the student characteristics
that predict exam results. This “random effects assumption” would fail, for example, if more effective
colleges attracted high ability students measured by prior test scores. Random effect estimators
would be inconsistent for fixed college sizes as the number of colleges grows.23
By contrast, fixed
22As the number of cohorts grows, “drift” in college performance may put downward pressure on the predictive
power of older college effect estimates. Thus if predicting future college effects is the main aim (relevant for prospective
applicants to Oxford) then it may be best to down-weight older data (Chetty et al., 2013a). However, my main aim is
to gauge the importance of college effectiveness and I thus do not account for drift.
23The bias (technically, the inconsistency) disappears as the number of students per college increases - because the
random effect estimates converge to fixed effect estimates. However, the bias still can be important in finite samples.
effect estimators will still be consistent for fixed college sizes as the number of colleges grows. Guarino
et al. (2015) find that under non-random assignment, random effect estimates can suffer from severe
bias and underestimate the magnitudes of college effects. They conclude fixed effect estimators should
be preferred in this situation and I follow their advice and specify college effects as fixed effects. In
section 6, I perform Hausman tests (robust to heteroskedasticity) and the results broadly support
this choice.
Fourth, I do not employ shrinkage to my college effect estimates. Estimates can be noisy when
there are only a small number of students per college. This means colleges with very few students
could be more likely to end up in the extremes of the distribution (Kane and Staiger, 2002). Shrinkage
is often used as a way to make imprecise estimates more reliable by shrinking them toward the
average estimated college effect in the sample (a Bayesian prior). As the degree of shrinkage depends
on the number of students per college, estimates for colleges with fewer students are more affected,
potentially mitigating the misclassification of these colleges. The cost of shrinkage is that the
weight on the prior introduces a bias in estimates of college effects. Shrinkage can be applied to
both random and fixed effects models (so shrinkage is not a reason to favour random effect models
as is sometimes suggested). Despite the promise of shrinkage, two studies use simulations to show
shrinkage does not itself substantially boost performance (Guarino et al., 2015; Herrmann et al.,
2013). Fixed effect models without shrinkage tend to perform well in simulations and should be the
preferred estimator when there is a possibility of non-random assignment.
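To make the preceding choices concrete, the following is a minimal sketch of how Model 2 could be estimated: pooled cohorts, college fixed effects entered as dummy variables, and heteroskedasticity-robust (HC1) standard errors. The file name, DataFrame and column names are hypothetical placeholders rather than the variables in the SDMA data, and the control set is abbreviated.

```python
# Minimal sketch of Model 2 (selection on observables), not the exact
# specification used in the thesis. College effects are the coefficients on
# the college dummies; standard errors are heteroskedasticity-robust (HC1).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ppe_students.csv")  # hypothetical file of enrolled students

formula = (
    "prelims_std ~ C(college) + C(cohort) + C(gcse_band) + C(alevel_band)"
    " + tsa_critical_thinking + tsa_problem_solving + female + C(school_type)"
)
model2 = smf.ols(formula, data=df).fit(cov_type="HC1")

# Estimated college fixed effects, relative to the omitted reference college;
# these correspond to the coefficients on the college dummies in equation (8).
college_effects = model2.params.filter(like="C(college)")
print(college_effects.sort_values())
```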
Even though I avoid having to make the random effects assumption, there is still a danger that
the selection on observables assumption is violated. As a result I now move on to Model 3 which can
more effectively deal with unobservables.
4.3 Model 3 – Selection on Observables and Unobservables
In this subsection I use a novel procedure to estimate college effects and account for both selection
on observables and unobservables. To do this, I take the theory model of section 3 as a starting
point and assume the ability Ai is a scalar (with multiple sources of ability, Ai can be interpreted
as a composite scalar index, i.e. a weighted average). When ability Ai is a scalar, I can estimate the
admission thresholds zj for each college. Admission thresholds can be consistently estimated because
open applicants are randomly allocated to colleges. I then use these threshold estimates and the
linear functional form assumption (5) to obtain estimates for Ai and λ1. Colleges with high admissions
thresholds tend to have high ability entrants. This allows me to obtain college effect estimates. I
now explain this procedure in more detail.
First, remember in the theory model of section 3, the ability of open applicants to Oxford was
distributed N(0, 1). The key to identification is that open applicants are randomly allocated, by
the Undergraduate Admissions Office, to colleges. Intuitively, the random allocated means that all
colleges receive open applicants with equal ability on average. If a college accepts a large proportion
of open applicants, this suggests that their cut-off zj is low and their entrants have relatively low
ability. On the other hand, if a college accepts a small proportion of open applicants then we expect
their cut-off to be high and their entrants to be of relatively high ability. Formally, the ability of open
applicants allocated to college j is also distributed N(0,1). This means we can consistently estimate
the true cut-off zj at college j using the estimator:
$$\hat{z}_j = \Phi^{-1}\left(1 - p^O_j\right) \qquad (9)$$

where $\Phi$ is the standard normal cdf and $p^O_j$ is the proportion of open applicants allocated to
college j who are offered a place at college j ($p^O_j$ is the area in the upper tail of the standard normal
distribution). When $p^O_j$ is large, $\hat{z}_j$ is small and vice versa. In an infinite sample we could determine
the cut-off value $z_j$ exactly. However, colleges are assigned a finite number of open applicants, so we
estimate $z_j$ using $\hat{z}_j$. As a simple example, consider the case where a college accepted 5% of the
open applicants they were allocated by the Undergraduate Admissions Office. Hence $p^O_j = 0.05$ and
the admissions threshold is estimated to be $\hat{z}_j = \Phi^{-1}(0.95) = 1.645$. Since college j uses the same admissions
threshold for both open and direct applicants, we expect applicants with ability $A_i \geq 1.645$ to be
accepted and applicants with ability $A_i < 1.645$ to be rejected.
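As a quick arithmetic check on equation (9), the following sketch computes the estimated cut-off from an offer proportion; the proportions used are illustrative numbers rather than values from the data.

```python
# Cut-off estimator from equation (9): the estimated admission threshold is
# the standard normal quantile corresponding to one minus the share of a
# college's randomly allocated open applicants who receive offers.
from scipy.stats import norm

def estimate_cutoff(p_open_offer):
    """z_hat_j = Phi^{-1}(1 - p_j^O)."""
    return norm.ppf(1 - p_open_offer)

print(estimate_cutoff(0.05))  # 1.645, as in the example above
print(estimate_cutoff(0.50))  # 0.0: half of the open applicants receive offers
```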
Second, note again that the ability of open applicants sent to college j is distributed N(0, 1), the ability
of direct applicants to college j is distributed $N(\mu^D_j, 1)$ and each college makes offers to students with
expected exam results above their cut-off. Together these three statements imply the distribution of
ability for successful open applicants to college j follows a truncated normal distribution and similarly
for successful direct applicants. The truncations have the same cut-off point zj but the mean of the
truncated normal distributions may differ. This is shown in Figure 1.
Now consider an equation analogous to (9) but this time for direct applicants: $\hat{z}^D_j = \Phi^{-1}(1 - p^D_j)$,
where $p^D_j$ is the proportion of direct applicants assigned to college j who are offered a place at
college j. I refer to $z^D_j$ as the standardised cut-off for the ability of direct applicants.
Together (i) the true cut-off $z_j$, (ii) the standardised cut-off for the ability of direct applicants $z^D_j$
and (iii) the assumption that the standard deviation of ability for direct applicants is equal to the
standard deviation of the ability of open applicants, $\sigma^D = \sigma^O = 1$, give the mean ability of direct
applicants to college j, $\mu^D_j$, through the equation:

$$z^D_j = \frac{z_j - \mu^D_j}{\sigma^D} \iff \mu^D_j = z_j - \sigma^D z^D_j = z_j - z^D_j$$

Since $z_j$ and $z^D_j$ are unobservable, I use the estimator:

$$\hat{\mu}^D_j = \hat{z}_j - \hat{z}^D_j \qquad (10)$$
Using the standard result for the mean of a truncated normal distribution gives an estimator for the
average ability of open and direct applicants given offers by college j:

$$E(A^O_j \mid A^O_j > z_j) = \frac{\phi(z_j)}{1 - \Phi(z_j)}; \qquad E(A^D_j \mid A^D_j > z_j) = \mu^D_j + \frac{\phi(z^D_j)}{1 - \Phi(z^D_j)} \qquad (11)$$

where $\phi$ is the standard normal pdf and $\phi(\cdot)/(1 - \Phi(\cdot))$ is the hazard function for the normal distribution.
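A sketch of how equations (10) and (11) could be evaluated in practice follows; the offer proportions passed in are hypothetical and the helper names are placeholders.

```python
# Sketch of equations (10) and (11): recover the mean ability of direct
# applicants and the truncated-normal mean ability of admitted open and
# direct applicants from the two offer proportions. Inputs are hypothetical.
from scipy.stats import norm

def mills(z):
    """Hazard function phi(z) / (1 - Phi(z)) of the standard normal."""
    return norm.pdf(z) / norm.sf(z)  # norm.sf(z) = 1 - Phi(z)

def admitted_ability(p_open_offer, p_direct_offer):
    z_hat = norm.ppf(1 - p_open_offer)       # equation (9)
    z_hat_d = norm.ppf(1 - p_direct_offer)   # standardised direct cut-off
    mu_d_hat = z_hat - z_hat_d               # equation (10)
    mean_open = mills(z_hat)                 # E(A^O | A^O > z_j)
    mean_direct = mu_d_hat + mills(z_hat_d)  # E(A^D | A^D > z_j)
    return mean_open, mean_direct

print(admitted_ability(p_open_offer=0.05, p_direct_offer=0.20))
```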
Equation (11) gives estimates of average student ability for students enrolled at each college (which
is the average of the upper tail in the normal distributions in Figure 1). Next, use the linear functional
form assumption for exam results given in equation (5) to estimate the parameters λ0 and λ1. By
definition, average realised exam results at college j for enrolled open applicants and enrolled direct
applicants are given by:
$$Y^O_j = \frac{1}{O^*_j} \sum_{i \in E^O_j} \left(\lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij}\right); \qquad Y^D_j = \frac{1}{D^*_j} \sum_{i \in E^D_j} \left(\lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij}\right)$$

where $E^O_j$ is the set of open applicants who were allocated to college j and who enrolled at college j,
$E^D_j$ is the set of direct applicants to college j who enrolled at college j, $O^*_j$ is the number of
open applicants who were allocated to college j and who enrolled at college j, $D^*_j$ is the number of
direct applicants to college j who enrolled at college j, $Y^O_j$ is the average realised exam result of
open applicants enrolled at college j and $Y^D_j$ is the average realised exam result of direct applicants
enrolled at college j. Now assume college effects are constant across students so $c_{ij} = c_j$ for all i.
Taking differences causes the college effect cj and the constant term λ0 to drop out:
$$Y^O_j - Y^D_j = \lambda_1 \left( \frac{1}{O^*_j} \sum_{i \in E^O_j} A_i - \frac{1}{D^*_j} \sum_{i \in E^D_j} A_i \right) + \frac{1}{O^*_j} \sum_{i \in E^O_j} e_{ij} - \frac{1}{D^*_j} \sum_{i \in E^D_j} e_{ij}.$$
$E(A^O_j \mid A^O_j > z_j) - E(A^D_j \mid A^D_j > z_j)$ can be used as an estimator for
$\frac{1}{O^*_j} \sum_{i \in E^O_j} A_i - \frac{1}{D^*_j} \sum_{i \in E^D_j} A_i$ for
each college j. Thus we can estimate $\lambda_1$ using an OLS regression:
$$Y^O_j - Y^D_j = \lambda_1 \left[ E(A^O_j \mid A^O_j > z_j) - E(A^D_j \mid A^D_j > z_j) \right] + \frac{1}{O^*_j} \sum_{i \in E^O_j} e_{ij} - \frac{1}{D^*_j} \sum_{i \in E^D_j} e_{ij} \qquad (12)$$
with J observations, one for each college. This gives the OLS estimate of λ1. Note there is no constant in
this regression because λ0 has been differenced away. Unfortunately, heteroskedastic measurement
error in the explanatory variable will cause the OLS estimate of λ1 to be biased – the estimates
of mean ability of enrolled students contain estimation error and this estimation error differs across
observations (it is likely to be larger for colleges with fewer open applicants as this means that the
cut-off is less accurately estimated). Whilst methods exist to correct for heteroskedastic measurement
error in simple cases (Sullivan, 2001), correcting λ1 estimates is more complex and, as far as I am
aware, there is no appropriate method to correct for this.
Once we have $\lambda_1$, we can back out $c^O_j$ and $c^D_j$, which are estimates of college effects (inclusive of
the constant term $\lambda_0$) for open applicants and direct applicants:

$$c^O_j = Y^O_j - \lambda_1 E(A^O_j \mid A^O_j > z_j); \qquad c^D_j = Y^D_j - \lambda_1 E(A^D_j \mid A^D_j > z_j)$$
Since we have assumed college effects are constant across students, $c^O_j$ and $c^D_j$ are also estimates
of the true college effect $c_j$. A single college effect estimate can be obtained by taking a weighted
average of $c^O_j$ and $c^D_j$, where the weights correspond to the number of students who took Prelims
exams:

$$c_j = \frac{O^*_j}{O^*_j + D^*_j}\, c^O_j + \frac{D^*_j}{O^*_j + D^*_j}\, c^D_j. \qquad (13)$$
Finally, to make the results of Model 3 directly comparable to those from Model 1 and Model 2, I
present college effects relative to those of the best performing college, college J:
$$\bar{\beta}_j = c_j - c_J. \qquad (14)$$
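The following is a minimal sketch of these final steps under the stated assumptions (constant college effects, no correction for measurement error in the estimated mean abilities); the college-level arrays are hypothetical inputs with one entry per college.

```python
# Sketch of equations (12)-(14): estimate lambda_1 from the college-level
# regression with no constant, back out college effects, take the weighted
# average and express effects relative to the best-performing college.
import numpy as np

def model3_college_effects(y_open, y_direct, a_open, a_direct, n_open, n_direct):
    dy = y_open - y_direct                    # Y^O_j - Y^D_j
    da = a_open - a_direct                    # E(A^O|.) - E(A^D|.)
    lam1 = np.sum(da * dy) / np.sum(da ** 2)  # OLS slope through the origin, eq. (12)
    c_open = y_open - lam1 * a_open           # college effects from open applicants
    c_direct = y_direct - lam1 * a_direct     # college effects from direct applicants
    c = (n_open * c_open + n_direct * c_direct) / (n_open + n_direct)  # eq. (13)
    return lam1, c - c.max()                  # eq. (14): relative to best college

lam1, beta = model3_college_effects(
    y_open=np.array([-0.2, 0.1, 0.0]), y_direct=np.array([0.1, 0.3, 0.1]),
    a_open=np.array([1.8, 2.1, 1.9]), a_direct=np.array([2.2, 2.5, 2.1]),
    n_open=np.array([10, 8, 12]), n_direct=np.array([90, 110, 70]),
)
print(lam1, beta)
```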
Implementing Model 3 in practice requires a number of decisions to be taken with regard to the
data. First, I decide to pool across years as done in Model 1 and Model 2. This increases precision
by increasing the number of applicants (particularly open applicants) at each college. Pooling
applications across years is not ideal because it does not reflect how admissions are carried out in
practice. However, open applicants will still be randomly allocated to colleges, and if the distribution
of applicant ability is the same each year then cut-offs will be approximately the same across years.
Second, I only compare the subset of colleges with at least 50 open applicants (again to increase
precision). Third, whereas for Model 1 and Model 2, all students with Prelims scores are included
in the analysis, for Model 3, applicants not selected by the first college they were allocated to (these
students were “Rejected by College 1”) are not used in the analysis because their expected ability is
unknown. This means that Model 3 nests Model 1 as a special case where λ1 = 0 and where Model
1 is estimated on a reduced sample only containing applicants selected by the first college they were
allocated to.
5 Data
5.1 Why use Four Datasets?
I use four different datasets due to a trade-off between sample size and the availability of key
covariates. The largest dataset consists of anonymised data on all Oxford applicants in the years 2009-2013.
Information on these students was combined from two different sources. First, application records
were obtained from the Student Data Management and Analysis (SDMA) team at Oxford University.
Table 1: Information Available in each Dataset
PPE E&M Law All Subjects
Personal Characteristics Y Y Y Y
Contextual Information Y Y Y Y
Previous School Type Y Y Y Y
GCSEs, A-levels and IB Y Y Y Y
Breakdown of A-levels by Subject N Y Y N
Admissions Test Scores Y Y Y N
Interview Scores N N Y N
School Reference N N N N
Personal Statement N N N N
Individual Paper Marks Y Y Y N
Second, for enrolled students, the application records were then linked to student records (also held
by the SDMA) through unique student identifiers. Exam results are contained in student records.
I refer to this large dataset as the “All Subjects” dataset because it covers all courses taught at
Oxford. Its obvious advantage is the large number of students. However focusing exclusively on
this large dataset is limiting for a number of reasons. First, given Model 2 relies on a selection on
observables assumption, it is important to condition on all relevant covariates used in the admissions
process. Time, resource and data availability constraints prevented the SDMA from supplying
interview scores, admissions test scores and specific A-level subjects taken for all students taking every
Oxford course. For courses where this information is missing, the selection on observables assumption
is much less credible. Second, observable ability controls included on the RHS may have a different
impact on exam results depending on the courses taken, e.g. the effect of an A-level in economics is
probably different if a student studies E&M rather than Law at Oxford. Third, college effects may
differ across courses, given that the quality of teaching may vary within colleges. Fourth, admissions
procedures are carried out at a course (department) level, so the theoretical model in section 3 implies
open applicants are only randomly allocated to colleges within subjects.
For these reasons I also analyse three other datasets containing information on PPE, E&M and
Law students respectively. I choose these courses because very detailed admissions data is available
for each of them and because they all receive large numbers of applications. The information available
in these datasets is summarised in Table 1.
5.2 Choice of Outcome Variable
Preliminary Examinations (“Prelims”) are the exams taken by students at the end of their first year
at Oxford. In PPE, E&M and Law students each take three first year papers, all marked out of 100.
Each script is marked blindly (so the marking tutors do not know which college the student comes
from). The main outcome variable I use is a student’s average Prelims score standardised within
cohort (and course for the All Subjects dataset). For instance, to construct my outcome variable for
PPE, I first take the average score across the three first year papers and then standardise the
result so the mean for each cohort is 0 and the standard deviation for each cohort is 1.
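A minimal sketch of this construction follows, using hypothetical column names and made-up example marks (for the All Subjects dataset the grouping would also include the course).

```python
# Sketch of the outcome variable: average the three Prelims paper marks and
# standardise within cohort so each cohort has mean 0 and standard deviation 1.
import pandas as pd

# Hypothetical example data: three papers per student, two cohorts.
df = pd.DataFrame({
    "cohort": [2012, 2012, 2012, 2013, 2013, 2013],
    "paper1": [62, 70, 55, 68, 61, 74],
    "paper2": [65, 72, 58, 66, 59, 71],
    "paper3": [60, 69, 57, 70, 63, 73],
})

df["prelims_avg"] = df[["paper1", "paper2", "paper3"]].mean(axis=1)

# For the All Subjects dataset the grouping keys would be ["cohort", "course"].
df["prelims_std"] = df.groupby("cohort")["prelims_avg"].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(df)
```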
Standardising exam results by cohort is important because the distribution of exam scores varies
from year to year (partly due to variation in exam difficulty) even within the same course. I
also standardise by course for the All Subjects dataset because there is significant variation between
subjects in Prelims averages and this variation is mostly unrelated to college effectiveness.24
Standardising Prelims averages across subjects avoids penalising colleges that teach courses with lower
Prelims averages.25
Using the Prelims average is preferable to estimating separate models for each Prelims paper taken
for two reasons. First, it increases precision. Second, college effectiveness is very
likely to “spill over” across papers.
Research has demonstrated that better university exam performance is closely related to other
desirable outcomes, which supports the exam-based measurement of college effectiveness (Smith et al.,
2000; Walker and Zhu, 2013; Feng and Graetz, 2015; Naylor et al., 2015). One minor problem is that
interpreting Prelims scores is complicated by the fact that a small number of students retake papers.
Students only retake papers if they fail first time around. In this case the data I have corresponds to the
highest mark they obtained, which may be from the first or second attempt. It would have been preferable
if I had the Prelims scores from first attempts. However retakes are rare so this should not be a
significant problem.
An obvious alternative outcome variable is Final Examination (“Finals”) results such as average
24The variation may reflect differences between subjects in the nature of the subject matter (arguably, natural
science exams are conducive to more extreme patterns of results) and in conventions within subjects of what is of
sufficient merit to be awarded a given mark.
25I don’t standardise marks for each individual paper because students and colleges may optimally concentrate their
teaching efforts on the Prelims papers that have a higher variance of marks.
score across Finals papers. However, Prelims results are preferred for a number of reasons. First,
attrition is greater with Finals (because more students drop out over time) and this implies more
missing data which can bias college effect estimates. Second, in Finals not every student takes the
same exams because of different option choices. This is problematic because there are differences in
score distributions across different options. Third, using Finals results involves excluding students
still in their first or second years at Oxford, substantially reducing the power of the analysis.26
However, when interpreting the results one should keep in mind that Prelims are less important to
students than Finals (they are “lower stakes” exams) and Prelims may over- or underestimate Finals
college effects (underestimate because they allow less time for any college effect to become evident and
because college effects may be cumulative; overestimate because teaching is more college-focused in
the first year than in later years).
For these reasons I focus on standardised average Prelims scores in the main analysis but also
briefly consider the consequences of using individual first year paper scores and average Finals score
as outcome variables.
5.3 Choice of Control Variables
The control variables included in the analysis are summarised in Table 2.
Most of the controls will be familiar to a UK audience. Less familiar may be contextual information27,
which is provided to admissions tutors in the form of “flags”, identifying disadvantaged
students. Admissions tutors are advised to use the contextual information to suggest extra candidates
to interview. The International Baccalaureate (IB) is an alternative to A-levels where students
complete assessments in six subjects. Each student gets a mark out of 45. The Thinking Skills
Assessment (TSA) is the admissions test for PPE and E&M applicants. It includes a 90-minute
multiple-choice test, marked by the Admissions Testing Service, and the marks are made available
26Using final degree class, as in the Norrington table, has the additional problem that it is discrete and thus
discards much useful information concerning student achievement. This is particularly a problem at Oxford, where
over 50% of students obtain a 2:1.
27It is sometimes argued that contextual information (and some personal characteristics such as gender and race)
should not be controlled for. This is because controlling for contextual information sets lower expectations for some
demographics. However, not taking these differences into account may penalise colleges that serve these students for
reasons that may be at least partly out of their control.
Table 2: Description of Control Variables
Personal Characteristics
Gender Dummy variable indicating whether the student is male or female
Ethnicity / Overseas status Dummy variables indicating: “UK White”; “UK Black”; “UK Asian”;
“UK Other ethnic group”; “UK Information refused”; “EU” and;
“Non-EU”
Contextual Information
Pre-16 School Flag Performance of applicant’s school at GCSE is below national average
Post-16 School Flag Performance of applicant’s school at A-level is below national average
Care Flag Applicant has been in-care for more than three months
Polar Flag Applicant’s postcode is in POLAR quintiles 1 and 2 - indicating lowest
rate of young people’s participation in Higher Education
Acorn Flag Applicant’s postcode is in Acorn groups 4 or 5 meaning residents are
typically categorised as ‘financially stretched’ or living in ‘urban
adversity’
Prior Educational Qualifications
Previous school type Dummy variables for State, Independent and other school type
GCSEs Dummy variables for proportion of A*s obtained at GCSE (if more
than 5 GCSEs). Categories are: “Band 1: 100%”; “Band 2: 75-99%”;
“Band 3: 50-74%”; “Band 4: < 50%” and; “Less than 5 GCSEs”
A-levels Dummy variables for A-level bands. The categories are: “Did not take
A-levels”, “Applied to start prior to 2010”, “Applied to start in 2010 or
later and no A*”, “1 A*”, “2 A*”, “3 A*” and “4 or more A*”
A-Level subjects Dummy variables indicating whether students had taken A-levels in
certain subjects. Subjects for E&M are Economics, Maths and Further
Maths. Subjects for Law are History and Law
A-Level subject grades Dummy variables indicating the grade achieved in included subjects
IB Dummy variables for IB bands. “Band 1: 45 (full marks)”; “Band 2:
{43, 44}”; “Band 3: {41, 42}”; “Band 4: ≤ 40” and; “Did not take IB”
Admissions Tests and Interviews
TSA Variables for TSA critical thinking score and TSA problem solving score
LNAT Variables for LNAT multiple choice score and LNAT essay score
Interview Score An interview score is given to each candidate out of 10.
to colleges. The Law National Admissions Test (LNAT) is the admissions test for Law applicants.
The LNAT includes a multiple-choice section (machine marked out of 42) and an essay section
(individually marked by colleges). Interviews are usually face-to-face with admissions tutors and most
candidates have 2 interviews. Law students are given an interview score out of 10.
A quick note should also be made about using A-level grades, which is complicated by two
factors. First, a new A* grade was introduced in 2010. I create a separate A-level dummy variable
for students who applied before the A* grade was introduced. Second, most applicants are only
halfway through their A-levels when they apply to Oxford. In this case admissions tutors observe
predicted grades which are not available in the data. This should not be too problematic because
rational admissions tutors will make correct inferences on average about the actual A-level grades
an applicant will achieve. Actual A-levels achieved are also probably a better measure of ability than
predicted grades.
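For illustration only, a sketch of how the A-level banding described above might be coded; the column names and the exact treatment of pre-2010 applicants are simplified assumptions rather than the actual coding used.

```python
# Sketch of constructing the A-level band dummies: a separate band for
# applicants from before the A* grade was introduced in 2010, then bands
# by number of A* grades. Column names are hypothetical placeholders.
import pandas as pd

def alevel_band(row):
    if not row["took_alevels"]:
        return "Did not take A-levels"
    if row["application_year"] < 2010:
        return "Applied to start prior to 2010"  # before the A* grade existed
    n = row["n_a_star"]
    if n >= 4:
        return "4 or more A*"
    return {0: "No A*", 1: "1 A*", 2: "2 A*", 3: "3 A*"}[n]

df = pd.DataFrame({
    "took_alevels": [True, True, False, True],
    "application_year": [2009, 2012, 2013, 2013],
    "n_a_star": [0, 2, 0, 5],
})
df["alevel_band"] = df.apply(alevel_band, axis=1)
alevel_dummies = pd.get_dummies(df["alevel_band"], prefix="alevel")
print(alevel_dummies)
```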
5.4 Sample Selection
Sample selection involves choosing both a sample of applicants (only relevant for estimating
cut-offs in Model 3) and a sample of enrolled students (relevant for all three models). Fortunately, the
datasets contain only a very small amount of missing data. The missing data comes in two forms.
First, missing values of control variables for individuals who otherwise provide relatively complete
data. For example, a small number of students (12 in PPE, 39 in Law and 0 in E&M) are missing
admissions test scores, perhaps because they were ill on the day of the test or there were no available
test centres in their home countries (the vast majority are international students with many from
outside the EU). Imputing values for these missing covariates is possible. However, the advantages
of multiple imputation are minimal at best when missing data is less than 5% of the sample (Manly
and Wells, 2015). Multiple imputation also makes interpreting results more difficult (R2 can’t be
reported, for example). I thus drop these observations (listwise deletion), which is standard practice in
the value-added literature. This choice should be taken into account when interpreting the resulting
college effect estimates.
Second, and more significantly, some students who matriculated at Oxford have missing Prelims
Table 3: Sample Selection: PPE
Applicant Sample (2009-2014)(a)         9867
Exclusions:
  Not Enrolled at Oxford               -8404
  Not in Cohorts 2009-14(b)               -7
  Withdrew from Oxford                   -51
  Exclude Extreme Outliers(c)             -2
  No Admissions Test Scores(d)           -12
Final Sample                            1391
Notes: (a) The applicant sample excludes 53 students who have student records but no application record. This is likely to be because they applied pre-2009, before the dataset begins. (b) These students were offered deferred entry. (c) 2 students had Economics marks recorded as 0 or 1. The next lowest mark is 30. It is unclear whether these are typographical errors or true marks. (d) 11 of the 12 students with missing admissions test scores were international students, with 10 from non-EU countries.

Table 4: Sample Selection: All Subjects
Applicant Sample (2009-2013)(a)        75033
Exclusions:
  Not Enrolled at Oxford              -61153
  Not in Cohorts 2009-2013(b)            -76
  No Prelims Average(c)                 -376
  St Stephen’s College                    -1
Final Sample                           14427
Notes: (a) Excludes all Medicine and Physiological Science applicants as they are not given “marks” in Prelims. Also excludes Classics I and Classics II in the 2013 Ucas Cycle, Biomedical Science in 2011 and 2012 and Japanese students in 2009 and 2010, as in each case their Prelims scores are all missing. (b) These students were offered deferred entry. (c) 210 of these students have officially withdrawn from Oxford and 8 are suspended. Numbers per college range from 31 (Harris Manchester) to 5 (Exeter and Hertford).

Table 5: Sample Selection: E&M
Applicant Sample (2009-2014)            6874
Exclusions:
  Not Enrolled at Oxford:
    Rejected Before Interview          -4615
    Rejected After Interview           -1638
    Declined Offer                       -24
    Withdrew during Process              -32
    Failed to meet Offer Grades          -30
    Withdrew After Offer                  -1
  Not in Cohorts 2009-14(a)               -2
  Withdrew from Oxford(b)                -15
  Exclude Extreme Outliers(c)             -1
Final Sample                             516
Notes: (a) These students were offered deferred entry. (b) 4 from Pembroke. No more than 1 at any other college. (c) Unusually low TSA score.

Table 6: Sample Selection: Law
Applicant Sample (2007-2013)            8148
Exclusions:
  Not Enrolled at Oxford:
    Rejected Before Interview          -4094
    Rejected After Interview           -2440
    Declined Offer                       -59
    Withdrew during Process              -60
    Failed to meet Offer Grades         -136
    Withdrew After Offer                  -1
  Not in Cohorts 2007-13(a)              -10
  Skipped Prelims(b)                     -31
  Withdrew before Prelims(c)             -49
  No LNAT/interview scores(d)            -39
Final Sample                            1229
Notes: (a) These students were offered deferred entry. (b) May have come to Oxford with a BA from overseas and been allowed to transfer automatically to year 2 without having to sit Prelims. (c) 16 from Harris Manchester. Less than 3 from most other colleges. (d) 24 of the 39 students with missing admissions test scores were international students, with 22 from non-EU countries.
scores (51 for PPE, 49 in Law and 15 in E&M). The main reasons are (i) students dropping out
of Oxford during their first year and (ii) students taking a year out intending to return and repeat
their first year. I again use listwise deletion. This is not ideal because it rewards “cream skimming”
(encouraging weaker students not to take exams and perhaps drop out). Bias will result if having
missing Prelims scores is an indicator that the student was likely to under-perform relative to their
expected result given their pre-Oxford characteristics. Imputing missing Prelims scores would also
not fully correct for bias. However, missing Prelims scores are rare and seem evenly spread across
colleges, so I do not expect biases to be large.28,29
The sample selection criteria are summarised in Tables 3-6.
5.5 Descriptive Statistics
Tables 7 and 8 present application, offer and enrolment statistics for each college. The first two
columns show that most applicants to Oxford (e.g. over 80% in PPE) are direct applicants. There
is large variation in the numbers of direct applicants received by each college. For example, whereas
Balliol received 985 direct applications for PPE, St Hilda’s received only 69. The colleges with
relatively few direct applicants are allocated large numbers of open applicants (Balliol received 0
open applicants in PPE whereas St Hilda’s received 246). The tables show that almost all colleges
make offers to a higher proportion of direct applicants than they do to open applicants, suggesting
that the direct applicants are on average of higher ability. Consequently, over 90% of students who
take exams at Oxford are direct applicants rather than open applicants.
Tables 9-12 present descriptive statistics for applicants and exam takers for each dataset. Columns
1-3 present mean pre-Oxford characteristics of applicants. Columns 1-3 show that open applicants
are more likely than direct applicants to be international students (both from the EU or from outside
the EU). Open applicants also tend to perform less well in GCSEs, A-levels and admissions tests.
28An exception is that a disproportionately large number of students drop out of Harris Manchester, which may be
related to the fact that Harris Manchester is a college for “mature students”.
29If cream skimming is taking place, we might expect to see a positive correlation between college effectiveness
estimates and the share of a college’s students that are missing exam results. However, the correlation between the
selection on observables estimates and the share of dropouts is −0.86 for PPE, −0.40 for E&M and 0.09 for Law. If
anything, the opposite is the case - less effective colleges tend to have larger shares of dropouts.
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided
David Thesis Final 1 Sided

More Related Content

Viewers also liked

Speaking Tiger Catalogue
Speaking Tiger CatalogueSpeaking Tiger Catalogue
Speaking Tiger Catalogue
Speaking Tiger Publishing Pvt. Ltd.
 
Valerie_Swanson-Portfolio
Valerie_Swanson-PortfolioValerie_Swanson-Portfolio
Valerie_Swanson-PortfolioValerie Swanson
 
Makalah bisnis sosialisme dan komunisme
Makalah bisnis   sosialisme dan komunismeMakalah bisnis   sosialisme dan komunisme
Makalah bisnis sosialisme dan komunisme
Erwin Sugito
 
Taller 3 11 5 (2)
Taller 3 11 5 (2)Taller 3 11 5 (2)
Taller 3 11 5 (2)
Vanessa España
 
Fire safety mnemonics
Fire safety mnemonicsFire safety mnemonics
Fire safety mnemonics
nadyapratt
 
MANUAL TÉCNICO
MANUAL TÉCNICO MANUAL TÉCNICO
MANUAL TÉCNICO
Winker Morales
 
Anibal martinez
Anibal martinezAnibal martinez
Anibal martinez
yobani martinez
 
Designers rooms: Welcome to the Made in Spain
Designers rooms: Welcome to the Made in SpainDesigners rooms: Welcome to the Made in Spain
Designers rooms: Welcome to the Made in Spain
Gabriela Marengo
 
mass communication
mass communicationmass communication
mass communication
eba ali
 
Kinsiology Project
Kinsiology ProjectKinsiology Project
Kinsiology Project
Danny Euresti Jr.
 
Native americans group 5
Native  americans group 5Native  americans group 5
Native americans group 5
Mark Hebert
 
cost Savings
cost Savingscost Savings
Management system of pepsi co
Management system of pepsi coManagement system of pepsi co
Management system of pepsi co
Owais Hassan
 

Viewers also liked (14)

Speaking Tiger Catalogue
Speaking Tiger CatalogueSpeaking Tiger Catalogue
Speaking Tiger Catalogue
 
Valerie_Swanson-Portfolio
Valerie_Swanson-PortfolioValerie_Swanson-Portfolio
Valerie_Swanson-Portfolio
 
Makalah bisnis sosialisme dan komunisme
Makalah bisnis   sosialisme dan komunismeMakalah bisnis   sosialisme dan komunisme
Makalah bisnis sosialisme dan komunisme
 
DISC_TC_SYFY
DISC_TC_SYFYDISC_TC_SYFY
DISC_TC_SYFY
 
Taller 3 11 5 (2)
Taller 3 11 5 (2)Taller 3 11 5 (2)
Taller 3 11 5 (2)
 
Fire safety mnemonics
Fire safety mnemonicsFire safety mnemonics
Fire safety mnemonics
 
MANUAL TÉCNICO
MANUAL TÉCNICO MANUAL TÉCNICO
MANUAL TÉCNICO
 
Anibal martinez
Anibal martinezAnibal martinez
Anibal martinez
 
Designers rooms: Welcome to the Made in Spain
Designers rooms: Welcome to the Made in SpainDesigners rooms: Welcome to the Made in Spain
Designers rooms: Welcome to the Made in Spain
 
mass communication
mass communicationmass communication
mass communication
 
Kinsiology Project
Kinsiology ProjectKinsiology Project
Kinsiology Project
 
Native americans group 5
Native  americans group 5Native  americans group 5
Native americans group 5
 
cost Savings
cost Savingscost Savings
cost Savings
 
Management system of pepsi co
Management system of pepsi coManagement system of pepsi co
Management system of pepsi co
 

Similar to David Thesis Final 1 Sided

A Mini-Thesis Submitted For Transfer From MPhil To PhD Predicting Student Suc...
A Mini-Thesis Submitted For Transfer From MPhil To PhD Predicting Student Suc...A Mini-Thesis Submitted For Transfer From MPhil To PhD Predicting Student Suc...
A Mini-Thesis Submitted For Transfer From MPhil To PhD Predicting Student Suc...
Joaquin Hamad
 
The relationship between school climate and student growth
The relationship between school climate and student growthThe relationship between school climate and student growth
The relationship between school climate and student growth
Siti Khalijah Zainol
 
HonsTokelo
HonsTokeloHonsTokelo
HonsTokelo
Tokelo Khalema
 
OpenCred REport published
OpenCred REport publishedOpenCred REport published
OpenCred REport published
Andreia Inamorato dos Santos
 
Validation of NOn-formal MOOC-based Learning
Validation of NOn-formal MOOC-based LearningValidation of NOn-formal MOOC-based Learning
Validation of NOn-formal MOOC-based Learning
eraser Juan José Calderón
 
final
finalfinal
Teacher and student perceptions of online
Teacher and student perceptions of onlineTeacher and student perceptions of online
Teacher and student perceptions of online
waqasfarooq33
 
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
Stefano Bochicchio
 
938838223-MIT.pdf
938838223-MIT.pdf938838223-MIT.pdf
938838223-MIT.pdf
AbdetaImi
 
The Impact of Early Grading on Academic Choices: Mechanisms and Social Implic...
The Impact of Early Grading on Academic Choices: Mechanisms and Social Implic...The Impact of Early Grading on Academic Choices: Mechanisms and Social Implic...
The Impact of Early Grading on Academic Choices: Mechanisms and Social Implic...
Stockholm Institute of Transition Economics
 
Ap08 compsci coursedesc
Ap08 compsci coursedescAp08 compsci coursedesc
Ap08 compsci coursedesc
htdvul
 
ExamsGamesAndKnapsacks_RobMooreOxfordThesis
ExamsGamesAndKnapsacks_RobMooreOxfordThesisExamsGamesAndKnapsacks_RobMooreOxfordThesis
ExamsGamesAndKnapsacks_RobMooreOxfordThesis
Rob Moore
 
T. Davison - ESS Honours Thesis
T. Davison - ESS Honours ThesisT. Davison - ESS Honours Thesis
T. Davison - ESS Honours Thesis
Tom Davison
 
Smart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao ParganaSmart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao Pargana
Hendrik Drachsler
 
Full thesis
Full thesisFull thesis
Full thesis
areefauzi89
 
ACT_RR2015-4
ACT_RR2015-4ACT_RR2015-4
ACT_RR2015-4
Mary Ann Hanson, PhD
 
Does online interaction with promotional video increase customer learning and...
Does online interaction with promotional video increase customer learning and...Does online interaction with promotional video increase customer learning and...
Does online interaction with promotional video increase customer learning and...
rossm2
 
Global Medical Cures™ | Emerging & Re-Emerging Infectious Diseases
 Global Medical Cures™ | Emerging & Re-Emerging Infectious Diseases Global Medical Cures™ | Emerging & Re-Emerging Infectious Diseases
Global Medical Cures™ | Emerging & Re-Emerging Infectious Diseases
Global Medical Cures™
 
Knustthesis
KnustthesisKnustthesis
Business rerearch survey_analysis__intention to use tablet p_cs among univer...
Business rerearch  survey_analysis__intention to use tablet p_cs among univer...Business rerearch  survey_analysis__intention to use tablet p_cs among univer...
Business rerearch survey_analysis__intention to use tablet p_cs among univer...
Dev Karan Singh Maletia
 

Similar to David Thesis Final 1 Sided (20)

A Mini-Thesis Submitted For Transfer From MPhil To PhD Predicting Student Suc...
A Mini-Thesis Submitted For Transfer From MPhil To PhD Predicting Student Suc...A Mini-Thesis Submitted For Transfer From MPhil To PhD Predicting Student Suc...
A Mini-Thesis Submitted For Transfer From MPhil To PhD Predicting Student Suc...
 
The relationship between school climate and student growth
The relationship between school climate and student growthThe relationship between school climate and student growth
The relationship between school climate and student growth
 
HonsTokelo
HonsTokeloHonsTokelo
HonsTokelo
 
OpenCred REport published
OpenCred REport publishedOpenCred REport published
OpenCred REport published
 
Validation of NOn-formal MOOC-based Learning
Validation of NOn-formal MOOC-based LearningValidation of NOn-formal MOOC-based Learning
Validation of NOn-formal MOOC-based Learning
 
final
finalfinal
final
 
Teacher and student perceptions of online
Teacher and student perceptions of onlineTeacher and student perceptions of online
Teacher and student perceptions of online
 
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
 
938838223-MIT.pdf
938838223-MIT.pdf938838223-MIT.pdf
938838223-MIT.pdf
 
The Impact of Early Grading on Academic Choices: Mechanisms and Social Implic...
The Impact of Early Grading on Academic Choices: Mechanisms and Social Implic...The Impact of Early Grading on Academic Choices: Mechanisms and Social Implic...
The Impact of Early Grading on Academic Choices: Mechanisms and Social Implic...
 
Ap08 compsci coursedesc
Ap08 compsci coursedescAp08 compsci coursedesc
Ap08 compsci coursedesc
 
ExamsGamesAndKnapsacks_RobMooreOxfordThesis
ExamsGamesAndKnapsacks_RobMooreOxfordThesisExamsGamesAndKnapsacks_RobMooreOxfordThesis
ExamsGamesAndKnapsacks_RobMooreOxfordThesis
 
T. Davison - ESS Honours Thesis
T. Davison - ESS Honours ThesisT. Davison - ESS Honours Thesis
T. Davison - ESS Honours Thesis
 
Smart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao ParganaSmart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao Pargana
 
Full thesis
Full thesisFull thesis
Full thesis
 
ACT_RR2015-4
ACT_RR2015-4ACT_RR2015-4
ACT_RR2015-4
 
Does online interaction with promotional video increase customer learning and...
Does online interaction with promotional video increase customer learning and...Does online interaction with promotional video increase customer learning and...
Does online interaction with promotional video increase customer learning and...
 
Global Medical Cures™ | Emerging & Re-Emerging Infectious Diseases
 Global Medical Cures™ | Emerging & Re-Emerging Infectious Diseases Global Medical Cures™ | Emerging & Re-Emerging Infectious Diseases
Global Medical Cures™ | Emerging & Re-Emerging Infectious Diseases
 
Knustthesis
KnustthesisKnustthesis
Knustthesis
 
Business rerearch survey_analysis__intention to use tablet p_cs among univer...
Business rerearch  survey_analysis__intention to use tablet p_cs among univer...Business rerearch  survey_analysis__intention to use tablet p_cs among univer...
Business rerearch survey_analysis__intention to use tablet p_cs among univer...
 

David Thesis Final 1 Sided

  • 1. Gryffindor or Slytherin? The effect of an Oxford College David Lawrence∗ Supervisor: Dr Johannes Abeler Submitted in partial fulfilment of the requirements for the degree of Master of Philosophy in Economics Department of Economics University of Oxford Trinity Term 2016 ∗I would like to thank my supervisor, Johannes Abeler, for the patient guidance, encouragement and advice he has provided throughout my time as his student. I have been extremely lucky to have a supervisor who cared so much about my work, and who responded to my questions and queries so enthusiastically and promptly. I am very grateful to Dr Gosia Turner in Student Data Management and Analysis at Oxford University for providing the data and answering my many questions about it. Valuable comments were received from Theres Lessing, Jonas Mueller-Gastell, Leon Musolff and Matthew Ridley. This work was supported by the Economic and Social Research Council. Word count: 29,904 (356 words on page 2, including footnotes, multiplied by 84 pages, including the title page)
  • 2. Abstract Students at Oxford University attend different colleges. Does the college a student attends matter for their examination results? To answer this question, I use data on all Oxford applicants and entrants between 2009 and 2013, focusing primarily on Preliminary Examination (Prelims) results for 3 courses: Philosophy, Politics and Economics (PPE), Economics and Management (E&M) and Law. I use two methods to account for the possibility student ability differs systematically between colleges. First, I control for “selection on observables” by running an OLS regression on college dummy variables and variables capturing almost all information available to admissions tutors. Results show that colleges matter statistically and practically. Colleges have a modest impact on average Prelims scores, similar to the impact secondary schools have on GCSE results. A one standard deviation increase in college effectiveness leads to a 0.11 standard deviation increase in PPE average Prelims score. The equivalent figures are 0.15 for E&M, 0.14 for Law and 0.09 for all courses combined. Second, I take advantage of a special feature of the Oxford admissions process – that “open applicants” are randomly assigned to colleges – to control for “selection on observables and unobservables”. Results suggest differences in college effectiveness are large and accounting for unobservable ability can change college effectiveness estimates considerably. However, the results are very imprecise so it is difficult to draw strong conclusions. I also test whether my college effectiveness estimates can be explained by college characteristics and find college endowment and peer effects, operating through the number of student per course within a college, are related to college effectiveness. Keywords: Oxford, college effectiveness, selection bias, selection on observables and unobservables, examination results ii
  • 3. Contents 1 Introduction 1 1.1 Prior Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Institutional Background 8 3 Theoretical Model 9 3.1 Defining College Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.2 College Admissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2.1 Applications and Applicant Ability . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2.2 Application Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2.3 Enrolment Probabilities and Expected Exam Results . . . . . . . . . . . . . . . 12 3.2.4 The College Admissions Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4 Econometric Models 16 4.1 Model 1 – Norrington Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.2 Model 2 – Selection on Observables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.3 Model 3 – Selection on Observables and Unobservables . . . . . . . . . . . . . . . . . . 25 5 Data 29 5.1 Why use Four Datasets? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.2 Choice of Outcome Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.3 Choice of Control Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.4 Sample Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.5 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.5.1 Testing Assumptions for Selection on Observables and Unobservables . . . . . 43 6 Results 45 6.1 Results for Norrington Table Plus and Selection on Observables . . . . . . . . . . . . 45 6.2 Robustness Checks for Norrington Table and Selection on Observables . . . . . . . . . 54 6.2.1 Alternative Outcome Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2.2 Interval Scale Metric Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.2.3 Heterogeneity in College Effectiveness across Students of Different Types . . . 59 6.3 Results for Selection on Observables and Unobservables . . . . . . . . . . . . . . . . . 60 7 Characteristics of Effective Colleges 65 8 Discussion and Limitations 70 9 Conclusion and Future Work 72 A Proof of Proposition 1 79 iii
  • 4. List of Tables
1 Information Available in each Dataset (p. 30)
2 Description of Control Variables (p. 33)
3 Sample Selection: PPE (p. 35)
4 Sample Selection: All Subjects (p. 35)
5 Sample Selection: E&M (p. 35)
6 Sample Selection: Law (p. 35)
7 Application, Offer and Enrolment Statistics: PPE and E&M (p. 37)
8 Application, Offer and Enrolment Statistics: Law and All Subjects (p. 38)
9 Mean Applicant and Exam Taker Characteristics: PPE (p. 39)
10 Mean Applicant and Exam Taker Characteristics: E&M (p. 40)
11 Mean Applicant and Exam Taker Characteristics: Law (p. 41)
12 Mean Applicant and Exam Taker Characteristics: All Subjects (p. 42)
13 Tests for Differences in Mean and Variance of Applicant Ability across Colleges (p. 42)
14 P-values from Balance Tests (p. 44)
15 Regressions: PPE (p. 46)
16 Regressions: E&M (p. 46)
17 Regressions: Law (p. 47)
18 Regressions: All Subjects (p. 47)
19 Correlation in College Effects across Courses (p. 54)
20 Alternative Dependent Variable Regressions: PPE (p. 56)
21 Alternative Dependent Variable Regressions: E&M (p. 57)
22 Alternative Dependent Variable Regressions: Law (p. 58)
23 P-values from Tests for Heterogeneity in College Effects across Students (p. 60)
24 Selection on Observables and Unobservables Results for various λ1: PPE, E&M and Law (p. 62)
25 Selection on Observables and Unobservables Results: All Subjects, English, Maths and History (p. 63)
26 Second Stage Regression Results: Impact of Endowment (p. 67)
27 Second Stage Regression Results: Evidence of Peer Effects (p. 68)
List of Figures
1 Applicant Ability and College Admissions Decisions (p. 15)
2 College Ranking by Course: Norrington Table Plus vs Selection on Observables (p. 48)
3 Comparison of Selection on Observables College Ranking across Courses (p. 54)
4 Comparison of College Rankings across Models: All Subjects (p. 64)
  • 5. 1 Introduction
The popular Harry Potter novels of J.K. Rowling are set in the fictional Hogwarts School of Witchcraft and Wizardry, where all the students are magically assigned by a “sorting hat” to one of four houses: Gryffindor, Slytherin, Hufflepuff, and Ravenclaw. Oxford University is organised in a similar way to Hogwarts. Oxford divides students into colleges, just as Hogwarts divides students into houses. The college a student attends can influence not only the facilities available to them (like catering services and libraries), their accommodation and their peers but also the teaching they receive. In this paper I address two basic questions that arise in the context of Oxford colleges. First, to what extent do colleges “make a difference” to student outcomes? Second, are any differences in college effectiveness [1] captured by college characteristics such as endowment, age and size? To answer these questions I use admissions and examination (exam) data on all Oxford applicants and entrants between 2009 and 2013, focusing on how exam results (specifically first year “Prelims” results) vary across colleges in three particular courses: Philosophy, Politics and Economics (PPE), Economics and Management (E&M) and Law, as well as across all courses (“All Subjects”). The key complication in answering these questions is selection bias. Selection into colleges is non-random and thus student ability may differ systematically between colleges. Selection occurs: (i) at the application stage (students choose to apply to one college and not to others); (ii) at the admissions stage (admissions tutors take decisions to make offers to some students and not others); and (iii) at the enrolment stage (students with offers decide whether they want to accept the offer). Non-random selection into colleges can be based on observable characteristics (e.g. prior attainment) and unobservable characteristics (e.g. motivation) which may themselves be correlated with exam results. Failure to adequately control for such selection would lead to biased estimates of college effectiveness, favouring colleges with higher ability students. To overcome the problem of selection bias I employ two empirical methods. First, I estimate an OLS regression which identifies college effects only under a “selection on observables” assumption. Detailed data on almost all variables used by admissions tutors provides some support for
[1] I use the term “college effectiveness” to mean the contribution of colleges to student examination results. I use “college effectiveness”, “college effect” and “college quality” interchangeably.
  • 6. this assumption. Nevertheless, concern remains that “selection on unobservables” may bias college effectiveness estimates. Second, I take advantage of a special feature of the Oxford admissions process: some applicants choose to make an “open application”. These applicants do not apply directly to a college; instead their application profiles are randomly allocated between those colleges that receive relatively few direct applicants. Intuitively, random assignment implies all colleges receive open applicants with equal ability on average. Hence, in relative terms, colleges accepting a large proportion of the open applicants allocated to them must have received weak direct applications and have low admissions standards, while colleges that accept a low proportion of open applicants must have received strong direct applications and have high admissions standards. I formalise this intuition in a theoretical model. Given additional assumptions concerning the distribution of applicant ability, this method can account for “selection on observables and unobservables”. Differences in exam results across colleges that remain after controlling for both observables and unobservables can be considered a measure of college effectiveness or alternatively college “value-added”. [2] My results reveal colleges matter. A simple comparison of average exam results suggests large differences between colleges. When I account for observable student characteristics, exam result differences shrink because high ability students tend to attend more effective colleges. The vast majority of variation in exam results is due to between-student differences. However, even after controlling for observables there remains strong evidence that colleges differ in their effectiveness in boosting student exam results – college effectiveness differences are statistically and practically significant in all courses I consider. A one standard deviation increase in college effectiveness leads to a 0.11 standard deviation increase in Prelims average score in PPE (a 0.65 mark increase). This would be enough to move a 50th percentile student up to the 55th percentile. The estimated standard deviation of college effectiveness is 0.15 for E&M, 0.14 for Law and 0.09 across All Subjects. College effectiveness differences are comparable to school effectiveness differences and slightly lower than teacher effectiveness differences.
[2] Although widely used, the “value-added” term is questionable because inputs and outputs are measured in different units (Goldstein and Spiegelhalter, 1996).
  • 7. I also produce course-specific college rankings that improve on the Norrington table [3] as they account for observable student characteristics. College rankings at an aggregate level are of limited use because college effectiveness differs across courses – hence I focus attention on courses within colleges. Course-specific college rankings are subject to large confidence intervals because of the low number of students per course at each college. Accounting for selection on unobservable student characteristics would likely further change the results. Unfortunately for PPE, E&M and Law, estimation error prevents me from obtaining point estimates for the effectiveness of each college (as only a small number of open applicants enrol at Oxford). Instead I present college effectiveness estimates for different parameterisations of the relationship between prior ability and exam results. I do obtain college effectiveness estimates for some other courses (English, Maths and History) and for All Subjects combined. [4] The results suggest variation in college effectiveness remains large and that unobservable ability can dramatically change college effectiveness estimates. However, the estimates are imprecise so it is difficult to reach strong conclusions. Having established that college effects exist, I use a second stage regression to examine whether they can be explained by college characteristics. The most interesting finding is evidence that peer effects, operating through the number of students per college studying the same course, contribute to college effectiveness. Reverse causality is also possible – if a college happens to be strong in one subject for whatever reason, it will be likely to hire more fellows and thus increase the size of the cohort at that college. If there are benefits to clustering together students studying the same subject then a potential policy implication would be to close small, under-performing courses within a college. There is also evidence that richer colleges are more effective than poorer colleges. However, given that college effectiveness is imperfectly correlated across courses, it seems likely that college effectiveness is primarily determined by course-specific variables related to teaching and peer effects. Overall, much of the variation in college effectiveness remains unexplained. The results of this study may be of interest to a number of different audiences. First, it may
[3] The Norrington table, published each year, documents the degree outcomes of students at each Oxford college. It ranks colleges using the Norrington score, devised in the 1960s by Sir Arthur Norrington, which attaches a score to degree classifications and expresses the overall calculation for each college as a percentage.
[4] Though aggregating across courses makes the random assignment of open applicants far less credible.
  • 8. interest economists studying the educational production function. At a school level, economists have struggled to identify a systematic relationship between school resources and academic performance. This study informs us about the relationship between college resources and academic performance. Second, this study can help prospective students decide which college to apply to. An Oxford college education is an experience good, with quality difficult to observe in advance and only really ascertained upon consumption. Thus the application decisions of prospective students are likely to be based on imperfect information. This paper shows attending a high quality college can boost students’ exam results, which is important given the substantial economic return to better university exam performance. Better exam performance at UK universities is closely related to entering further study (Smith et al., 2000), employment (Smith et al., 2000), industry choice (Feng and Graetz, 2015), short-run earnings (Feng and Graetz, 2015; Naylor et al., 2015) and lifecycle earnings (Walker and Zhu, 2013). For example, Feng and Graetz (2015) study students from the London School of Economics and find that, 12 months after graduation, the causal payoff to a First compared with an Upper Second is a 3% higher expected wage. The difference between an Upper Second and a Lower Second is 7% higher wages. Thus there should be demand by applicants for third-party evaluations of college quality, just as there is demand for league tables of university quality (Chevalier and Jia, 2015). My college effectiveness estimates help to fill this gap in the market – they improve on the unadjusted college rankings currently available to prospective students in the Norrington table. [5][6] Third, my analysis may be of interest to Oxford colleges themselves. Colleges need to measure past effectiveness relative to other colleges for a number of reasons. It allows them to learn best practices from, and share problems with, other colleges, evaluate their own practices, allocate resources more efficiently and plan and set targets for the future. Yet currently colleges receive scant feedback on their past performance in raising exam results and the information they do receive from the Norrington table can be misleading or demoralising due to selection bias – Norrington table rank may be more
[5] Of course, exam based rankings are only a starting point for application decisions and should complement other information about colleges’ quality (such as cost, location, accommodation and facilities) from publications, older siblings, friends at Oxford and personal visits to colleges.
[6] More informed students may create dynamic effects as they would then be able to “vote with their feet” like consumers in a Tiebout model. On the one hand, this may drive up college quality by increasing competition between colleges. On the other hand, as pointed out by Lucas (1980), when criticising the Norrington table, it may increase inequality in raw exam results between colleges because lower ranked colleges would find it difficult to recruit high ability students. Increased competition may also discourage colleges from cooperating with each other.
  • 9. informative about who their students are than how they were taught. My estimates provide a better picture of a college’s performance. Furthermore, my analysis suggests college effectiveness may be increased by admitting larger numbers of students per course; perhaps colleges should concentrate on a narrower range of courses. Even small improvements in college effectiveness are important, because they might be cumulative and because they refer to a large number of students. [7]
1.1 Prior Literature
This is the first study of differences between Oxford colleges. However, my paper is related to various literatures interested in measuring differences in effectiveness across teachers, schools and universities. First, there is a large and active literature (much done by economists) on the value-added of teachers in schools (Hanushek, 1971; Chetty et al., 2013a,b; Koedel et al., 2015) and universities (Carrell and West, 2008; Waldinger, 2010; Illanes et al., 2012; Braga et al., 2014). Empirical evidence shows students are not randomly assigned to teachers, even within schools or universities (e.g. Rothstein (2009)). To account for non-random assignment, teacher value-added models use similar methods to those in this paper – either “selection on observables”, where observables include student and family input measures and a lagged standardised test score, or random assignment of students to teachers (Nye et al., 2004; Carrell and West, 2008). The main conclusions of teacher value-added studies also mirror my findings. Teachers, like colleges, vary in their effectiveness (Nye et al., 2004; Ladd, 2008; Hanushek and Rivkin, 2010; Braga et al., 2014). Within schools, Nye et al. (2004) review 18 early studies of teacher value-added. Using the same method I use (though I correct for measurement error), they find a median standard deviation of teacher effectiveness of 0.34. Hanushek and Rivkin (2010) review more recent studies and report estimates, adjusted for measurement error, that range from 0.08 to 0.26 (average 0.11) using reading tests and 0.11 to 0.36 (average 0.15) in maths. They conclude the literature leaves “little doubt that there are significant differences in teacher effectiveness” (p. 269). Within universities, Braga et al. (2014) find a one standard deviation increase in teacher quality leads to a 0.14 standard deviation increase in Economics test scores and a
[7] Estimates of effectiveness similar to mine are often used for teacher and school accountability purposes. However, for reasons detailed in section 8, I do not believe my college effect estimates should be used to hold colleges to account.
  • 10. 0.22 standard deviation increase in Law and Management test scores. Overall, teacher effects appear slightly larger than the college effects I find (0.09 - 0.15). However, there is no consistent relationship between teacher effectiveness and observable teacher characteristics such as education, experience or salary (Burgess, 2015). Second, there is a literature on the value-added of schools (though only some by economists) (Aitkin and Longford, 1986; Goldhaber and Brewer, 1997; Ladd and Walsh, 2002; Rubin et al., 2004; Reardon and Raudenbush, 2009). Again similar empirical strategies are used, though non-economists tend to use random effect models whereas economists favour fixed effect models. Although school effectiveness is found to impact test scores, there is a consistent finding that schools, like colleges, have less impact on test scores than teachers, with most estimates in the range 0.05-0.20 (Nye et al., 2004; Konstantopoulos, 2005; Deutsch, 2012; Deming, 2014). [8] In one of the most credible studies, Deutsch (2012) takes advantage of a school choice lottery to estimate a school effect size, adjusted for measurement error, of 0.12. School effect sizes seem similar to college effect sizes. Thomas et al. (1997), for example, find the standard deviation in total GCSE performance between schools is 0.10 when pooled across all subjects and is higher in individual subjects, ranging from 0.13 in English to 0.28 in History. This closely mirrors my results in terms of the size of school (college) effects, the variation across subjects (courses) and the fact there is less variation in effectiveness once subjects (courses) are pooled together. Therefore the impact of colleges on exam results appears similar to the impact of schools on GCSE results. This literature also finds school resources have only a weak relationship with test scores, leaving much variation in school effectiveness unexplained (Hanushek, 2006; Burgess, 2015). Third, a small number of studies have attempted to measure university effects on degree outcomes (Bratti, 2002), student satisfaction (Cheng and Marsh, 2010), standardised test scores (Klein et al., 2005) and earnings (Miller III, 2009; Cunha and Miller, 2014). In the attempt to account for selection bias, “selection on observables” methods have been used exclusively. Results suggest large unconditional differences in outcomes across universities with observable student covariates
[8] School effect sizes differ depending on the age of the students – they are highest in kindergarten, fall as students become older until bottoming out around GCSE age and rising again in the sixth form (e.g. Goldstein and Sammons (1997) and Fitz-Gibbon (1991)).
  • 11. accounting for a substantial portion, but not all, of these differences (Miller III, 2009; Cunha and Miller, 2014). Observable university characteristics explained only a small proportion of variation in university value-added (Bratti, 2002). Beyond “value-added”, this paper is related to the research done by economists on the effect on earnings of attending a higher “quality” university, where “quality” is usually defined in terms of mean entry grade, expenditure per student, student/staff ratio and/or ranking in popular league tables (Dale and Krueger, 1999; Black and Smith, 2004, 2006). Conceptually, measuring the return to institution quality is quite different to my analysis focusing on institution effectiveness. Whereas I attempt to estimate quality directly, this literature takes quality as given and attempts to estimate the labour market return to higher quality. Nevertheless, the university quality literature is worth considering because it has found interesting ways to tackle the non-random selection of students into universities (better students sort into higher quality universities). Studies tend to aggregate universities into a small number of quality groups, thereby reducing the dimensionality of the selection problem. This facilitates the use of selection on observables based on OLS (James et al., 1989; Black et al., 2005), selection on observables based on matching (Black and Smith, 2004; Chevalier, 2014) and methods to account for selection on unobservables including regression discontinuity (Saavedra, 2009; Hoekstra, 2009), instrumental variables (Long, 2008) and applicant group fixed effects (Dale and Krueger, 1999, 2014; Broecke, 2012). [9] However, no study in this literature has had the opportunity to exploit random assignment, as I am able to do. The rest of the paper is organised as follows: Section 2 briefly explains the institutional background. Section 3 lays out a theoretical model of Oxford admissions that defines college effects. Section 4 explains the problem of selection bias and outlines econometric models that account for “selection on observables” and “selection on observables and unobservables” respectively. Section 5 describes the data. Section 6 presents the results. Section 7 considers whether college characteristics
[9] I considered, but ultimately rejected, using these methods to account for selection on unobservables. For instance, matching could be applied to Oxford colleges with only minimal complications, such as in Davison (2012), but would do nothing to help account for unobservables. Instrumental variables requires finding over 30 valid instruments, one for each college, which is a formidable challenge. Applicant group fixed effects work better in a university context than a college context because they face a multicollinearity problem when students apply to only one college (see discussion in Miller III (2009)). In addition, applicant group fixed effects make the strong assumption that students apply to colleges in a rational way. I did estimate regressions with applicant group fixed effects but the results were unconvincing and are not reported.
  • 12. can explain differences in college effectiveness. Section 8 discusses limitations and Section 9 concludes. Proofs are collected in the appendix.
2 Institutional Background
The college model is one of the oldest forms of academic organisation in existence. It originated 700 years ago in the UK and was long confined to the universities of Oxford, Cambridge, and Durham. Today, however, college systems have spread worldwide. College systems now operate at several other British universities including Bristol, Kent and Lancaster. In the US, Harvard, Yale and others have established similar college systems. College systems are also common in Canada, Australia, and New Zealand and are present in numerous other countries from Mexico to China (O’Hara, 2016). Oxford University can be thought of as consisting of two parts – (1) a Central Administration and (2) the 32 colleges. [10] The Central Administration is composed of academic departments, research centres, administrative departments, libraries and museums. The Central Administration (i) determines the content of the courses within which college teaching takes place, (ii) organises lectures, seminars and lab work, (iii) provides resources for teaching and learning such as libraries, laboratories, museums and computing facilities, (iv) provides administrative services and centrally managed student services such as counselling and careers and (v) sets and marks exams, and awards degrees. The colleges are self-governing, financially independent and are related to the Central Administration in a federal system not unlike the federal relationship between the 50 US states and the US Federal Government. The colleges (i) select and admit undergraduate students, (ii) provide accommodation, meals, common rooms, libraries, sports and social facilities, and pastoral care for their students and (iii) are responsible for tutorial teaching for undergraduates. Thus Oxford colleges play a significant role in university life, making Oxford an ideal place to study college effects.
[10] There are also five Permanent Private Halls at Oxford admitting undergraduates. They tend to be smaller than colleges, and offer fewer subjects, but are otherwise similar. From now on I include them when I refer to “colleges”.
  • 13. 3 Theoretical Model
In this section I develop a theoretical model of college admissions. The model serves two main purposes. First, it allows me to formally define the “effect” of attending an Oxford college. A failure to clearly define the causal effect of interest has been a criticism of much of the school effect literature (Rubin et al., 2004; Reardon and Raudenbush, 2009). Second, the model motivates the empirical strategies I employ to identify college effects in section 4.
3.1 Defining College Effects
There are a total of $N$ applicants to Oxford indexed $i = 1, 2, \ldots, N$ and $J$ colleges indexed $j = 1, 2, \ldots, J$. For each student $i$ there exist $J$ potential exam results $Y_i^1, Y_i^2, \ldots, Y_i^J$, where $Y_i^j$ denotes the exam result at some specified time (such as the end of year 1) that would be realised by individual $i$ if he or she attended college $j$. Let each potential exam result depend on pre-admission ability $A_i$, a $1 \times K$ row vector. $A_i$ permits multiple sources of ability which may be observable or unobservable. It should be interpreted broadly to include not only cognitive ability but also motivation. Potential exam results also depend on college effects $c_{ij}$, which are allowed to vary across students, and a possibly heteroskedastic random shock $e_{ij}$, uncorrelated with ability and representing measurement error in exam results such as illness on the day of the exam and subjective marking of exams. The potential exam result obtained by an individual $i$ who attends college $j$ is:
$$Y_i^j = Y_i^j(A_i, c_{ij}, e_{ij}). \quad (1)$$
For student $i$ the causal effect of attending college $j$ as opposed to college $k$ is the difference in potential outcomes $Y_i^j - Y_i^k$. The main focus of this paper is on estimating the average causal effect of college $j$ relative to a reference college $k$ for the subpopulation of $n \leq N$ students who actually enrol at Oxford (denoted by the set $E$). This average causal effect of college $j$ relative to college $k$ is:
$$\bar{\beta}_j = \bar{c}_j - \bar{c}_k = \frac{1}{n}\sum_{i \in E} c_{ij} - \frac{1}{n}\sum_{i \in E} c_{ik}. \quad (2)$$
Focusing on the subpopulation of students who attend Oxford, rather than the full population of applicants, makes sense because many applicants (perhaps due to weak prior achievement at
  • 14. school) may have only a low chance of attending Oxford. The definition of college effects relies on two assumptions.
Assumption 1. “Manipulability”: $Y_i^j$ exists for all $i$ and $j$.
Assumption 1 is the assumption of manipulable college assignment (Rosenbaum and Rubin, 1983; Reardon and Raudenbush, 2009). It says each student has at least one potential outcome per college. Intuitively, to talk about the effect of college $j$ one needs to be able to imagine student $i$ attending college $j$, without changing the student’s prior characteristics $A_i$. “Manipulability” would be violated, for instance, if a college only accepted women, implying the potential outcome of a male student at that college may not exist. This assumption is relatively unproblematic at Oxford (certainly compared to schools or universities). Oxford colleges are not generally segregated by student characteristics [11] so it is not difficult to imagine Oxford applicants attending different colleges. Randomness in the admissions process also makes it possible that all applicants have at least some chance, however small, of being offered a place at an Oxford college.
Assumption 2. “No interference between units”: $Y_i^j$ is unique for all $i$ and $j$.
Assumption 2 says each student possesses a maximum of one potential exam result in each college, regardless of the colleges attended by other students (Reardon and Raudenbush, 2009). The “no interference between units” assumption of Cox (1958) is one part of the “Stable Unit Treatment Value Assumption” (or SUTVA; Rubin, 1978). Strictly speaking, this means that a given student’s exam result in a particular college does not depend on who his college peers are (or even how many of them there are). Evidence of peer effects in education makes this assumption questionable (e.g. Feld and Zölitz, 2015). Without it, however, we must treat each student as having $J^N$ potential outcomes, one for each possible assignment of students across colleges. Thus adopting the no interference assumption makes the problem of causal inference tractable (at the cost of some plausibility). The consequences of violations of this assumption on the estimates of college effects are unclear, since without it the causal effects of interest are not well-defined.
[11] St Hilda’s, the last all-women’s college, started accepting men in 2008. An exception is colleges that accept only mature students, such as Harris Manchester.
  • 15. 3.2 College Admissions
3.2.1 Applications and Applicant Ability
Responsibility for admissions is devolved to the college level, then again to the course level. To save notation, let all applicants apply for the same course. College $j$ is allocated (receives the application profiles of) $D_j$ direct applicants and $O_j$ open applicants to consider for admission. The direct applicants received by college $j$ are the students who expressed a preference for college $j$ on their application forms – they applied directly to college $j$. In total there are $D_1 + D_2 + \ldots + D_J = D$ direct applicants to Oxford. Let the ability of direct applicants to each college be normally distributed, with the mean ability of direct applicants allowed to differ between colleges but with the variance constrained to be the same for all colleges. In particular, let the ability of direct applicants to college $j$ be distributed $A_j^D \sim N(\mu_j^D, 1)$, where $A_j^D$ is the ability of a direct applicant to college $j$ and $\mu_j^D$ is the mean ability of direct applicants to college $j$. Colleges also receive open applicants. In total there are $O_1 + O_2 + \ldots + O_J = O$ open applicants to Oxford and their ability follows a standard normal distribution: $A^O \sim N(0, 1)$. Oxford admissions procedures require that all open applicants are pooled together by the Undergraduate Admissions Office. Open applicants are then randomly drawn out, one at a time, and are allocated to the college with the lowest direct applicant to place ratio. This random assignment to colleges is the key to my selection on unobservables identification procedure. I present evidence in section 5.5.1 that supports random assignment. Since each college receives a random sample (of size $O_j$) of open applicants, the ability of open applicants sent to college $j$, denoted $A_j^O$, is also distributed $N(0, 1)$.
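To make the allocation mechanism concrete, the following minimal Python sketch simulates it. All numbers (college counts, direct-applicant totals, places) are hypothetical, and the assumption that the applicants-per-place ratio is updated as open applicants are assigned is an illustrative reading of the rule described above, not a statement about the University's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative figures only: three colleges with different direct-applicant demand.
direct = np.array([40, 18, 12])      # D_j, direct applicants allocated to each college
places = np.array([8, 8, 8])         # places available per college
n_open = 12                          # O, total open applicants

open_ability = rng.standard_normal(n_open)   # A^O ~ N(0, 1), as in the model

allocated = [[] for _ in direct]
extra = np.zeros(len(direct))        # open applicants already sent to each college

for a in open_ability:
    # Each open applicant is drawn at random and sent to the college with the
    # lowest applicants-per-place ratio (assumed here to include earlier open allocations).
    j = int(np.argmin((direct + extra) / places))
    allocated[j].append(a)
    extra[j] += 1

for j, group in enumerate(allocated):
    mean = np.mean(group) if group else float("nan")
    print(f"College {j + 1}: {len(group)} open applicants, mean ability {mean:.2f}")
```

Because the open applicants are drawn from the same $N(0, 1)$ distribution regardless of where they are sent, the average ability of each college's open allocation differs only by sampling noise, which is the intuition behind the balance tests reported in section 5.5.1.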
  • 16. 3.2.2 Application Profiles
Admissions at Oxford colleges are conducted by faculty, who are also researchers and teachers, in the subject a student applies for (referred to as “admissions tutors”). Applicant ability $A_i$ and college effects $c_{ij}$ are not perfectly observable to admissions tutors. Instead colleges observe an applicant’s application profile (“UCAS form”), which includes both “hard” characteristics such as GCSE results, A-level results and the results of Oxford-specific admission tests and “soft” characteristics such as school reference letters and evidence of enthusiasm in the personal statement. [12] The application profile does not include whether an applicant was a direct applicant or an open applicant. Application profiles can be thought of as a noisy signal of the ability of each applicant. Denote the characteristics of applicant $i$ seen by admissions tutors as a $1 \times K$ row vector $x_i = A_i - r_i$, where $r_i$ is a $1 \times K$ row vector. Each of the $K$ elements in $x_i$ provides a signal about a component of ability $A_i$. For example, the maths GCSE result provides a signal of maths ability. Assume that each element of $x_i$ is an unbiased signal for its equivalent element in $A_i$ such that $E(A_i \mid x_i) = x_i$. Also assume $x_i$ and $c_{ij}$ are independent, that is, application profile $x_i$ provides admissions tutors with no information about college effects $c_{ij}$ (this assumption is relaxed in some of the empirical work). Let $X$ denote the support of $x$ and let $X_j$ denote the support of the application profiles for students allocated to college $j$. Let $\eta_j(x)$ be the number of students allocated to college $j$ with application profile $x$.
3.2.3 Enrolment Probabilities and Expected Exam Results
Let $\alpha_j(x)$ denote the probability that a student with application profile $x$, upon being offered admission at college $j$, eventually enrols. Let $Y_j(x)$ denote the expected exam result of an applicant with application profile $x$ who enrols at college $j$. This allows acceptance or rejection of an offer from college $j$ to provide extra information about the ability (and expected exam result) of an applicant. Colleges need to condition on acceptance when making admissions decisions in order to make a correct inference about the student’s ability because of an “acceptance curse”: the student might accept college $j$’s offer because she is of low ability and is rejected by other universities (either UK or foreign).
3.2.4 The College Admissions Problem
Define an admission protocol for college $j$ as a probability $p_j : X_j \to [0, 1]$ such that an applicant allocated to college $j$ with application profile $x$ is offered admission at college $j$ with probability $p_j(x)$. Each college has a capacity constraint, $K_j$ (the maximum number of students college $j$ can
[12] Information on ethnicity and parental social class is also collected on the UCAS form but this information is not available to admissions tutors when they decide on admissions.
  • 17. admit). College $j$ thus chooses the set of $p_j(x) \in [0, 1]$ to maximise its objective function:
$$\max_{p_j(x)} \sum_{x \in X_j} p_j(x)\, \alpha_j(x)\, \eta_j(x)\, Y_j(x) \quad (3)$$
subject to its capacity constraint:
$$\sum_{x \in X_j} p_j(x)\, \alpha_j(x)\, \eta_j(x) \leq K_j. \quad (4)$$
This is almost identical to the university admissions decision problem studied by Bhattacharya et al. (2014) (see also Fu (2014)). The college objective is to maximise total expected exam results among the admitted applicants. It implicitly assumes “Fair Admissions” (Bhattacharya et al., 2014), in the sense that it gives equal weight to the exam results of all applicants, regardless of pre-admission characteristics. This assumption is plausible at Oxford because Oxford emphasises that applicants are admitted strictly based on academic potential. Extra-curricular activities, such as sport and charity work, are given no weight unless they are related to academic potential. “Fair Admissions” is consistent with the “Common Framework” which guides undergraduate admissions at Oxford: “Admissions procedures in all subjects and in all colleges should [. . . ] ensure applicants are selected for admission on the basis that they are well qualified and have the most potential to excel in their chosen course of study” (Lankester et al., 2005). The solution to college $j$’s admissions problem takes the form described below in Proposition 1, which holds under Condition 1: admitting everyone with an expected exam result $Y_j(x) \geq 0$ will exceed capacity in expectation (Bhattacharya et al., 2014).
Condition 1. $\alpha_j(x) > 0$ for any $x \in X_j$ and for some $\delta > 0$ we have $\sum_{x \in X_j} \alpha_j(x)\, \eta_j(x)\, 1\{Y_j(x) \geq 0\} \geq K_j + \delta$.
Proposition 1. Under Condition 1 the solution to college $j$’s admissions problem is:
$$p_j^{OPT}(x) = \begin{cases} 1 & \text{if } Y_j(x) \geq z_j \\ 0 & \text{if } Y_j(x) < z_j \end{cases} \quad \text{where} \quad z_j = \min\Big\{ r : \sum_{x \in X_j} \alpha_j(x)\, \eta_j(x)\, 1\{Y_j(x) \geq r\} \leq K_j \Big\}.$$
  • 18. Proof in Appendix. The model shows that college $j$ uses a cut-off rule (admission threshold). The result is intuitive. Colleges first rank applicants by their expected exam results (conditional on acceptance). Colleges then admit applicants whose expected exam results are the largest, followed by those for whom it is the next largest, and so on till all places are filled. An admissions policy for the ranked groups $\{p_j(x)\}$ takes the form $\{1, \ldots, 1, 0, \ldots, 0\}$. Since ability is continuously distributed and $x$ is an unbiased signal, $x$ is also continuously distributed. Hence there are no point masses in the distribution of $Y_j(x)$ and there is no need to account for ties. As noted by Bhattacharya et al. (2014), the probability of a student enrolling having received an offer from college $j$ affects the admission rule only through its impact on the cut-off; the intuition is that individuals who do not accept an offer of admission do not take up any capacity and this is taken into account in the admission process. Also note that the assumptions imply, perhaps unrealistically, no role for risk in admissions decisions. The Fair Admissions assumption implies student characteristics influence the admissions process only through their effect on expected exam results. The same cut-off $z_j$ is used for open and direct applicants – there is no discrimination against open/direct applicants (or any demographic group). Discrimination would occur if colleges had a higher cut-off for open applicants than direct applicants, as this would imply that a direct applicant with the same expected exam result as an open applicant is more likely to be admitted. Equal cut-offs for open and direct applicants are plausible because, as noted above, colleges are not provided with any information about whether an applicant applied directly or was an open applicant. The solution is illustrated in Figure 1 for the case where applicant ability is fully observed by admissions tutors: $x_i = A_i$ ($r_i = 0$ for all $i$). [13]
[13] This model is a highly stylised model of admissions. For simplicity, it ignores a number of features of the admissions process. Oxford admissions actually involve multiple stages. In the first stage colleges choose which applicants to “short-list” and “deselect” and which applicants to “reserve”. Deselected applicants are rejected. Short-listed and reserved applicants are given interviews at the college they were allocated. Shortlisted but unreserved applicants may be reallocated to another college for interview. After first interviews colleges make some admissions decisions about which applicants to accept. However, a small number of applicants are given second interviews. Second interviews provide applicants not selected by their first college the chance to be accepted by another college (known as “pooling”). It should also be noted that application procedures vary slightly between courses. Capturing all these points would involve a more complex dynamic game played between colleges. Nevertheless, my empirical work relies only on the result that colleges use a cut-off rule and that the cut-off is equal across all applicants. This result would continue to hold if, for example, (i) no new information about applicant ability was revealed at interview, (ii) colleges could correctly predict the admissions decisions of other colleges and (iii) the reallocation of rejected applicants was known in advance by the colleges.
  • 19. Figure 1: Applicant Ability and College Admissions Decisions
[Figure: densities of direct and open applicant ability over an “Ability” axis, with the cut-off $z_j$ and the shaded admitted region.] Figure 1 shows how colleges would make admissions decisions if ability was fully observable (i.e. $x_i = A_i$). Direct applicant ability at college $j$ is distributed $A_j^D \sim N(\mu_j^D, 1)$. The graph is drawn such that $\mu_j^D = 0.5$. Open applicant ability at college $j$ is distributed $A_j^O \sim N(0, 1)$. $z_j$ is the cut-off (admissions threshold). All students with ability above the cut-off (the shaded area) are admitted. The distribution of ability for successful open applicants to college $j$ follows a truncated normal distribution, and similarly for successful direct applicants. A proportion $p_j^D$ of direct applicants and a proportion $p_j^O$ of open applicants are accepted.
With this admissions model in mind, the goal is to estimate the college effects $c_{ij}$. I consider three different empirical models. First, as a simple baseline, I consider differences in mean exam results between colleges in the spirit of the Norrington table. Second, I use a “selection on observables” strategy that attempts to estimate college effects by conditioning on almost all the information available to admissions tutors in the student’s application profile. Third, I take advantage of the random assignment of open applicants and estimate the thresholds $z_j$ for each college. I then use these threshold estimates together with the assumptions of the theoretical model to obtain estimates of college effects. The next section explains these strategies in detail.
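Before turning to those strategies, a small numerical sketch of the cut-off rule in Proposition 1 may help fix ideas. The expected exam results, acceptance probabilities and capacity below are entirely hypothetical; the sketch only illustrates how a cut-off $z_j$ is pinned down by the capacity constraint.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative inputs for one college j: 50 allocated applicants, 10 places.
K_j = 10
Y = rng.normal(0.5, 1.0, size=50)   # Y_j(x): expected exam result of each applicant (hypothetical)
alpha = np.full(Y.size, 0.9)        # alpha_j(x): probability an offer is accepted (hypothetical)

# Proposition 1: rank applicants by expected exam result and lower the cut-off
# until the expected number of enrolling admits would exceed capacity K_j.
order = np.argsort(Y)[::-1]                     # best expected results first
expected_intake = np.cumsum(alpha[order])       # expected enrolment if admitted down to each rank
n_admit = int(np.sum(expected_intake <= K_j))   # largest group whose expected intake fits capacity
z_j = Y[order][n_admit - 1]                     # cut-off = expected result of the weakest admit

offers = Y >= z_j
print(f"cut-off z_j = {z_j:.2f}; offers made = {offers.sum()}; "
      f"expected enrolment = {alpha[offers].sum():.1f}")
```

The same logic explains why acceptance probabilities enter only through the cut-off: a lower acceptance probability simply means the college can afford to make offers further down the ranking before the capacity constraint binds.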
  • 20. 4 Econometric Models
The econometric models in this section must acknowledge that some objects in the theoretical model are unobservable. First, exam results for applicants who do not attend Oxford are not observed. Second, even for the applicants who enrol at Oxford, at most one potential exam result per student is observable (the potential exam result from the college they actually attend). This is the “fundamental problem of causal inference” (Holland, 1986). With a slight abuse of notation I denote observed exam results of student $i$ at college $j$ as $Y_{ij}$ for $i = 1, \ldots, n$. Third, not all the information in an applicant’s application profile is observable. Decompose the information in application profiles into two parts, $x = (x_1, x_2)$, where $x_1$ and $x_2$ are $1 \times G$ and $1 \times (K - G)$ row vectors (with $K > G$, remembering $x$ is $1 \times K$). “Hard” information $x_1$ is assumed observable to admissions tutors and researchers. “Soft” information $x_2$ is assumed observable to admissions tutors but not researchers. The aim is to identify college effects given the available data. All three empirical strategies take the potential exam results function (1) specified in section 3 and assume observed exam results take the linear form:
$$Y_{ij} = \lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij} \quad (5)$$
where $\lambda_0$ is a constant, $\lambda_1$ is a $K \times 1$ column vector that maps ability onto potential exam results and all elements of $\lambda_1$ are strictly positive. I can now decompose $A_i$ into $x_{1i}$, $x_{2i}$ and $r_i$ and rewrite (5) as:
$$Y_{ij} = \lambda_0 + \lambda_{11} x_{1i} + \lambda_{12} x_{2i} + c_{ij} + \lambda_1 r_i + e_{ij} \quad (6)$$
where $\lambda_{11}$ is a $G \times 1$ column vector of the first $G$ elements of $\lambda_1$ and $\lambda_{12}$ is a $(K - G) \times 1$ column vector of the last $K - G$ elements of $\lambda_1$. Student ability unobserved even by admissions tutors is captured by $r_i$.
4.1 Model 1 – Norrington Table
The first empirical strategy is to estimate college effects using a student-level fixed effects regression with no control variables for observable or unobservable ability. That is, Model 1 estimates, for
  • 21. enrolled students:
$$Y_{ij} = \lambda_0 + \sum_{j=1}^{J-1} \bar{\beta}_j C_j + v_{ij} \quad \forall\, i = 1, \ldots, n \quad (7)$$
where $v_{ij} = \sum_{j=1}^{J-1} (\beta_{ij} - \bar{\beta}_j) C_j + \lambda_1 A_i + e_{ij}$, $C_j$ is a dummy variable denoting enrolment at college $j$, $\beta_{ij}$ is a college fixed effect coefficient which may differ across $i$ and $\bar{\beta}_j = \frac{1}{n}\sum_{i=1}^{n} \beta_{ij}$ is the average over students of the college fixed effects. College $J$ is the reference college. Model 1 can be estimated by regressing exam results on a set of college dummy variables. The fixed effect coefficients $\bar{\beta}_j$ are the objects of interest; they give mean differences in exam results relative to the baseline college. Model 1 is thus similar in spirit to the Norrington table. [14] The most important problem with Model 1 (and the Norrington table) is selection bias. Selection bias prevents us from interpreting the fixed effect coefficient estimates as causal effects. Randomised experiments are the gold standard for estimating causal effects and imagining a hypothetical randomised experiment helps to conceptualise the selection bias problem. Consider a two stage admissions process. In stage 1 it is decided which students will attend Oxford. In stage 2 admitted students are randomly assigned to colleges. In this ideal scenario, college assignment is independent of student ability among the population of enrolled students, so the simple mean difference in observed exam results gives an unbiased estimate of differences in college effects for students attending Oxford. Unfortunately for researchers, selection into colleges is non-random in ways that are correlated with exam results. Students and admissions tutors deliberately and systematically select who enrols. At the application stage, students choose where to apply to. At the admissions stage, admissions tutors take decisions to accept some students and not others. There could also be selection at the enrolment stage (in practice, very few students reject offers from Oxford colleges). The selection bias problem makes it difficult to attribute student exam results to the effect of the college attended separately from the effect of preexisting student ability. Formally, since we have assumed $\lambda_{11} \neq 0$ and $\lambda_{12} \neq 0$, selection bias occurs if:
[14] Model 1 does differ from the Norrington table in some ways. For instance, the Norrington table does not take into account differences across courses (getting a First in E&M may be easier or more difficult than getting a First in Law). As I explain in section 5 below, I standardise exam results by course and year, which mitigates this problem.
  • 22. E   J−1 j=1 (βij − ¯βj)Cj + λ1Ai + eij|cij   = 0. Model 1 embodies two types of non-random selection into colleges. First, selection on the het- erogeneous college effect βij. This occurs if individuals differ in their potential exam results, holding ability Ai constant, and if they choose a college (or colleges chooses them) in part on that basis.15 Selection on heterogeneous college effects captures the intuition that students and colleges are looking for a good “match”. The economics of the problem suggest students will tend to apply to colleges that are relatively good at boosting their exam results - a form of selection bias that bares similarities to Roy’s model of occupational choice (Roy, 1951). Similarly colleges will tend to make offers to students who tend to benefit more than average from the college’s teaching. Students enrolled at college j may thus have higher expected exam results from attending college j than the average student. This biases college fixed effect coefficients and it would not be appropriate to interpret such estimates of as causal effects for the average student enrolled at Oxford (though college effect estimates biased in this way may still be of interest). Second, selection on ability Ai. Determinants of exam results may be correlated with college enrolment even if college effects are constant across students (βij = βj for all i). This occurs if individuals choose colleges or colleges choose students in ways correlated with prior ability. Rational applicants will choose to apply to the college that maximises their expected utility. Expected utility is likely to depend on a number of factors including the perceived probability of receiving an offer from each college, risk aversion, the value of their outside option if they did not attend Oxford and preferences over college characteristics (including college effectiveness and other characteristics contributing towards consumption benefits). Observable and unobservable ability are likely to impact the college a student applies to. Furthermore college admissions decisions are based on student ability. Positive selection seems likely, though not inevitable, with students of higher ability tending to go to more effective colleges. In the presence of such selection, estimates of the college fixed effect coefficients will be biased in favour of colleges with higher ability students. 15This assumes students and tutors have an idea of their own student/college-specific coefficient. 18
  • 23. Selection bias causes three problems. First, as discussed, college effectiveness estimates are biased. Second, the importance of variation in college effectiveness in determining exam results could be exaggerated. The total effect of colleges on student exam results could be overstated because some of the omitted ability will be included in the portion of the variance in student exam results explained by college effects. [16] Third, bias would lead to errors in supplementary analyses that aim to identify the characteristics of effective colleges. Selection bias implies Model 1 is best used as a basis for comparison with other models that control for observables and unobservables.
4.2 Model 2 – Selection on Observables
The second empirical strategy is to estimate college effects using a conditional OLS regression. Model 2 estimates for enrolled students:
$$Y_{ij} = \lambda_0 + \lambda_{11} x_{1i} + \sum_{j=1}^{J-1} \bar{\beta}_j C_j + v_{ij} \quad \forall\, i = 1, \ldots, n \quad (8)$$
where now $v_{ij} = \sum_{j=1}^{J-1} (\beta_{ij} - \bar{\beta}_j) C_j + \lambda_{12} x_{2i} + \lambda_1 r_i + e_{ij}$. The difference between Model 1 and Model 2 is that now the observable parts of application profiles $x_{1i}$ are included in the regression. The objects of interest are the college fixed effect coefficients $\bar{\beta}_j$. In an ideal scenario, we could interpret estimated coefficients as estimates of the average causal effect relative to the reference college for students attending Oxford. However, such a causal interpretation requires three further assumptions. I start with two that are relatively unproblematic.
Assumption 3. “Interval scale metric”. The metric of $Y_{ij}$ is interval scaled.
Assumption 3 says that the units of the exam result distribution are on an interval scale (Ballou, 2009; Reardon and Raudenbush, 2009). Interval scales are numeric scales in which we know not only the order, but also the exact differences between the values. Here the assumption says equal sized gains at all points on the exam result scale are valued equally. A college that produces two students with scores of 65 is considered equally as effective as a college producing one with a 50 and another
[16] The effect of the bias on variation in college quality would depend on the direction of the bias. The text here presumes the likely scenario with positive selection bias – i.e., where more effective colleges are assigned students with higher expected exam results.
  • 24. with 80. In comparing mean values of exam results, I implicitly treat exam results as interval-scaled (the mean has no meaning in a non-interval-scaled metric). If exam results are not interval scaled then the college effect results will depend on arbitrary scaling decisions. [17] However, it is unclear how to determine whether exam results are interval scaled because there is often no clear reference metric for cognitive skill (Reardon and Raudenbush, 2009). At a practical level, the importance of this assumption comes down to the sensitivity of college effect estimates and college rankings to different transformations of exam results. Prior evidence on this point is reassuring: Papay (2011) finds test scaling affects teacher rankings only minimally, with correlations between teacher effects using raw and scaled scores exceeding 0.98. [18] I proceed as if exam results are interval scaled and in section 6 test the robustness of my results to various monotonic transformations of the exam results distribution.
Assumption 4. “Common Support or Functional Form”. Either (i) there is adequate observed data in each college to estimate the distribution of potential exam results for students of all types (“Common Support”) or (ii) the functional form of Model 2 correctly specifies potential exam results even for types of students who are not present in a given college (“Functional Form”).
Either “Common Support” or “Functional Form” must hold for college effects to be identified. The common support assumption is violated if not all colleges contain students with any given set of characteristics. For instance, if not all colleges have students at all ability levels (or not sufficient numbers at all levels to provide precise estimates of mean exam results at each ability level), then the common support assumption will fail. In this case we have identification via functional form – the model extrapolates from regions with data into regions without data by relying on the estimated parameters of the specified functional form. If the functional form is also wrong, then regression estimators will be sensitive to differences in the ability distributions for different colleges. However, if the distribution of ability is similar across colleges, the precise functional form used will not matter much for estimation (Imbens, 2004).
[17] This assumption could be relaxed by adopting a non-parametric approach (and comparing, for example, quantiles rather than means) but this would require a very large sample size for accurate estimation.
[18] If two colleges have similar students initially, but one produces students with better exam results, it will have a higher measured college effect regardless of the scale chosen. Similarly, if they produce the same exam results, but one began with weaker students, the ranking of the colleges will not depend on the scale.
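As a concrete, purely illustrative sketch of the kind of regression equation (8) describes, the code below estimates college dummies alongside observable controls on simulated data. The column names (GCSE score, admissions test, interview, gender) are placeholders standing in for the controls described in section 5, not the actual variables in the admissions data, and the fitted numbers carry no empirical meaning.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 600

# Simulated data with placeholder control names; true college effects are zero here.
df = pd.DataFrame({
    "college": rng.choice(list("ABCDEF"), size=n),
    "gcse": rng.standard_normal(n),
    "admissions_test": rng.standard_normal(n),
    "interview": rng.standard_normal(n),
    "female": rng.integers(0, 2, size=n),
})
df["score_std"] = (0.4 * df["gcse"] + 0.3 * df["admissions_test"]
                   + 0.2 * df["interview"] + rng.normal(0, 0.8, n))

# Model 2: college dummies plus observable application-profile controls,
# with heteroskedasticity-robust standard errors.
m2 = smf.ols("score_std ~ C(college) + gcse + admissions_test + interview + female",
             data=df).fit(cov_type="HC1")

college_fe = m2.params.filter(like="C(college)")   # effects relative to the reference college
print(college_fe.round(3))
# The raw dispersion of these estimates overstates the dispersion of true college
# effects because it also reflects estimation error (here the true effects are zero).
print("SD of estimated college effects:", round(college_fe.std(), 3))
```

The final comment in the sketch is the reason a correction for estimation error is needed when summarising the spread of college effectiveness, as noted in the discussion of the teacher value-added literature above.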
  • 25. The common support assumption has been questioned for schools because student covariates differ significantly across schools. However, the distribution of ability is likely to be much more similar across Oxford colleges, partly because of student reallocation across colleges during the admission process. We now come to the most significant problem in estimating college effects: how to deal with selection bias. I make the following two-part “selection on observables” assumption, which allows consistent estimation of college effects:
Assumption 5. “Selection on Observables”
(i) $E\Big[\sum_{j=1}^{J-1}(\beta_{ij} - \bar{\beta}_j) C_j \,\Big|\, C_j, x_{1i}\Big] = 0 \quad \forall\, i = 1, \ldots, n$
(ii) $E\big[\lambda_{12} x_{2i} + \lambda_1 r_i + e_{ij} \,\big|\, C_j, x_{1i}\big] = 0 \quad \forall\, i = 1, \ldots, n$
The selection on observables assumption follows work by Barnow et al. (1981), who observed in a regression setting that unbiasedness is attainable only when the variables driving selection are known, quantified and included in $x_1$. [19] Together parts (i) and (ii) imply that potential exam results are independent of college assignment, given $x_1$. Part (i) requires the heterogeneous part of college effects to be mean independent of college enrolment conditional on $x_{1i}$ and $C_j$. This assumption is similar to, but slightly weaker than, college effects being the same for every student. It implies there is no interaction of college effects with student characteristics in $x_{1i}$. As noted above, if individuals differ in their college effects, and they know this, they ought to act on it, even conditional on ability. Thus this assumption relies on students and tutors being unaware of college effects. [20] In the empirical work, I test this assumption by allowing the college effect coefficients to vary with some elements of $x_{1i}$. Part (ii) says the observable control variables $x_{1i}$ are sufficiently rich that the remaining variation in college enrolment that serves to identify college effects is uncorrelated with the error term in equation (8). This requires two things. First, the observable control variables in $x_{1i}$ must capture, either directly or as proxies, all the factors that affect both college enrolment and exam results. Second, there must exist variables not included in the model that vary college enrolment in ways unrelated to the unobserved component of exam results (i.e. instrumental variables must exist, even
[19] Non-parametric versions of this assumption are variously known as the “conditional independence assumption” (Lechner, 2001) and “unconfoundedness” (Rosenbaum and Rubin, 1983). These are also closely related to “strongly ignorable assignment” (Rosenbaum and Rubin, 1983).
[20] If college effects were obvious to everyone then there would be no need for this thesis!
  • 26. though we do not observe them, as they produce the conditional variation in college enrolment used implicitly in the estimation). Intuitively, the aim is to compare two otherwise identical students who went to different colleges for a reason completely unrelated to their exam results. Practically, I would like to measure and condition on any characteristic whose influence on exam results might be confounded with that of college enrolment due to non-random sorting into different colleges. I am aware that the selection on observables assumption is somewhat heroic. Unobservable ability could cause it to be violated. For instance, students with very high unobservable ability $x_{2i}$ (including excellent school references and personal statements) may be close to certain of receiving an offer from whichever college they apply to and thus may tend to apply to colleges with larger college effects. Alternatively, more “academically motivated” students may be more likely to apply to colleges that improve exam results than to colleges that provide large consumption benefits. If students do select into colleges based on unobservable ability correlated with exam results conditional on observed characteristics, then selection bias results. Nevertheless, the selection on observables assumption can be justified in a number of ways. First, the extensive dataset allows me to condition on almost all information available to college admissions tutors when they are selecting students, as well as some information not seen by admissions tutors. Furthermore, there is evidence that the information available to admissions tutors but unavailable to researchers – the personal statement and school reference – is relatively unimportant in admissions decisions. In the personal statement, students describe the ambitions, skills and experience that make them suitable for the course (e.g. previous work experience, books students have read and essay competitions they have entered). However, Oxford admissions are strictly academic so this only impacts admissions decisions if it is linked to academic potential. The absence of the school reference is also perhaps of limited significance because, as noted by Bhattacharya et al. (2014), school references tend to be somewhat generic and within-school ranks are typically unavailable to admissions tutors. This is supported by survey evidence. Bhattacharya et al. (2014) conduct an anonymised online survey of PPE admissions tutors in Oxford asking how much weight they attach during admissions to covariates, with “1” representing no weight and “5” denoting maximum weight. The results, based on 52 responses, found that the personal statement and school reference were given
Second, two students with the same values for observed characteristics may go to different colleges without invalidating the selection on observables assumption if the difference in their colleges is driven by differences in unobserved characteristics that are themselves unrelated to exam results. There are plenty of potential sources of exogenous variation in college allocations conditional on observables. For instance, students might care about factors other than the ability of colleges to boost exam results. Observation indicates that many applicants explicitly choose among colleges, at least at the margin, for reasons unlikely to be strongly related to exam results. Application decisions may reflect preferences over college location, architecture, accommodation, facilities and size. These preferences may not be strongly linked to the ability to perform well in exams. Indeed, selection based on preferences over college characteristics is actively encouraged by the University - the Oxford website recommends students choose colleges based on these non-academic considerations.

Alternatively, applicants might be incapable of discerning the size of college effects. While this would not normally be a comforting thought, it aids the selection on observables assumption. Evidence from university admissions supports this point. Scott-Clayton (2012) reviews the literature on university admissions and concludes that applicants and parents often know very little about the likely costs and benefits of university. For instance, small behavioural details, such as whether or not a scholarship has a formal name, and tiny changes in the cost of sending standardised test scores to universities have been shown to have non-trivial effects on university applications, inconsistent with rational choice (Avery and Hoxby, 2004; Pallais, 2013). The school choice literature also provides evidence that students and parents do not select schools according to expectations about future test scores - the typical voucher program does nothing to improve test scores (Epple et al., 2015). Such exogenous variation is perhaps even more likely in the context of Oxford colleges because Oxford deemphasises the importance of college choice, stressing that all colleges are similar academically and that the primary consideration when choosing a college should be consumption benefits, not exam results.

A few final points about Model 2 should be noted. First, since I have multiple cohorts of students, I pool students across cohorts for each college.
Evaluating colleges over multiple years reduces the selection bias problem (Koedel and Betts, 2011), increases the number of students per college, thus reducing average standard errors (McCaffrey et al., 2009), and increases the predictive value of past college effects for future college effects (Goldhaber and Hansen, 2013). In pooling across cohorts, I assume that college effects are fixed over time and thus place equal weight on exam results in all years.^22

Second, I allow for heteroskedastic measurement error in exam results by estimating heteroskedasticity-robust standard errors. Exam results measure latent achievement with error because of (i) the limited number of questions on exams, (ii) the imperfect information provided by each question, (iii) maximum and minimum marks, (iv) subjective marking of exams and (v) individual issues such as exam anxiety or on-the-day illness (Boyd et al., 2013). Numerous studies find test score measurement error is larger at the extremes of the distribution (Koedel et al., 2012). The intuition is that exams are well designed to assess learning for "targeted" students (near the centre of the distribution), but not for students whose level of knowledge is not well aligned with the content of the exam (in the tails of the distribution). Ignoring heteroskedastic measurement error in the dependent variable would lead to biased inference. In addition, ignoring measurement error in the control variables would bias college effect estimates. However, I control for multiple prior test scores (A-levels, GCSEs, IB, multiple admissions tests and interview scores), which has been shown to help mitigate the problem (Lockwood and McCaffrey, 2014).

Third, I treat college effects as fixed effects rather than random effects. Whilst random effects models are more efficient than fixed effects models, economists have conventionally avoided random effects approaches (Clarke et al., 2010). This is because their use comes at the cost of an important additional assumption - that college effectiveness is uncorrelated with the student characteristics that predict exam results. This "random effects assumption" would fail, for example, if more effective colleges attracted higher ability students as measured by prior test scores. Random effects estimators would be inconsistent for fixed college sizes as the number of colleges grows.^23 By contrast, fixed effects estimators remain consistent for fixed college sizes as the number of colleges grows.

22. As the number of cohorts grows, "drift" in college performance may put downward pressure on the predictive power of older college effect estimates. Thus, if predicting future college effects is the main aim (relevant for prospective applicants to Oxford), it may be best to down-weight older data (Chetty et al., 2013a). However, my main aim is to gauge the importance of college effectiveness, so I do not account for drift.
23. The bias (technically, the inconsistency) disappears as the number of students per college increases, because the random effects estimates converge to the fixed effects estimates. However, the bias can still be important in finite samples.
Guarino et al. (2015) find that under non-random assignment, random effects estimates can suffer from severe bias and underestimate the magnitudes of college effects. They conclude that fixed effects estimators should be preferred in this situation; I follow their advice and specify college effects as fixed effects. In section 6, I perform Hausman tests (robust to heteroskedasticity) and the results broadly support this choice.

Fourth, I do not apply shrinkage to my college effect estimates. Estimates can be noisy when there are only a small number of students per college, which means colleges with very few students are more likely to end up in the extremes of the distribution (Kane and Staiger, 2002). Shrinkage is often used as a way to make imprecise estimates more reliable by shrinking them towards the average estimated college effect in the sample (a Bayesian prior). As the degree of shrinkage depends on the number of students per college, estimates for colleges with fewer students are more affected, potentially reducing the misclassification of these colleges. The cost of shrinkage is that the weight on the prior introduces a bias into the estimates of college effects. Shrinkage can be applied to both random and fixed effects models (so shrinkage is not a reason to favour random effects models, as is sometimes suggested). Despite the promise of shrinkage, two studies use simulations to show that shrinkage does not by itself substantially improve performance (Guarino et al., 2015; Herrmann et al., 2013). Fixed effects models without shrinkage tend to perform well in simulations and should be the preferred estimator when there is a possibility of non-random assignment.
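To make the Model 2 specification concrete, the sketch below shows how such a fixed effects regression with heteroskedasticity-robust standard errors could be run. It is a minimal illustration only: the file and column names (prelims_std, college and the control set) are hypothetical placeholders, not the variables in the actual datasets.

```python
# Minimal sketch of the Model 2 "selection on observables" regression:
# standardised Prelims score on college fixed effects plus observable controls,
# with heteroskedasticity-robust (HC1) standard errors.
# File and column names are hypothetical placeholders, not the actual dataset fields.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ppe_students.csv")  # hypothetical file of enrolled PPE students

controls = "gender + ethnicity + school_type + gcse_band + alevel_band + tsa_score"
model2 = smf.ols(f"prelims_std ~ C(college) + {controls}", data=df).fit(cov_type="HC1")

# The college dummy coefficients are the estimated college effects relative to
# the omitted baseline college.
college_effects = model2.params.filter(like="C(college)")
print(college_effects.sort_values())
```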
Even though I avoid having to make the random effects assumption, there is still a danger that the selection on observables assumption is violated. As a result, I now move on to Model 3, which can deal with unobservables more effectively.

4.3 Model 3 – Selection on Observables and Unobservables

In this subsection I use a novel procedure to estimate college effects that accounts for both selection on observables and selection on unobservables. To do this, I take the theoretical model of section 3 as a starting point and assume that ability $A_i$ is a scalar (with multiple sources of ability, $A_i$ can be interpreted as a composite scalar index, i.e. a weighted average). When ability $A_i$ is a scalar, I can estimate the admission thresholds $z_j$ for each college. Admission thresholds can be consistently estimated because open applicants are randomly allocated to colleges. I then use these threshold estimates and the linear functional form assumption (5) to obtain estimates of $A_i$ and $\lambda_1$. Colleges with high admissions thresholds tend to have high ability entrants. This allows me to obtain college effect estimates. I now explain this procedure in more detail.

First, remember that in the theoretical model of section 3, the ability of open applicants to Oxford was distributed $N(0, 1)$. The key to identification is that open applicants are randomly allocated to colleges by the Undergraduate Admissions Office. Intuitively, the random allocation means that all colleges receive open applicants of equal ability on average. If a college accepts a large proportion of open applicants, this suggests that its cut-off $z_j$ is low and its entrants have relatively low ability. Conversely, if a college accepts a small proportion of open applicants, we expect its cut-off to be high and its entrants to be of relatively high ability. Formally, the ability of open applicants allocated to college $j$ is also distributed $N(0, 1)$. This means we can consistently estimate the true cut-off $z_j$ at college $j$ using the estimator:

$\hat{z}_j = \Phi^{-1}\big(1 - p^O_j\big)$    (9)

where $\Phi$ is the standard normal cdf and $p^O_j$ is the proportion of open applicants allocated to college $j$ who are offered a place at college $j$ ($p^O_j$ is the area in the upper tail of the standard normal distribution). When $p^O_j$ is large, $\hat{z}_j$ is small, and vice versa. In an infinite sample we could determine the cut-off value $z_j$ exactly; however, colleges are assigned a finite number of open applicants, so we estimate $z_j$ using $\hat{z}_j$. As a simple example, consider a college that accepted 5% of the open applicants it was allocated by the Undergraduate Admissions Office. Then $p^O_j = 0.05$ and the admissions threshold is estimated to be $\hat{z}_j = 1.645$. Since college $j$ uses the same admissions threshold for both open and direct applicants, we expect applicants with ability $A_i \geq 1.645$ to be accepted and applicants with ability $A_i < 1.645$ to be rejected.
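To illustrate the estimator in equation (9), the following sketch computes $\hat{z}_j$ for a few colleges with invented open-applicant offer rates; only the inverse standard normal cdf is required.

```python
# Illustration of the admission threshold estimator z_hat_j = Phi^{-1}(1 - p_O_j),
# where p_O_j is the share of a college's randomly allocated open applicants who
# receive an offer. The offer rates below are invented for illustration only.
from scipy.stats import norm

open_offer_rates = {"College A": 0.05, "College B": 0.20, "College C": 0.35}

for college, p_open in open_offer_rates.items():
    z_hat = norm.ppf(1 - p_open)  # inverse of the standard normal cdf
    print(f"{college}: p_O = {p_open:.2f} -> z_hat = {z_hat:.3f}")

# Output: 1.645, 0.842 and 0.385 respectively. A low offer rate to open
# applicants implies a high estimated cut-off, and vice versa.
```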
Second, note again that the ability of open applicants sent to college $j$ is distributed $N(0, 1)$, the ability of direct applicants to college $j$ is distributed $N(\mu^D_j, 1)$, and each college makes offers to students with expected exam results above its cut-off. Together, these three statements imply that the distribution of ability for successful open applicants to college $j$ follows a truncated normal distribution, and similarly for successful direct applicants. The truncations have the same cut-off point $z_j$, but the means of the truncated normal distributions may differ. This is shown in Figure 1.

Now consider an equation analogous to (9), but this time for direct applicants:

$\hat{z}^D_j = \Phi^{-1}\big(1 - p^D_j\big)$

where $p^D_j$ is the proportion of direct applicants, assigned to college $j$, who are offered a place at college $j$. I refer to $z^D_j$ as the standardised cut-off for the ability of direct applicants. Together, (i) the true cut-off $z_j$, (ii) the standardised cut-off for the ability of direct applicants $z^D_j$ and (iii) the assumption that the standard deviation of ability for direct applicants equals that of open applicants, $\sigma^D = \sigma^O = 1$, give the mean ability of direct applicants to college $j$, $\mu^D_j$, through the equation:

$z^D_j = \dfrac{z_j - \mu^D_j}{\sigma^D} \iff \mu^D_j = z_j - \sigma^D z^D_j = z_j - z^D_j$

Since $z_j$ and $z^D_j$ are unobservable, I use the estimator:

$\hat{\mu}^D_j = \hat{z}_j - \hat{z}^D_j$    (10)

Using the standard result for the mean of a truncated normal distribution gives an estimator for the average ability of open and direct applicants given offers by college $j$:

$E(A^O_j \mid A^O_j > z_j) = \dfrac{\phi(z_j)}{1 - \Phi(z_j)}; \qquad E(A^D_j \mid A^D_j > z_j) = \mu^D_j + \dfrac{\phi(z^D_j)}{1 - \Phi(z^D_j)}$    (11)

where $\phi$ is the standard normal pdf and $\phi(\cdot)/(1 - \Phi(\cdot))$ is the hazard function of the normal distribution. Equation (11) gives estimates of average student ability for the students enrolled at each college (the mean of the upper tail of the normal distributions in Figure 1).
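For concreteness, consider a hypothetical worked example of (10) and (11); the offer rates are invented purely for illustration. Suppose a college makes offers to 5% of its open applicants and 20% of its direct applicants, so $p^O_j = 0.05$ and $p^D_j = 0.20$. Then

$\hat{z}_j = \Phi^{-1}(0.95) \approx 1.645; \qquad \hat{z}^D_j = \Phi^{-1}(0.80) \approx 0.842; \qquad \hat{\mu}^D_j \approx 1.645 - 0.842 = 0.803$

and, applying (11),

$E(A^O_j \mid A^O_j > z_j) \approx \dfrac{\phi(1.645)}{0.05} \approx 2.06; \qquad E(A^D_j \mid A^D_j > z_j) \approx 0.803 + \dfrac{\phi(0.842)}{0.20} \approx 2.20$

so both groups of successful applicants lie well above the mean of the open applicant pool, with the enrolled direct applicants slightly stronger on average.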
Next, use the linear functional form assumption for exam results given in equation (5) to estimate the parameters $\lambda_0$ and $\lambda_1$. By definition, average realised exam results at college $j$ for enrolled open applicants and enrolled direct applicants are given by:

$Y^O_j = \dfrac{1}{O^*_j}\sum_{i \in E^O_j}(\lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij}); \qquad Y^D_j = \dfrac{1}{D^*_j}\sum_{i \in E^D_j}(\lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij})$

where $E^O_j$ is the set of open applicants who were allocated to college $j$ and who enrolled at college $j$, $E^D_j$ is the set of direct applicants to college $j$ who enrolled at college $j$, $O^*_j$ is the number of open applicants who were allocated to college $j$ and who enrolled at college $j$, $D^*_j$ is the number of direct applicants to college $j$ who enrolled at college $j$, $Y^O_j$ is the average realised exam result of open applicants enrolled at college $j$ and $Y^D_j$ is the average realised exam result of direct applicants enrolled at college $j$.

Now assume college effects are constant across students, so $c_{ij} = c_j$ for all $i$. Taking differences causes the college effect $c_j$ and the constant term $\lambda_0$ to drop out:

$Y^O_j - Y^D_j = \lambda_1\Bigg(\dfrac{1}{O^*_j}\sum_{i \in E^O_j} A_i - \dfrac{1}{D^*_j}\sum_{i \in E^D_j} A_i\Bigg) + \dfrac{1}{O^*_j}\sum_{i \in E^O_j} e_{ij} - \dfrac{1}{D^*_j}\sum_{i \in E^D_j} e_{ij}$

$E(A^O_j \mid A^O_j > z_j) - E(A^D_j \mid A^D_j > z_j)$ can be used as an estimator of $\frac{1}{O^*_j}\sum_{i \in E^O_j} A_i - \frac{1}{D^*_j}\sum_{i \in E^D_j} A_i$ for each college $j$. Thus we can estimate $\lambda_1$ using an OLS regression:

$Y^O_j - Y^D_j = \lambda_1\Big[E(A^O_j \mid A^O_j > z_j) - E(A^D_j \mid A^D_j > z_j)\Big] + \dfrac{1}{O^*_j}\sum_{i \in E^O_j} e_{ij} - \dfrac{1}{D^*_j}\sum_{i \in E^D_j} e_{ij}$    (12)

with $J$ observations, one for each college. This gives the OLS estimate of $\lambda_1$. Note there is no constant in this regression because $\lambda_0$ has been differenced away. Unfortunately, heteroskedastic measurement error in the explanatory variable will cause the OLS estimate of $\lambda_1$ to be biased – the estimates of the mean ability of enrolled students contain estimation error, and this estimation error differs across observations (it is likely to be larger for colleges with fewer open applicants, since their cut-offs are less accurately estimated). Whilst methods exist to correct for heteroskedastic measurement error in simple cases (Sullivan, 2001), correcting the $\lambda_1$ estimate is more complex and, as far as I am aware, there is no appropriate method to correct for it.

Once we have $\hat{\lambda}_1$, we can back out $c^O_j$ and $c^D_j$, which are estimates of the college effects (inclusive of the constant term $\lambda_0$) for open applicants and direct applicants respectively:

$c^O_j = Y^O_j - \hat{\lambda}_1 E(A^O_j \mid A^O_j > z_j); \qquad c^D_j = Y^D_j - \hat{\lambda}_1 E(A^D_j \mid A^D_j > z_j)$

Since we have assumed college effects are constant across students, $c^O_j$ and $c^D_j$ are both estimates of the true college effect $c_j$.
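Pulling equations (9)-(12) together, the following sketch illustrates one way the estimation could be implemented on college-level aggregates. It is an illustrative reading of the procedure, not the exact code used in the thesis; all file and column names are hypothetical placeholders.

```python
# Sketch of estimating lambda_1 via the no-constant OLS in (12) and backing out
# the college effects c_O_j and c_D_j. Works on hypothetical college-level data
# with columns: y_open, y_direct (mean standardised Prelims of enrolled open and
# direct applicants) and p_open, p_direct (offer rates to each applicant type).
import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

colleges = pd.read_csv("college_aggregates.csv")  # hypothetical file

z = norm.ppf(1 - colleges["p_open"])      # open-applicant cut-off, eq. (9)
z_d = norm.ppf(1 - colleges["p_direct"])  # standardised direct-applicant cut-off
mu_d = z - z_d                            # mean ability of direct applicants, eq. (10)

# Truncated normal means for enrolled open and direct applicants, eq. (11)
mean_open = norm.pdf(z) / (1 - norm.cdf(z))
mean_direct = mu_d + norm.pdf(z_d) / (1 - norm.cdf(z_d))

# Regression through the origin of exam-score differences on ability differences, eq. (12)
y_diff = colleges["y_open"] - colleges["y_direct"]
a_diff = mean_open - mean_direct
lam1 = sm.OLS(y_diff, a_diff).fit().params.iloc[0]

# Back out college effects for open and direct applicants
c_open = colleges["y_open"] - lam1 * mean_open
c_direct = colleges["y_direct"] - lam1 * mean_direct
```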
A single college effect estimate can be obtained by taking a weighted average of $c^O_j$ and $c^D_j$, where the weights correspond to the numbers of open and direct applicants who took Prelims exams:

$c_j = \dfrac{O^*_j}{O^*_j + D^*_j}\, c^O_j + \dfrac{D^*_j}{O^*_j + D^*_j}\, c^D_j$    (13)

Finally, to make the results of Model 3 directly comparable to those from Model 1 and Model 2, I present college effects relative to those of the best performing college, college $J$:

$\bar{\beta}_j = c_j - c_J$    (14)

Implementing Model 3 in practice requires a number of decisions to be taken with regard to the data. First, I pool across years, as in Model 1 and Model 2. This increases precision by increasing the number of applicants (particularly open applicants) at each college. Pooling applications across years is not ideal because it does not reflect how admissions are carried out in practice; however, open applicants will still be randomly allocated to colleges and, if the distribution of applicant ability is the same each year, cut-offs will be approximately the same across years. Second, I only compare the subset of colleges with at least 50 open applicants (again to increase precision). Third, whereas for Model 1 and Model 2 all students with Prelims scores are included in the analysis, for Model 3, applicants not selected by the first college they were allocated to (students "Rejected by College 1") are not used in the analysis because their expected ability is unknown. This means that Model 3 nests Model 1 as a special case in which $\lambda_1 = 0$ and Model 1 is estimated on a reduced sample containing only applicants selected by the first college they were allocated to.

5 Data

5.1 Why use Four Datasets?

I use four different datasets due to a trade-off between sample size and the availability of key covariates. The largest dataset consists of anonymised data on all Oxford applicants in the years 2009-2013. Information on these students was combined from two different sources. First, application records were obtained from the Student Data Management and Analysis (SDMA) team at Oxford University.
Table 1: Information Available in each Dataset

                                      PPE   E&M   Law   All Subjects
Personal Characteristics               Y     Y     Y        Y
Contextual Information                 Y     Y     Y        Y
Previous School Type                   Y     Y     Y        Y
GCSEs, A-levels and IB                 Y     Y     Y        Y
Breakdown of A-levels by Subject       N     Y     Y        N
Admissions Test Scores                 Y     Y     Y        N
Interview Scores                       N     N     Y        N
School Reference                       N     N     N        N
Personal Statement                     N     N     N        N
Individual Paper Marks                 Y     Y     Y        N

Second, for enrolled students, the application records were then linked to student records (also held by the SDMA) through unique student identifiers. Exam results are contained in the student records. I refer to this large dataset as the "All Subjects" dataset because it covers all courses taught at Oxford. Its obvious advantage is the large number of students. However, focusing exclusively on this large dataset is limiting for a number of reasons. First, given that Model 2 relies on a selection on observables assumption, it is important to condition on all relevant covariates used in the admissions process. Time, resource and data availability constraints prevented the SDMA from supplying interview scores, admissions test scores and the specific A-level subjects taken for students on every Oxford course. For courses where this information is missing, the selection on observables assumption is much less credible. Second, the observable ability controls included on the RHS may have a different impact on exam results depending on the course taken, e.g. the effect of an A-level in economics is probably different if a student studies E&M rather than Law at Oxford. Third, college effects may differ across courses, given that the quality of teaching may vary within colleges. Fourth, admissions procedures are carried out at a course (department) level, so the theoretical model in section 3 implies open applicants are only randomly allocated to colleges within subjects. For these reasons I also analyse three other datasets containing information on PPE, E&M and Law students respectively. I choose these courses because very detailed admissions data is available for each of them and because they all receive large numbers of applications. The information available in these datasets is summarised in Table 1.
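To illustrate the record linkage described above, a minimal sketch is given below; the file names and the student_id field are hypothetical placeholders rather than the actual SDMA identifiers.

```python
# Sketch of linking application records to student records (which hold the exam
# results) via a unique student identifier. File and column names are
# hypothetical placeholders, not the actual SDMA field names.
import pandas as pd

applications = pd.read_csv("applications_2009_2013.csv")  # one row per applicant
student_records = pd.read_csv("student_records.csv")      # enrolled students only

# A left merge keeps every applicant; exam-result fields are missing (NaN) for
# applicants who never enrolled at Oxford.
linked = applications.merge(student_records, on="student_id", how="left")

# The exam-taking sample then consists of the rows with a recorded Prelims average.
exam_takers = linked.dropna(subset=["prelims_average"])
```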
5.2 Choice of Outcome Variable

Preliminary Examinations ("Prelims") are the exams taken by students at the end of their first year at Oxford. In PPE, E&M and Law, students each take three first year papers, all marked out of 100. Each script is marked blindly (so the marking tutors do not know which college the student comes from). The main outcome variable I use is a student's average Prelims score, standardised within cohort (and within course for the All Subjects dataset). For instance, to construct my outcome variable for PPE, I first take the average score across the three first year papers and then standardise the result so that the mean for each cohort is 0 and the standard deviation for each cohort is 1.

Standardising exam results by cohort is important because the distribution of exam scores varies from year to year (partly due to variation in exam difficulty), even within the same course. I also standardise by course for the All Subjects dataset because there is significant variation between subjects in Prelims averages and this variation is mostly unrelated to college effectiveness.^24 Standardising Prelims averages across subjects avoids penalising colleges that teach courses with lower Prelims averages.^25 Using the Prelims average is preferable to estimating separate models for each Prelims paper for two reasons. First, it increases precision. Second, college effectiveness is very likely to "spill over" across papers.

Research has demonstrated that better university exam performance is closely related to other desirable outcomes, which supports the exam-based measurement of college effectiveness (Smith et al., 2000; Walker and Zhu, 2013; Feng and Graetz, 2015; Naylor et al., 2015).

One minor complication in interpreting Prelims scores is that a small number of students retake papers. Students only retake papers if they fail at the first attempt. In these cases the data I have corresponds to the highest mark obtained, which may be from the first or second attempt. It would have been preferable to have the Prelims scores from first attempts; however, retakes are rare, so this should not be a significant problem.

24. The variation may reflect differences between subjects in the nature of the subject matter (arguably, natural science exams are conducive to more extreme patterns of results) and in conventions within subjects of what is of sufficient merit to be awarded a given mark.
25. I do not standardise marks for each individual paper because students and colleges may optimally concentrate their teaching efforts on the Prelims papers that have a higher variance of marks.
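A minimal sketch of this within-cohort (and within-course) standardisation is given below, assuming hypothetical column names for the cohort, course and average Prelims score.

```python
# Sketch of standardising average Prelims scores within cohort (and, for the
# All Subjects dataset, within cohort and course). Column names are hypothetical.
import pandas as pd

def standardise(scores: pd.Series) -> pd.Series:
    """Rescale a group of scores to mean 0 and standard deviation 1."""
    return (scores - scores.mean()) / scores.std()

df = pd.read_csv("enrolled_students.csv")  # hypothetical file

# PPE, E&M and Law datasets: standardise within entry cohort only.
df["prelims_std"] = df.groupby("cohort")["prelims_average"].transform(standardise)

# All Subjects dataset: standardise within cohort and course.
df["prelims_std_all"] = (
    df.groupby(["cohort", "course"])["prelims_average"].transform(standardise)
)
```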
An obvious alternative outcome variable is Final Examination ("Finals") results, such as the average score across Finals papers. However, Prelims results are preferred for a number of reasons. First, attrition is greater with Finals (because more students drop out over time), which implies more missing data and can bias college effect estimates. Second, in Finals not every student takes the same exams because of different option choices. This is problematic because there are differences in score distributions across different options. Third, using Finals results involves excluding students still in their first or second years at Oxford, substantially reducing the power of the analysis.^26 However, when interpreting the results one should keep in mind that Prelims are less important to students than Finals (they are "lower stakes" exams) and Prelims-based estimates may over- or underestimate Finals college effects (underestimate because they allow less time for any college effect to become evident and because college effects may be cumulative; overestimate because teaching is more college-focused in the first year than in later years). For these reasons I focus on standardised average Prelims scores in the main analysis, but also briefly consider the consequences of using individual first year paper scores and the average Finals score as outcome variables.

5.3 Choice of Control Variables

The control variables included in the analysis are summarised in Table 2. Most of the controls will be familiar to a UK audience. Less familiar may be contextual information^27, which is provided to admissions tutors in the form of "flags" identifying disadvantaged students. Admissions tutors are advised to use the contextual information to suggest extra candidates to interview. The International Baccalaureate (IB) is an alternative to A-levels in which students complete assessments in six subjects; each student receives a mark out of 45. The Thinking Skills Assessment (TSA) is the admissions test for PPE and E&M applicants. It includes a 90-minute multiple-choice test, marked by the Admissions Testing Service, and the marks are made available to colleges.

26. Using final degree class, as in the Norrington table, has the additional problem that it is discrete and thus discards much useful information concerning student achievement. This is a particular problem at Oxford, where over 50% of students obtain a 2:1.
27. It is sometimes argued that contextual information (and some personal characteristics such as gender and race) should not be controlled for, because controlling for contextual information sets lower expectations for some demographics. However, not taking these differences into account may penalise colleges that serve these students for reasons that may be at least partly outside their control.
Table 2: Description of Control Variables

Personal Characteristics
  Gender: Dummy variable indicating whether the student is male or female.
  Ethnicity / Overseas status: Dummy variables indicating "UK White"; "UK Black"; "UK Asian"; "UK Other ethnic group"; "UK Information refused"; "EU"; and "Non-EU".

Contextual Information
  Pre-16 School Flag: Performance of the applicant's school at GCSE is below the national average.
  Post-16 School Flag: Performance of the applicant's school at A-level is below the national average.
  Care Flag: Applicant has been in care for more than three months.
  Polar Flag: Applicant's postcode is in POLAR quintiles 1 or 2, indicating the lowest rates of young people's participation in Higher Education.
  Acorn Flag: Applicant's postcode is in Acorn groups 4 or 5, meaning residents are typically categorised as 'financially stretched' or living in 'urban adversity'.

Prior Educational Qualifications
  Previous school type: Dummy variables for State, Independent and other school types.
  GCSEs: Dummy variables for the proportion of A*s obtained at GCSE (if more than 5 GCSEs). Categories are "Band 1: 100%"; "Band 2: 75-99%"; "Band 3: 50-74%"; "Band 4: < 50%"; and "Fewer than 5 GCSEs".
  A-levels: Dummy variables for A-level bands. Categories are "Did not take A-levels"; "Applied to start prior to 2010"; "Applied to start in 2010 or later and no A*"; "1 A*"; "2 A*"; "3 A*"; and "4 or more A*".
  A-level subjects: Dummy variables indicating whether students had taken A-levels in certain subjects. Subjects for E&M are Economics, Maths and Further Maths; subjects for Law are History and Law.
  A-level subject grades: Dummy variables indicating the grade achieved in the included subjects.
  IB: Dummy variables for IB bands. Categories are "Band 1: 45 (full marks)"; "Band 2: {43, 44}"; "Band 3: {41, 42}"; "Band 4: ≤ 40"; and "Did not take IB".

Admissions Tests and Interviews
  TSA: Variables for the TSA critical thinking score and the TSA problem solving score.
  LNAT: Variables for the LNAT multiple choice score and the LNAT essay score.
  Interview Score: An interview score is given to each candidate out of 10.
The Law National Admissions Test (LNAT) is the admissions test for Law applicants. The LNAT includes a multiple choice section (machine marked out of 42) and an essay section (individually marked by colleges). Interviews are usually face-to-face with admissions tutors and most candidates have 2 interviews. Law students are given an interview score out of 10.

A quick note should also be made about the A-level grade variables, whose use is complicated by two factors. First, a new A* grade was introduced in 2010. I create a separate A-level dummy variable for students who applied before the A* grade was introduced. Second, most applicants are only halfway through their A-levels when they apply to Oxford. In this case admissions tutors observe predicted grades, which are not available in the data. This should not be too problematic because rational admissions tutors will, on average, make correct inferences about the actual A-level grades an applicant will achieve. Actual A-levels achieved are also probably a better measure of ability than predicted grades.

5.4 Sample Selection

Sample selection involves choosing both a sample of applicants (only relevant for estimating cut-offs in Model 3) and a sample of enrolled students (relevant for all three models). Fortunately, the datasets contain only a very small amount of missing data. The missing data comes in two forms. First, missing values of control variables for individuals who otherwise provide relatively complete data. For example, a small number of students (12 in PPE, 39 in Law and 0 in E&M) are missing admissions test scores, perhaps because they were ill on the day of the test or because there were no available test centres in their home countries (the vast majority are international students, many from outside the EU). Imputing values for these missing covariates is possible. However, the advantages of multiple imputation are minimal at best when missing data constitutes less than 5% of the sample (Manly and Wells, 2015). Multiple imputation also makes interpreting results more difficult (R² cannot be reported, for example). I therefore drop these observations (listwise deletion), which is standard practice in the value-added literature. This choice should be taken into account when interpreting the resulting college effect estimates.
Table 3: Sample Selection: PPE
  Applicant Sample (2009-2014)^a: 9867
  Exclusions:
    Not Enrolled at Oxford: -8404
    Not in Cohorts 2009-14^b: -7
    Withdrew from Oxford: -51
    Exclude Extreme Outliers^c: -2
    No Admissions Test Scores^d: -12
  Final Sample: 1391
  a. Applicant sample excludes 53 students who have student records but no application record. This is likely to be because they applied pre-2009, before the dataset begins.
  b. These students were offered deferred entry.
  c. 2 students had Economics marks recorded as 0 or 1. The next lowest mark is 30. It is unclear whether these are typographical errors or true marks.
  d. 11 of the 12 students with missing admissions test scores were international students, with 10 from non-EU countries.

Table 4: Sample Selection: All Subjects
  Applicant Sample (2009-2013)^a: 75033
  Exclusions:
    Not Enrolled at Oxford: -61153
    Not in Cohorts 2009-2013^b: -76
    No Prelims Average^c: -376
    St Stephen's College: -1
  Final Sample: 14427
  a. Excludes all Medicine and Physiological Science applicants as they are not given "marks" in Prelims. Also excludes Classics I and Classics II in the 2013 Ucas cycle, Biomedical Science in 2011 and 2012, and Japanese students in 2009 and 2010, as in each case their Prelims scores are all missing.
  b. These students were offered deferred entry.
  c. 210 of these students have officially withdrawn from Oxford and 8 are suspended. Numbers per college range from 31 (Harris Manchester) to 5 (Exeter and Hertford).

Table 5: Sample Selection: E&M
  Applicant Sample (2009-2014): 6874
  Exclusions:
    Not Enrolled at Oxford:
      Rejected Before Interview: -4615
      Rejected After Interview: -1638
      Declined Offer: -24
      Withdrew during Process: -32
      Failed to meet Offer Grades: -30
      Withdrew After Offer: -1
    Not in Cohorts 2009-14^a: -2
    Withdrew from Oxford^b: -15
    Exclude Extreme Outliers^c: -1
  Final Sample: 516
  a. These students were offered deferred entry.
  b. 4 from Pembroke. No more than 1 at any other college.
  c. Unusually low TSA score.

Table 6: Sample Selection: Law
  Applicant Sample (2007-2013): 8148
  Exclusions:
    Not Enrolled at Oxford:
      Rejected Before Interview: -4094
      Rejected After Interview: -2440
      Declined Offer: -59
      Withdrew during Process: -60
      Failed to meet Offer Grades: -136
      Withdrew After Offer: -1
    Not in Cohorts 2007-13^a: -10
    Skipped Prelims^b: -31
    Withdrew before Prelims^c: -49
    No LNAT/interview scores^d: -39
  Final Sample: 1229
  a. These students were offered deferred entry.
  b. May have come to Oxford with a BA from overseas and been allowed to transfer automatically to year 2 without having to sit Prelims.
  c. 16 from Harris Manchester. Fewer than 3 from most other colleges.
  d. 24 of the 39 students with missing admissions test scores were international students, with 22 from non-EU countries.
Second, and more significantly, some students who matriculated at Oxford have missing Prelims scores (51 in PPE, 49 in Law and 15 in E&M). The main reasons are (i) students dropping out of Oxford during their first year and (ii) students taking a year out intending to return and repeat their first year. I again use listwise deletion. This is not ideal because it rewards "cream skimming" (encouraging weaker students not to take exams, or perhaps to drop out). Bias will result if having a missing Prelims score is an indicator that the student was likely to under-perform relative to the result expected given their pre-Oxford characteristics. Imputing missing Prelims scores would also not fully correct for this bias. However, missing Prelims scores are rare and seem evenly spread across colleges, so I do not expect the biases to be large.^28,29 The sample selection criteria are summarised in Tables 3-6.

5.5 Descriptive Statistics

Tables 7 and 8 present application, offer and enrolment statistics for each college. The first two columns show that most applicants to Oxford (e.g. over 80% in PPE) are direct applicants. There is large variation in the number of direct applications received by each college. For example, whereas Balliol received 985 direct applications for PPE, St Hilda's received only 69. Colleges with relatively few direct applicants are allocated large numbers of open applicants (Balliol received 0 open applicants in PPE whereas St Hilda's received 246). The tables show that almost all colleges make offers to a higher proportion of their direct applicants than of their open applicants, suggesting that direct applicants are on average of higher ability. Consequently, over 90% of students who take exams at Oxford are direct applicants rather than open applicants.

Tables 9-12 present descriptive statistics for applicants and exam takers in each dataset. Columns 1-3 present the mean pre-Oxford characteristics of applicants and show that open applicants are more likely than direct applicants to be international students (both from the EU and from outside the EU). Open applicants also tend to perform less well in GCSEs, A-levels and admissions tests.

28. An exception is that a disproportionately large number of students drop out of Harris Manchester, which may be related to the fact that Harris Manchester is a college for "mature students".
29. If cream skimming were taking place, we might expect to see a positive correlation between college effectiveness estimates and the share of a college's students that are missing exam results. However, the correlation between the selection on observables estimates and the share of dropouts is −0.86 for PPE, −0.40 for E&M and 0.09 for Law. If anything, the opposite is the case: less effective colleges tend to have larger shares of dropouts.