The Student Ratings Debate Continued: What Has Changed?

Running head: STUDENT RATINGS DEBATE

Matthew J. Hendrickson

Ball State University

ID 602: Institutional Research


The Student Ratings Debate Continued: What Has Changed?

The debate over the usefulness and applicability of student ratings (SRs) has been an

ongoing problem. The original purpose was to aid administrators in monitoring teaching quality

and to help faculty improve their teaching (Guthrie, 1954, in Kulik, 2001). Today, we seem to

be far removed from this original concept, as many institutions tend to use these ratings as the

bulk of faculty reviews (Abrami, 2001b). More specifically, this concept primarily concerns

reviews for tenure positions and salary increases (Ory & Ryan, 2001).

To date, over 2000 studies have focused on student evaluations of college teachers (Safer,

Farmer, Segalla, & Elhoubi, 2005). In these studies, the main factors that have been found to

affect student evaluations are: subject matter taught, classroom instructor, rank of the instructor,

the student’s expected grade, student major, whether the course is an elective or is required, class

enrollment, the enthusiasm and warmth of the instructor, and the course level. There have also

been a few less common additions to the literature in recent years, including the use of

humor by the instructor (Adamson, O’Kane, & Shevlin, 2005) and closeness of the faculty to the

students (Safer et al., 2005). However, in the past few years, many of these issues have fallen to

the background as the strongest debates concern the student’s expected grade (e.g., Centra, 2003;

Griffin, 2004; Heckert, Latier, Ringwald, & Silvey, 2006; Maurer, 2006) and issues with validity

(Olivares, 2003; Renaud & Murray, 2005; Theall, Abrami, & Mets, 2001).

The common theme behind the criticism and debate on the usefulness and applicability of

student ratings is focused on the repeated finding that higher grades are correlated with higher

student satisfaction and higher teacher ratings (Cohen, 1981, in Kulik, 2001; Kulik, 2001; Safer

et al., 2005; to name a few). However, others maintain that there is no causal relationship

between student grades and teacher ratings, or that these differences may be due to different


factors, such as the ability of the student and the types of students who sign up for particular

courses (e.g., upper-division courses, major courses; Centra, 2003; Theall & Franklin, 2001).

An expansion of this topic concerns a few new considerations of non-explicit behaviors, such as

humor and closeness of instructors to students (Adamson et al., 2005; Safer et al., 2005).

Validity of student ratings has been in question in the literature for some time now, with

an entire monograph dedicated to this idea, and even more research in the years following

(Olivares, 2003; Renaud & Murray, 2005; Theall et al., 2001). Topics included in this argument

focus on the premise that SRs are not valid for use in faculty promotion and tenure decisions,

although they are useful for the development and learning of the instructors in the attempt to

become better teachers. Once again, there are conflicting viewpoints on this issue. Abrami

(2001a) suggests that these ratings are in fact usable and beneficial, although there is a need for

some revision of these forms to eliminate confounding variables and human biases. On the other

hand, Olivares (2003) suggests that although SRs may benefit instructors in the learning process,

these surveys should be used with caution, as they are not usable as a measure of teaching

effectiveness. Renaud and Murray (2005) posited that the systematic distortion hypothesis

should be taken into account when considering SRs.

The last topic of study for this review is the perspectives of faculty and students on both

the course and teacher evaluations (Schmelkin, Spencer, & Gellman, 1997) and teaching and its

evaluation (Spencer & Schmelkin, 2002). This aspect is along the same lines of the student

ratings debate, but takes on a different view than most of the work in the area. It delves further

into the realm of summative evaluation as it pertains to faculty satisfaction (Schmelkin et al.,

1997) and students’ willingness to complete the SRs as well as their thoughts on whether the SR

results were taken seriously by the faculty (Spencer & Schmelkin, 2002).


Summary

Expected Grades

Although there have been many issues concerning student ratings in the past, the most

prominent in the past five years has been the effect of expected grades on SRs. The ideas

proposed by Greenwald and Gillmore (1997a, 1997b) set off an explosion in the study of expected

grades. The ensuing wave of research has produced only mixed results.

Centra (2003) posited that courses rated at the “just right” level, as opposed to too easy or too

hard, were rated the highest. In stark contrast, Griffin (2004) claims that an instructor’s grading

leniency as perceived by students was positively associated with almost every dimension

examined. Maurer (2006), in turn, found that student ratings appear unrelated to the

ability to punish instructors and instead linked the effect of expected grades to cognitive dissonance

theory.

As proposed by Greenwald and Gillmore (1997a), there are five main theories of the

grade-ratings correlation. They are:

1. Teaching effectiveness influences both grades and ratings.

2. Students’ general academic motivation influences both grades and ratings.

3. Students’ course-specific motivation influences both grades and ratings.

4. Students infer course quality and own ability from received grades.

5. Students give high ratings in appreciation for lenient grading. (pp. 1210-1211)

Greenwald and Gillmore (1997b) further proposed a Grading Leniency Model as an

attempt to remove the unwanted effects of grading leniency on SRs. The model considered the

course and instructor, self-reported progress, having the same instructor again, absolute expected

grade, relative expected grade, the challenge of the class, the effort involved in the class,


involvement in the class, and hours worked per credit. The findings suggested that courses that

gave higher grades were better liked and that these courses had lighter workloads.

Centra (2003) used data from over 50,000 college courses taught by teachers who used

the Student Instructional Report II (Centra, 1972, in Centra, 2003). After controlling for learning

outcomes, expected grades generally did not affect SRs. In fact, contrary to what most faculty

think, courses in the natural sciences where students expected A’s were rated lower, not higher. This

goes against the presumption that the “easy class” or “easy grading” in tough classes gains higher

SR scores. This study also found that courses rated at the “just right” level, versus too difficult

or too easy, were actually rated highest, which is in stark contrast to Greenwald and Gillmore’s

(1997a, 1997b) findings. This suggests that students feel instruction is most effective when they

are able to manage the course with their level of preparation and ability.
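
The idea of "controlling for learning outcomes" can be illustrated with a first-order partial correlation. The sketch below is only an illustration of the general technique and does not reproduce Centra's (2003) analysis. The eight course-level values and the variable names (learning, expected_grade, rating) are invented for the example and were chosen so that expected grades and ratings are related only through learning.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data (not from any cited study): one value per course.
learning = [1, 2, 3, 4, 5, 6, 7, 8]          # assumed learning-outcome score
expected_grade = [2, 1, 2, 5, 6, 5, 6, 9]    # learning plus unrelated noise
rating = [2, 1, 4, 3, 4, 7, 6, 9]            # learning plus different noise

raw_r = pearson(expected_grade, rating)
partial_r = partial_corr(expected_grade, rating, learning)

print("raw correlation:", round(raw_r, 3))
print("controlling for learning:", round(partial_r, 3))
```

In this constructed example the raw grade-rating correlation is strong, yet it disappears once learning is held constant. This is the pattern one would expect if expected grades do not bias SRs independently of how much students learn.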

Student perceptions of grade leniency have been shown to be positively associated with

higher ratings of instructors (Griffin, 2004). Griffin assessed the three most popular explanations

for this positive correlation. They are:

1) The positive correlation between expected grade and student ratings of instruction may be

explained as indicating a valid measurement of student ratings since better instruction should

result in more learning, better grades, and better ratings.

2) The association between expected grades and ratings of instruction could be spurious and

produced for various student characteristics such as motivation.

3) An association between expected grades and ratings could reflect some type of biasing effect.

(p. 411)

Griffin suggested that there was support for all three of these ideas, though to

varying degrees. However, he posited that the most likely and perhaps the strongest effect is that

of the third possibility: a biasing effect on ratings. In addition to the biasing effect, he stated that

there appears to be a mix between these biasing effects and valid teaching and learning


combinations. The biasing factors discussed above suggest a penalty effect where students who

received lower than expected grades consistently provided ratings lower than the rest of the

students. Griffin explains these findings as being caused by a self-serving bias. The self-serving

bias states that “a student will attempt to protect his or her view of self and assign blame for the

lower than expected performance to an external cause. The likely target will be the instructor, so

the student will rate the instructor lower, thus a rating penalty effect will occur” (p. 412).

This was not, however, what Maurer (2006) found. The results suggest that student

ratings do not appear to be related to the ability to punish instructors. Although he agrees that

there is a biasing effect of expected grades and SRs, he also suggests that this is not due to a

penalty effect, or ability to punish instructors. Rather, he suggested that cognitive dissonance

theory has a role in SRs concerning negative reviews. The basis for this argument is that there is

little evidence to suggest a link with revenge, and that most students are either unaware of the use

of SRs in personnel decisions or do not believe their ratings will have an effect on these

decisions. The cognitive dissonance theory maintains that when students expect to receive a

high grade but actually receive a low grade, they are confronted with a discrepancy that they

must explain. But for this to be true, only ratings of the instructor would be influenced by

expected grade, and ratings of other elements of the course (textbook relevance, etc.) would

remain unaffected. The findings supported this assertion, leading to the conclusion that the effect of

expected grades on SRs may be explained by cognitive dissonance theory.

Non-explicit Behaviors

Non-explicit behaviors are argued to create problems with the current data in the area,

suggesting that SRs may not be assessing teaching effectiveness, but may really be assessing

other factors, such as the amount of humor shown by the instructor (Adamson et al., 2005) or the


distance students are from the instructor (Safer et al., 2005). The majority of the studies

conducted on student ratings focus some attention on non-explicit factors that influence SRs,

whether that is the aim of the study or not. Most articles consider this idea in their introductions

and some consider it in their discussions, although most do not attempt to assess or test these

concepts. There has been a lot of attention on the area, but very little clear evidence about the

effect of these influences.

As Adamson, O’Kane, and Shevlin (2005) have shown, there is a significant positive

relationship between the humor used by the instructor, or “funniness,” and the students’ overall

ratings. Also of interest when considering non-explicit biases is the distance of the teacher to the

student (Safer et al., 2005). This study suggested that 1) ratings of instructors varied sizably, 2)

student grades correlated positively with their SRs of their instructors, and, most importantly, 3) the

number of rows per classroom was negatively associated with SRs. It is further suggested that

class enrollment has a significant relationship with SRs, but that this relationship has

thus far been ignored. These two studies suggest that, even though non-explicit biasing factors

have been prevalent, they have fallen to the background while issues of validity and utility have

been argued.

Validity

The monograph edited by Theall, Abrami, and Mets (2001) illustrated

many problems with the validity of SRs. Since then, there has been continued arguing about the

validity and usability of student ratings. Abrami (2001a) put it best by stating that these SRs

may be flawed in some aspects of design, but great effort should be put into working out these flaws

so as to create utility within these surveys for the betterment of education. For instance, by adding

more mathematical conditions and formulas to the scoring of SRs, Abrami felt that many of the


biases, non-explicit or non-verbal behaviors, and even faculty and student perspectives could be

changed and better surveys may be developed and used for their intended purpose: to foster

changes in teaching styles to create better faculty and instructors at one’s institution.

Along these same lines, Renaud and Murray (2005) are also proponents of SRs. They

posited that “the literature indicates that student ratings of teaching effectiveness are positively

related to objective measures of student learning, and thus can be seen as valid indicators of

instructional quality” (p. 929). By the use of the systematic distortion hypothesis (SDH), which

states that traits can be judged as correlated when in reality they do not correlate or barely

correlate, Renaud and Murray (2005) attempt to explain away some of the problems plaguing

SRs. By using three correlation matrices: one on ratings of personality traits, one on conceptual

associations between the same traits, and one on direct observation of behaviors corresponding to

these personality traits, one can infer correlations that are thought to exist in the minds of those

who rate these correlations. For example, students may rate an effective professor as being more

accessible outside of class. Because many of these students never needed

that professor outside of the classroom, they combine the two traits and infer that he must have

been, or would have been, accessible if needed. This difference of correlations focuses on two

types of accuracy: stereotype accuracy, “the extent to which a profile of ratings agree with the

traits or behaviors of an average or typical member of the group which the ratee represents” and

differential accuracy, “the extent to which ratings of a particular individual are congruent with

that person’s actual profile” (p. 948).

Olivares (2003) provided an in-depth analysis of the conceptualization of SRs, as well as

the analysis of many different types of validity and their connections to SRs. He argues that the

content validity of SRs is lacking because they do not assess the universality of teacher


effectiveness. Criterion validity seems lacking because the inference must hold that highly rated

teachers are effective, whereas lower-rated teachers are ineffective. Concerning construct validity,

Olivares suggests that the multitrait-multimethod matrix (MTMM) should be used as a means to

determine whether SRs are truly measuring the construct in question, teacher effectiveness. By

measuring the same traits with different methods, convergent validity can be assessed, as the methods should measure the

same things. He further points out that supporters of SRs have agreed upon the problem that

teacher effectiveness has not been operationalized concretely. This poses the problem that there

is no clear criterion measure of instructional effectiveness. Concerning both parties, the

statement that both proponents and opponents of SRs have sought primarily to confirm their

respective hypotheses, rather than to disprove them, adds further to the problem of codifying

SRs. To further this point, he goes on to say that no empirical evidence is present to suggest that

widespread implementation of teacher ratings has resulted in more effective teachers or better

learned and more knowledgeable students.

Faculty and Student Perspectives

It is interesting to note that given the large amount of focus on the SRs and their

problems, very few studies have focused on the perspectives of faculty and students toward

course and teacher evaluations. Schmelkin and Spencer have taken this alternative approach to

SRs and have thus far assessed faculty perspectives (Schmelkin et al., 1997) and student

perspectives (Spencer & Schmelkin, 2002). At the end of the latter, there is a comment on their

intent to assess administration perspectives.

Schmelkin et al. (1997) explored faculty perspectives on the usefulness of student ratings

concerning both formative and summative purposes, as well as the actual use of SRs for

summative purposes. By examining resistance or acceptance of SRs among the faculty, as well


as their general attitudes toward SRs and the faculty’s perceptions of the use of SRs in

administrative decisions, it was found that faculty members do not show much resistance to SRs

or toward their use in formative or summative evaluations by the administration. The faculty

reported, in order of high to low importance, that feedback information on their interactions with

students, feedback on grading practices, global ratings of the instructor and course, and structural

issues of the course were found to be most useful. Faculty also rated assistance by professional

teaching consultants as very important regarding interpreting SR feedback. Overall, faculty rated

SRs as useful.

Spencer and Schmelkin (2002) looked at student perspectives concerning SRs assessing

teaching and its evaluation. The overarching theme was that students are generally willing to

complete evaluations and provide feedback with no particular fear of repercussions. It has also

been found that although students have no major qualms about completing SRs, they are unsure

of the overall weight these reports have on the administration and faculty. The students’ overall

wish seems to be “to have an impact, but their lack of (a) confidence in the use of the results; and

(b) knowledge of just how to influence teaching, is reflected in the observation that they do not

even consult the public results of student ratings” (p. 406).

Evaluation

Strengths

The issues surveyed here have been relevant since the inception of student ratings.

Although there are apparent differences and difficulties concerning the use and usefulness of

SRs, there has been a good deal of literature on the topic in an attempt to remedy these problems

for the betterment of teaching. Along the same lines, even though there are inherent problems

with SRs, the general populace of academia can now become familiar with these issues and be


aware of them when making faculty decisions, as well as making decisions on how to use the

data collected through SRs.

Knowing the issues of expected grades, non-explicit behaviors, validity, and faculty and

student perspectives allows the administration and faculty to improve not only their institution,

but also their teaching styles. These reasons also provide evidence for summative ratings, which

include alumni ratings, outside observers, SRs, etc. By using these different methods, it is thus

possible to reduce the effect of the biases presented above on the evaluation of faculty.

Weaknesses

The weaknesses of these articles are that they show inconsistent data,

methodologies, and conceptual frameworks, and even problematic assertions. The incompatibility of

these studies makes it difficult to compare across articles and topics to create a cohesive picture

of SRs. For instance, considering validity of SRs, the literature cannot come up with consistent

definitions of construct validity, which should consist of teaching effectiveness. However, the

literature is divided into different camps assessing small differences on this topic. If validity

issues cannot be resolved, there is little hope for a cohesive construction of SRs in the future, as

validity issues are the backbone of any conceptual framework and method of study. Beyond this,

other weaknesses cannot be duly and fairly assessed until this problem is resolved. Given the

unrelenting problems with these issues concerning SRs, it has become nearly impossible to

further investigate smaller problems and contributors to the initial problem of the SRs.

Conclusion

The findings provide mixed support for the possibilities concerning the effect of biasing

factors on SRs. First, I considered expected grades. With contradictory results from Centra

(2003), Griffin (2004), and Maurer (2006), there is a lot of conceptual work that needs to be done


in order to find a common ground to make a decision about whether there is a biasing factor of

expected grades on SRs. Perhaps this inconsistency is caused by some factor that has not yet

been found. There is also the possibility that this inconsistency is due to varying definitions and

measures of SR constructs, such as the actual items given to students on their SR forms. In order

to resolve this issue in the future, there will need to be a cohesive definition of expected grades, a

firm conceptual basis that is agreeable to both sides of the argument, and possibly even a

different path to be followed that includes looking for other related causes that may appear to be

expected grades but are in actuality something else.

Second, non-explicit behaviors have been popular to mention, but not as popular to study

in the recent literature. Studies point to these factors as being significantly related to the

construct of teaching effectiveness. The behaviors of interest in this review were not just

expected grades, but also humor of the instructor and closeness of the instructor to the students.

This suggests that teaching effectiveness is either not uni-dimensional as it has been portrayed in

the past, or there are just subcategories that must be considered and accounted for in the results

of SRs.

Third, the validity and utility of SRs have been

hotly debated in recent years. Once again, there are conflicting viewpoints. One side is embraced by the

supporters of SRs. Perhaps the most dominant of these proponents is Abrami (e.g., 2001a, 2001b),

with support from others like Renaud and Murray (2005). Together, they feel that

SRs should be used, albeit with minor changes, for the betterment of the educational system. On

the other side of the debate sits Olivares (2003), who feels that the inherent problems of SRs are

too many and too problematic to fix, suggesting there should be other methods in place to assess

teaching effectiveness. As Olivares (2003) put it, “data suggests that the institutionalization of


SRTs [SRs] as a method to evaluate teacher effectiveness has resulted in students learning less in

environments that have become less learning- and more consumer-oriented” (p. 243; emphasis

in original).

As the issue of utility persists, once again a conceptualization of what should be

considered good or bad by definition needs to be established to determine whether SRs are worth

the effort or a waste of time and resources to the institution. Each institution will need to

establish its need for SRs and whether it intends to use them in the future. Although I feel

each institution should use SRs to aid its faculty, the level to which these reports are used is

ultimately up to the institution.

Lastly, studies by Schmelkin et al. (1997) and Spencer and Schmelkin (2002) indicated

that the general feelings of both students and faculty concerning SRs are relatively positive. The

problems persist that the feedback forms are often not explained to the faculty and thus provide

little aid in faculty development. The students also seem to have little reservation in completing

SRs, yet they are uncertain of the effect that these surveys have on both the administration and

the faculty.

Application

It appears that in the past few years, there has been very little change in the literature

about student ratings. These forms have taken a prominent position in institutions of all sizes,

but there is a continued debate as to their usefulness or applicability. So what is next? As

student ratings seem to be here for the long run and have such a strong following at the

institutional level, there needs to be some sort of codification that allows for SRs to be useful

tools, as they were originally intended decades ago.


Once we are able to find a common ground for SRs, at least at an individual institution

level, these schools will be able to assess the utility of their own SRs, as well as make changes to

them in order to get the necessary information needed to assess their faculty. As proposed by

Scriven (1983, in Kulik, 2001) and Theall and Franklin (2001), among others, the use of

summative evaluations still seems to be among the best methods to assess faculty teaching

effectiveness. By bringing in outside observers, alumni ratings, and even interviews with the

faculty, it is possible to look at more “pieces of the puzzle” if you will, versus the inconsistent

findings of SRs.

The issues with finding the best measures for one’s institution and assessing the utility of

the measure vary drastically. Studies have illustrated that, beyond the inconsistencies between

positive and negative findings, there are issues with biasing factors, such as expected grades

(Centra, 2003; Greenwald & Gillmore, 1997a, 1997b; Griffin, 2004) and non-explicit factors

(Adamson et al., 2005; Safer et al., 2005), validity (Abrami, 2001a, 2001b; Olivares, 2003;

Renaud & Murray, 2005), and preferences by the students and faculty (Schmelkin et al., 1997;

Spencer & Schmelkin, 2002). Once these issues are resolved, or the institutions that choose to

use SRs decide the emphasis placed on SRs given these issues, they can administer these

surveys.

After the institution has decided upon the measures it feels will assess the construct of

teaching effectiveness, it must communicate the results of these assessments clearly to its

faculty. As Penny and Coe (2004) suggest, the communication and clarification of these results

to the faculty is the only way to have increased certainty that these measures are being used to

their full potential.


References

Abrami, P. C. (2001a). Improving judgments about teaching effectiveness: How to lie without

statistics. New Directions for Institutional Research, 27 (5), 97-102.

Abrami, P. C. (2001b). Improving judgments about teaching effectiveness using teacher rating

forms. New Directions for Institutional Research, 27 (5), 59-87.

Adamson, G., O’Kane, D., & Shevlin, M. (2005). Students’ ratings of teaching effectiveness: A

laughing matter? Psychological Reports, 96, 225-226.

Centra, J. A. (1972). The Student Instructional Report: Its development and uses. Princeton, NJ:

Educational Testing Service.

Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades

and less course work? Research in Higher Education, 44, 495-518.

Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis.

Research in Higher Education, 13, 321-341.

Greenwald, A. G., & Gillmore, G. M. (1997a). Grading lenience is a removable contaminant of

student ratings. American Psychologist, 52, 1209-1217.

Greenwald, A. G., & Gillmore, G. M. (1997b). No pain, no gain? The importance of measuring

course workload in student ratings of instruction. Journal of Educational Psychology,

89, 743-751.

Griffin, B. W. (2004). Grading leniency, grade discrepancy, and student ratings of instruction.

Contemporary Educational Psychology, 29, 410-425.

Guthrie, E. R. (1954). The Evaluation of Teaching: A Progress Report. Seattle: University of

Washington.


Heckert, T. M., Latier, A., Ringwald, A., & Silvey, B. (2006). Relation of course, instructor, and

student characteristics to dimensions of student ratings of teaching effectiveness. College

Student Journal, 40, 195-203.

Kulik, J. A. (2001). Student ratings: Validity, utility, and controversy. New Directions for

Institutional Research, 27 (5), 9-25.

Maurer, T. W. (2006). Cognitive dissonance or revenge? Student grades and course evaluations.

Teaching of Psychology, 33 (3), 176-179.

Olivares, O. J. (2003). A conceptual and analytic critique of student ratings of teachers in the

USA with implications for teacher effectiveness and student learning. Teaching in

Higher Education, 8, 233-245.

Ory, J. C., & Ryan, K. (2001). How do student ratings measure up to a new validity framework?

New Directions for Institutional Research, 27 (5), 27-44.

Penny, A. R., & Coe, R. (2004). Effectiveness of consultation on student ratings feedback: A

meta-analysis. Review of Educational Research, 74, 215-252.

Renaud, R. D., & Murray, H. G. (2005). Factorial validity of student ratings of instruction.

Research in Higher Education, 46, 929-953.

Safer, A. M., Farmer, L. S. J., Segalla, A., & Elhoubi, A. F. (2005). Does the distance from the

teacher influence student evaluation? Educational Research Quarterly, 28 (3), 28-35.

Schmelkin, L. P., Spencer, K. J., & Gellman, E. S. (1997). Faculty perspectives on course and

teacher evaluations. Research in Higher Education, 38, 575-592.

Scriven, M. (1983). “Summative Teacher Evaluation.” In J. Millman (Ed.), Handbook of

Teacher Evaluation. Thousand Oaks, CA: Sage.


Spencer, K. J., & Schmelkin, L. P. (2002). Student perspectives on teaching and its evaluation.

Assessment & Evaluation in Higher Education, 27, 397-409.

Theall, M., Abrami, P. C., & Mets, L. A. (2001). The student ratings debate: Are they valid?

How can we best use them? New Directions for Institutional Research, 27 (5, Serial No.

109).

Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or

a witch hunt in student ratings of instruction? New Directions for Institutional Research,

27 (5), 45-56.
