The Student Ratings Debate Continued: What Has Changed?
Running head: STUDENT RATINGS DEBATE

Matthew J. Hendrickson
Ball State University
ID 602: Institutional Research
The debate over the usefulness and applicability of student ratings (SRs) is a long-standing one. Their original purpose was to help administrators monitor teaching quality and to help faculty improve their teaching (Guthrie, 1954, in Kulik, 2001). Today, we seem to be far removed from this original concept, as many institutions use these ratings as the bulk of their faculty reviews (Abrami, 2001b), particularly in decisions about tenure and salary increases (Ory & Ryan, 2001).

To date, more than 2,000 studies have focused on student evaluations of college teachers (Safer, Farmer, Segalla, & Elhoubi, 2005). The main factors these studies have found to affect student evaluations are the subject matter taught, the classroom instructor, the rank of the instructor, the student’s expected grade, the student’s major, whether the course is an elective or required, class enrollment, the enthusiasm and warmth of the instructor, and the course level. A few less common factors have been added to the literature in recent years, including the instructor’s use of humor (Adamson, O’Kane, & Shevlin, 2005) and the closeness of the faculty member to the students (Safer et al., 2005). However, in the past few years many of these issues have receded into the background, as the strongest debates now concern the student’s expected grade (e.g., Centra, 2003; Griffin, 2004; Heckert, Latier, Ringwald, & Silvey, 2006; Maurer, 2006) and the validity of SRs (Olivares, 2003; Renaud & Murray, 2005; Theall, Abrami, & Mets, 2001).

The common theme behind the criticism of student ratings is the repeated finding that higher grades are correlated with higher student satisfaction and higher teacher ratings (e.g., Cohen, 1981, in Kulik, 2001; Kulik, 2001; Safer et al., 2005).
However, others maintain that there is no causal relationship between student grades and teacher ratings, or that the correlation may be due to other factors, such as the ability of the students and the types of students who enroll in particular courses (e.g., upper-division courses or courses in the major; Centra, 2003; Theall & Franklin, 2001). An extension of this topic concerns a few newer considerations of non-explicit behaviors, such as the instructor’s humor and closeness to students (Adamson et al., 2005; Safer et al., 2005).

The validity of student ratings has been questioned in the literature for some time, with an entire monograph dedicated to the question and further research in the years since (Olivares, 2003; Renaud & Murray, 2005; Theall et al., 2001). Arguments on this topic center on the claim that SRs are not valid for faculty promotion and tenure decisions, although they are useful for instructors’ development as they attempt to become better teachers. Once again, viewpoints conflict. Abrami (2001a) suggests that these ratings are in fact usable and beneficial, although the forms need some revision to eliminate confounding variables and human biases. On the other hand, Olivares (2003) suggests that although SRs may benefit instructors in the learning process, they should be used with caution, as they are not usable as a measure of teaching effectiveness. Renaud and Murray (2005) argued that the systematic distortion hypothesis should be taken into account when considering SRs.

The last topic of this review is the perspectives of faculty and students on course and teacher evaluations (Schmelkin, Spencer, & Gellman, 1997) and on teaching and its evaluation (Spencer & Schmelkin, 2002). This line of work is related to the student ratings debate but takes a different view from most of the work in the area. It delves further into summative evaluation as it pertains to faculty satisfaction (Schmelkin et al., 1997), and into students’ willingness to complete SRs and their beliefs about whether faculty take the results seriously (Spencer & Schmelkin, 2002).
Summary

Expected Grades

Although student ratings have raised many issues in the past, the most prominent in the past 5 years has been the effect of expected grades on SRs. The ideas proposed by Greenwald and Gillmore (1997a, 1997b) set off an explosion of research on expected grades, and the ensuing wave of studies has so far produced only mixed results. Centra (2003) found that courses rated at the “just right” level of difficulty, as opposed to too easy or too hard, were rated highest. In stark contrast, Griffin (2004) reports that an instructor’s grading leniency, as perceived by students, was positively associated with almost every rating dimension examined. Maurer (2006) introduced a new account, finding that student ratings appear unrelated to students’ ability to punish instructors and linking the ratings instead to cognitive dissonance theory.

As proposed by Greenwald and Gillmore (1997a), there are five main theories of the grade-ratings correlation:

1. Teaching effectiveness influences both grades and ratings.
2. Students’ general academic motivation influences both grades and ratings.
3. Students’ course-specific motivation influences both grades and ratings.
4. Students infer course quality and own ability from received grades.
5. Students give high ratings in appreciation for lenient grading. (pp. 1210-1211)

Greenwald and Gillmore (1997b) further proposed a Grading Leniency Model in an attempt to remove the unwanted effect of grading leniency on SRs. The model considered the course and instructor, self-reported progress, having the same instructor again, absolute expected grade, relative expected grade, the challenge of the class, the effort involved in the class, involvement in the class, and hours worked per credit. The findings suggested that courses that gave higher grades were better liked and that these courses had lighter workloads.

Centra (2003) used data from over 50,000 college courses taught by teachers who used the Student Instructional Report II (Centra, 1972, in Centra, 2003). After controlling for learning outcomes, expected grades generally did not affect SRs. In fact, contrary to what most faculty believe, natural science courses in which students expected A’s were rated lower, not higher. This runs counter to the common assumption that an “easy class,” or easy grading in a tough class, earns higher SR scores. The study also found that courses rated at the “just right” level, rather than too difficult or too easy, were rated highest, which stands in stark contrast to Greenwald and Gillmore’s (1997a, 1997b) findings. This suggests that students feel instruction is most effective when they are able to manage the course with their level of preparation and ability.

Student perceptions of grading leniency have been shown to be positively associated with higher ratings of instructors (Griffin, 2004). Griffin assessed the three most popular explanations for this positive correlation:

1) The positive correlation between expected grade and student ratings of instruction may be explained as indicating a valid measurement of student ratings since better instruction should result in more learning, better grades, and better ratings.
2) The association between expected grades and ratings of instruction could be spurious and produced for various student characteristics such as motivation.
3) An association between expected grades and ratings could reflect some type of biasing effect. (p. 411)

Griffin suggested that there was support for all three explanations, although to varying degrees.
However, he argued that the most likely, and perhaps the strongest, effect is the third possibility: a biasing effect on ratings. He added that the data appear to reflect a mix of these biasing effects and valid teaching and learning effects. The biasing factors discussed above suggest a penalty effect, in which students who received lower than expected grades consistently gave lower ratings than the rest of the students. Griffin attributes these findings to a self-serving bias, whereby “a student will attempt to protect his or her view of self and assign blame for the lower than expected performance to an external cause. The likely target will be the instructor, so the student will rate the instructor lower, thus a rating penalty effect will occur” (p. 412).

This was not, however, what Maurer (2006) found. His results suggest that student ratings are not related to the ability to punish instructors. Although he agrees that expected grades have a biasing effect on SRs, he suggests that this is not due to a penalty effect, or to the ability to punish instructors. Rather, he suggested that cognitive dissonance plays a role in negative ratings. The basis for this argument is that there is little evidence of a link with revenge, and that most students are either unaware that SRs are used in personnel decisions or do not believe their ratings will affect these decisions. Cognitive dissonance theory holds that when students expect to receive a high grade but actually receive a low one, they are confronted with a discrepancy that they must explain. If this account is correct, only ratings of the instructor should be influenced by expected grade, and ratings of other elements of the course (textbook relevance, etc.) should remain unaffected.
The findings supported this prediction, leading to the conclusion that the effect of expected grades on ratings may be explained by cognitive dissonance theory.

Non-explicit Behaviors

Non-explicit behaviors are argued to create problems with the current data in the area, suggesting that SRs may not be assessing teaching effectiveness but other factors, such as the amount of humor shown by the instructor (Adamson et al., 2005) or the distance of students from the instructor (Safer et al., 2005). The majority of studies on student ratings give some attention to non-explicit factors that influence SRs, whether or not that is the aim of the study. Most articles mention this idea in their introductions and some in their discussions, but most do not attempt to measure or test these factors. The area has received a lot of attention, but there is very little clear evidence about the effect of these influences.

As Adamson, O’Kane, and Shevlin (2005) have shown, there is a significant positive relationship between the humor, or “funniness,” of the instructor and students’ overall ratings. Also of interest when considering non-explicit biases is the distance of the teacher from the students (Safer et al., 2005). This study found that 1) ratings of instructors varied considerably, 2) student grades were positively correlated with students’ ratings of their instructors, and, most importantly, 3) the number of rows in the classroom was negatively associated with SRs. The authors further suggest that class enrollment has a significant relationship with SRs that has thus far been ignored. These two studies suggest that, even though non-explicit biasing factors are prevalent, they have fallen into the background while issues of validity and utility have been argued.

Validity

The monograph edited by Theall, Abrami, and Mets (2001) illustrated many problems with the validity of SRs. Since then, the validity and usability of student ratings have continued to be argued. Abrami (2001a) put it best by stating that SRs may be flawed in some aspects of their design, but that great effort should be put into working out these flaws so that the surveys are useful for the betterment of education. For instance, Abrami felt that by adding more mathematical conditions and formulas to the scoring of SRs, many of the biases, non-explicit or non-verbal behaviors, and even faculty and student perspectives could be addressed, and better surveys could be developed and used for their intended purpose: to foster changes in teaching styles that create better faculty and instructors at one’s institution.

Along these same lines, Renaud and Murray (2005) are also proponents of SRs. They posited that “the literature indicates that student ratings of teaching effectiveness are positively related to objective measures of student learning, and thus can be seen as valid indicators of instructional quality” (p. 929). Using the systematic distortion hypothesis (SDH), which states that traits can be judged as correlated when in reality they are uncorrelated or only weakly correlated, Renaud and Murray (2005) attempt to explain away some of the problems plaguing SRs. By comparing three correlation matrices (one of ratings of personality traits, one of conceptual associations between the same traits, and one of directly observed behaviors corresponding to these traits), one can infer correlations that exist in the minds of the raters rather than in the behavior being rated. For example, students may rate their professors as accessible outside of class because the professors are effective: many of these students never needed the professor outside of the classroom, so they combined these traits and assumed the professor would have been accessible if needed. This difference between correlations bears on two types of accuracy: stereotype accuracy, “the extent to which a profile of ratings agree with the traits or behaviors of an average or typical member of the group which the ratee represents,” and differential accuracy, “the extent to which ratings of a particular individual are congruent with that person’s actual profile” (p. 948).
Olivares (2003) provided an in-depth analysis of the conceptualization of SRs, as well as of many different types of validity and their connections to SRs. He argues that the content validity of SRs is lacking because they do not assess teacher effectiveness in all of its aspects. Criterion validity seems lacking because it depends on the inference that highly rated teachers are effective and lower rated teachers are ineffective. Concerning construct validity, Olivares suggests that the multitrait-multimethod matrix (MTMM) should be used to determine whether SRs truly measure the construct in question, teacher effectiveness. By combining SRs with another method of measuring the same construct, convergent validity can be assessed, since both methods should measure the same thing. He further points out that supporters of SRs agree that teacher effectiveness has not been concretely operationalized, which means there is no clear criterion measure of instructional effectiveness. Compounding the problem of codifying SRs, he observes that both proponents and opponents of SRs have sought primarily to confirm their respective hypotheses rather than to disprove them. He goes on to say that there is no empirical evidence that the widespread implementation of teacher ratings has produced more effective teachers or better-educated, more knowledgeable students.

Faculty and Student Perspectives

Given the large amount of attention paid to SRs and their problems, it is interesting that very few studies have examined the perspectives of faculty and students toward course and teacher evaluations. Schmelkin and Spencer have taken this alternative approach to SRs and have thus far assessed faculty perspectives (Schmelkin et al., 1997) and student perspectives (Spencer & Schmelkin, 2002). At the end of the latter, the authors state their intent to assess administrators’ perspectives.

Schmelkin et al. (1997) explored faculty perspectives on the usefulness of student ratings for both formative and summative purposes, as well as the actual use of SRs for summative purposes. By examining faculty resistance to or acceptance of SRs, their general attitudes toward SRs, and their perceptions of the use of SRs in administrative decisions, the authors found that faculty members show little resistance to SRs or to their use in formative or summative evaluations by the administration. In order from most to least useful, faculty reported that feedback on their interactions with students, feedback on grading practices, global ratings of the instructor and course, and structural issues of the course were the most useful information. Faculty also rated assistance from professional teaching consultants in interpreting SR feedback as very important. Overall, faculty rated SRs as useful.

Spencer and Schmelkin (2002) examined student perspectives on teaching and its evaluation. The overarching theme was that students are generally willing to complete evaluations and provide feedback with no particular fear of repercussions. The authors also found that although students have no major qualms about completing SRs, they are unsure how much weight these reports carry with the administration and faculty. Students’ overall wish seems to be “to have an impact, but their lack of (a) confidence in the use of the results; and (b) knowledge of just how to influence teaching, is reflected in the observation that they do not even consult the public results of student ratings” (p. 406).

Evaluation

Strengths

The issues surveyed here have been relevant since the inception of student ratings. Although there are clear disagreements and difficulties concerning the use and usefulness of SRs, a substantial literature has attempted to remedy these problems for the betterment of teaching. Moreover, even though SRs have inherent problems, the academic community can now become familiar with these issues and keep them in mind when making faculty decisions, as well as when deciding how to use the data collected through SRs.

Knowing the issues of expected grades, non-explicit behaviors, validity, and faculty and student perspectives allows administrators and faculty to improve not only their institutions but also their teaching styles. These issues also make the case for summative evaluation that draws on several sources, including alumni ratings, outside observers, and SRs. By using these different methods, it is possible to reduce the effect of the biases presented above on the evaluation of faculty.

Weaknesses

The weaknesses of these articles are inconsistencies in the data reported, methodology, and conceptual frameworks, as well as some problematic assertions. The incompatibility of these studies makes it difficult to compare across articles and topics to create a cohesive picture of SRs. For instance, regarding the validity of SRs, the literature has not produced a consistent definition of the construct, which should be teaching effectiveness; instead, it is divided into camps that assess small differences on this topic. If validity issues cannot be resolved, there is little hope for a cohesive construction of SRs in the future, as validity is the backbone of any conceptual framework and method of study. Beyond this, other weaknesses cannot be duly and fairly assessed until this problem is resolved. Given these unrelenting problems, it has become nearly impossible to further investigate smaller problems and contributors to the initial problem of SRs.

Conclusion

The findings provide mixed support for the possible effects of biasing factors on SRs. First, I considered expected grades. Given the contradictory results of Centra (2003), Griffin (2004), and Maurer (2006), much conceptual work remains to be done to find common ground for deciding whether expected grades bias SRs. Perhaps this inconsistency is caused by some factor that has not yet been identified. It is also possible that the inconsistency is due to varying definitions and measures of SR constructs, such as the actual items given to students on their SR forms. To resolve this issue, researchers will need a cohesive definition of expected grades, a firm conceptual basis that is agreeable to both sides of the argument, and possibly a different line of inquiry that looks for related causes that may appear to be expected-grade effects but are in fact something else.

Second, non-explicit behaviors have been popular to mention but less popular to study in the recent literature. Studies point to these factors as being significantly related to SR measures of teaching effectiveness. The factors of interest in this review were not just expected grades but also the humor of the instructor and the closeness of the instructor to the students. This suggests either that teaching effectiveness is not unidimensional, as it has been portrayed in the past, or that there are subcategories that must be considered and accounted for in the results of SRs.

Third, the usefulness and utility of SRs have been hotly debated in recent years. Once again, there are conflicting viewpoints. One side is taken by supporters of SRs. Perhaps the most prominent of these proponents is Abrami (e.g., 2001a, 2001b), with support from others such as Renaud and Murray (2005). Together, they argue that SRs should be used, albeit with minor changes, for the betterment of the educational system.
On the other side of the debate sits Olivares (2003), who argues that the inherent problems of SRs are too numerous and too serious to fix, and that other methods should be in place to assess teaching effectiveness. As Olivares (2003) put it, “data suggests that the institutionalization of SRTs [SRs] as a method to evaluate teacher effectiveness has resulted in students learning less in environments that have become less learning- and more consumer-oriented” (p. 243; emphasis in original). As the issue of utility persists, once again a definition of what counts as good or bad teaching needs to be established to determine whether SRs are worth the effort or a waste of the institution’s time and resources. Each institution will need to establish its need for SRs and whether it intends to use them in the future. Although I feel each institution should use SRs to aid its faculty, the extent to which these reports are used is ultimately up to the institution.

Lastly, studies by Schmelkin et al. (1997) and Spencer and Schmelkin (2002) indicated that the general feelings of both students and faculty concerning SRs are relatively positive. The problem persists that feedback forms are often not explained to the faculty and thus provide little aid in faculty development. Students also seem to have little reservation about completing SRs, yet they are uncertain of the effect these surveys have on the administration and the faculty.

Application

It appears that in the past few years there has been very little change in the literature on student ratings. These forms have taken a prominent position in institutions of all sizes, but debate continues as to their usefulness and applicability. So what is next? As student ratings seem to be here for the long run and have such a strong following at the institutional level, there needs to be some sort of codification that allows SRs to be useful tools, as they were originally intended to be decades ago.
Once we are able to find common ground for SRs, at least at the level of the individual institution, schools will be able to assess the utility of their own SRs and change them to obtain the information needed to assess their faculty. As proposed by Scriven (1983, in Kulik, 2001) and Theall and Franklin (2001), among others, summative evaluation drawing on multiple sources still seems to be among the best methods of assessing faculty teaching effectiveness. By bringing in outside observers, alumni ratings, and even interviews with the faculty, it is possible to look at more “pieces of the puzzle,” so to speak, rather than relying on the inconsistent findings of SRs.

The challenges of finding the best measures for one’s institution and of assessing their utility vary drastically. Studies have illustrated that, beyond the inconsistencies between positive and negative findings, there are issues with biasing factors, such as expected grades (Centra, 2003; Greenwald & Gillmore, 1997a, 1997b; Griffin, 2004) and non-explicit factors (Adamson et al., 2005; Safer et al., 2005), with validity (Abrami, 2001a, 2001b; Olivares, 2003; Renaud & Murray, 2005), and with the preferences of students and faculty (Schmelkin et al., 1997; Spencer & Schmelkin, 2002). Once these issues are resolved, or once the institutions that choose to use SRs decide how much emphasis to place on them given these issues, they can administer these surveys.

After the institution has decided upon the measures it feels will assess the construct of teaching effectiveness, it must communicate the results of these assessments clearly to its faculty. As Penny and Coe (2004) suggest, communicating and clarifying these results to the faculty is the only way to increase certainty that these measures are being used to their full potential.
References

Abrami, P. C. (2001a). Improving judgments about teaching effectiveness: How to lie without statistics. New Directions for Institutional Research, 27(5), 97-102.
Abrami, P. C. (2001b). Improving judgments about teaching effectiveness using teacher rating forms. New Directions for Institutional Research, 27(5), 59-87.
Adamson, G., O’Kane, D., & Shevlin, M. (2005). Students’ ratings of teaching effectiveness: A laughing matter? Psychological Reports, 96, 225-226.
Centra, J. A. (1972). The Student Instructional Report: Its development and uses. Princeton, NJ: Educational Testing Service.
Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades and less course work? Research in Higher Education, 44, 495-518.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis. Research in Higher Education, 13, 321-341.
Greenwald, A. G., & Gillmore, G. M. (1997a). Grading lenience is a removable contaminant of student ratings. American Psychologist, 52, 1209-1217.
Greenwald, A. G., & Gillmore, G. M. (1997b). No pain, no gain? The importance of measuring course workload in student ratings of instruction. Journal of Educational Psychology, 89, 743-751.
Griffin, B. W. (2004). Grading leniency, grade discrepancy, and student ratings of instruction. Contemporary Educational Psychology, 29, 410-425.
Guthrie, E. R. (1954). The evaluation of teaching: A progress report. Seattle: University of Washington.
Heckert, T. M., Latier, A., Ringwald, A., & Silvey, B. (2006). Relation of course, instructor, and student characteristics to dimensions of student ratings of teaching effectiveness. College Student Journal, 40, 195-203.
Kulik, J. A. (2001). Student ratings: Validity, utility, and controversy. New Directions for Institutional Research, 27(5), 9-25.
Maurer, T. W. (2006). Cognitive dissonance or revenge? Student grades and course evaluations. Teaching of Psychology, 33(3), 176-179.
Olivares, O. J. (2003). A conceptual and analytic critique of student ratings of teachers in the USA with implications for teacher effectiveness and student learning. Teaching in Higher Education, 8, 233-245.
Ory, J. C., & Ryan, K. (2001). How do student ratings measure up to a new validity framework? New Directions for Institutional Research, 27(5), 27-44.
Penny, A. R., & Coe, R. (2004). Effectiveness of consultation on student ratings feedback: A meta-analysis. Review of Educational Research, 74, 215-252.
Renaud, R. D., & Murray, H. G. (2005). Factorial validity of student ratings of instruction. Research in Higher Education, 46, 929-953.
Safer, A. M., Farmer, L. S. J., Segalla, A., & Elhoubi, A. F. (2005). Does the distance from the teacher influence student evaluation? Educational Research Quarterly, 28(3), 28-35.
Schmelkin, L. P., Spencer, K. J., & Gellman, E. S. (1997). Faculty perspectives on course and teacher evaluations. Research in Higher Education, 38, 575-592.
Scriven, M. (1983). Summative teacher evaluation. In J. Millman (Ed.), Handbook of teacher evaluation. Thousand Oaks, CA: Sage.
Spencer, K. J., & Schmelkin, L. P. (2002). Student perspectives on teaching and its evaluation. Assessment & Evaluation in Higher Education, 27, 397-409.
Theall, M., Abrami, P. C., & Mets, L. A. (Eds.). (2001). The student ratings debate: Are they valid? How can we best use them? New Directions for Institutional Research, 27(5, Serial No. 109).
Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or a witch hunt in student ratings of instruction? New Directions for Institutional Research, 27(5), 45-56.