Student Ratings Debate 1


Running head: STUDENT RATINGS DEBATE




                                Monograph Critique

       The Student Ratings Debate: Are They Valid? How Can We Best Use Them?

                              Matthew J. Hendrickson

                                Ball State University

                            ID 602: Institutional Research


                                       Monograph Critique

          The Student Ratings Debate: Are They Valid? How Can We Best Use Them?

       Course evaluations have become commonplace for any student. These evaluations are

given every semester at the culmination of every course as a means to assess both the courses

and the professors. These student ratings go by many names, such as teacher rating forms

(TRFs), teacher course evaluations (TCEs), student ratings of teaching effectiveness (SRTEs), or

student evaluations of teaching (SETs; Theall, Abrami, & Mets, 2001). For the remainder of this paper, I will refer to these evaluations as student ratings (SRs). The instructions on these SRs often explain poorly how the surveys should be completed, and many students do not take them seriously. These shortcomings could be caused by many factors, but they are merely symptoms of the problems discussed below.

                                             Summary

       Since the inception of student ratings at the University of Washington, followed closely by Purdue University's publication of the first research study on these ratings, there has been continual debate (Kulik, 2001). The original purpose of student ratings (SRs) was to aid administrators in monitoring teaching quality and to help faculty improve their teaching (Guthrie, 1954; in Kulik, 2001). Today, many of the purposes of the ratings focus on the hiring

of new faculty, annual reviews of current faculty, promotion and tenure decisions, school

accreditation reviews, selection of teachers for teaching awards, assigning teachers to courses, as

well as many others (Kulik, 2001). When institutions seek ways to evaluate faculty, these ratings seem to have taken precedence and are often the strongest force, if not the only determinant, in faculty reviews (Abrami, 2001b). These reviews typically concern tenure positions and salary

increases (Ory & Ryan, 2001).


                                Part I-Summarizing the Evidence

       Other types of evaluations. Scriven (1983; in Kulik, 2001; Theall & Franklin, 2001)

points out many other methods to be considered when assessing faculty. He uses the summative

approach of evaluation, which combines student learning, expert visits to the classroom, and

alumni surveys. These are in line with those presented by Kulik (2001) as the four most credible indicators of effectiveness: student learning, student comments, alumni ratings, and ratings of teaching by outside observers. Kulik (2001) correlated SRs with these four measures. The

findings suggest that the average correlation between SRs and student learning was .43 (moderate/high; Cohen, 1981; in Kulik, 2001). SRs correlated .81 with student comments gathered in interviews and .94 with written comments (very high; Ory, Braskamp, & Pieper, 1980; in Kulik, 2001). Alumni ratings correlated with SRs at .69 in a cross-sectional design (high; Feldman, 1989; in Kulik, 2001) and at .83 when ratings made 1.4 years after graduation were compared with students' original ratings (Overall & Marsh, 1980; in Kulik, 2001). Lastly, the correlation between SRs and observer ratings was .50 (high; Feldman, 1989; in Kulik, 2001).
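       To make concrete what a correlation of this kind expresses, the sketch below computes a Pearson correlation between two sets of course-level means. The data are entirely hypothetical and the variable names are my own; this is an illustration of the statistic, not a reanalysis of any study cited above.

```python
# Hypothetical illustration: correlating mean SR scores with alumni ratings
# for the same set of courses. All values are invented for demonstration only.
from statistics import correlation  # available in Python 3.10+

sr_means = [4.2, 3.1, 4.8, 2.9, 3.7, 4.5, 3.3, 4.0]      # mean student rating per course
alumni_means = [4.0, 3.4, 4.6, 3.0, 3.5, 4.4, 3.6, 3.9]  # mean alumni rating, same courses

r = correlation(sr_means, alumni_means)  # Pearson r
print(f"Pearson correlation between SRs and alumni ratings: {r:.2f}")
```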

       Rating issues. All of the following issues have once been considered as possible

grievances and have long since been discounted. Even so, they are important to note as those not

familiar with the area may also posit these questions (Kulik, 2001). High ratings do not imply low learning; rather, the teachers who attain high ratings tend to be the ones who impart the most knowledge to their students. Showmanship is not rewarded by high ratings, as

students are able to tell whether a faculty member knows the material. Body language has been

brought up as a concern, but no conclusive evidence has been shown that the more animated a

teacher, the more learning occurs. The teacher may be liked better, but that does not necessarily

reflect the amount of learning in a class as measured by SRs. High ratings have also been shown not to be a predictor of lenient grading, as many classes receive low ratings even when the "given" grades are higher than usual. Increased vocal expressiveness has likewise been shown to have little impact on student learning. Again, such a change may make the course or instructor more favorable or enjoyable, but it has no bearing on the student ratings of the actual course content or the assessment of the teacher.

Validity

        As with any type of measure, validity questions arise and are continually debated. The

issues concerning validity center primarily on whether the measure assesses the construct in question (construct validity). To determine this, there must be clear definitions of the construct and any pertinent conditions. Another issue to consider is whether the construct can actually be measured in the form in which the measure is presented, in this case student evaluation forms of faculty.

        Construct validity. Messick (1989; in Ory & Ryan, 2001) suggested six aspects of construct validity: content, substantive, structural, external, generalizability, and consequential elements. Essentially, all of these aspects combine to form a unified idea of construct validity that assesses whether the measure truly captures what it was designed to measure. Concerning construct validity, there are two main threats: construct under-representation and construct-irrelevant variance. Construct under-representation occurs when the construct is defined too narrowly to include all of its critical dimensions. Construct-irrelevant variance occurs when the measure captures extraneous variables that are unrelated to the construct.

        Collecting evidence and establishing construct validity. There have been five major

types of studies in the past to determine if SRs can be considered valid measures of teaching


quality (Abrami, d’Apollonia, & Cohen, 1990; in Ory & Ryan, 2001). These five types are

multisection, multitrait-multimethod, bias, laboratory, and dimensionality; all of which have

provided the data necessary to collect the essential evidence for the six aspects mentioned above.

Substantial evidence has been provided for each of these six aspects; what is needed now is a consensus on the definitions and constructs. Concerning the content aspect, we must decide what exactly "teaching" is (Kulik, 2001). Without a clear definition, how can we assess whether it is truly being measured? The substantive aspect raises questions about whether the SR process actually fits the construct being measured (e.g., critical thinking). Determining whether the item inter-relationships correspond to the construct deals with the structural aspect of construct validity. Simply because items correlate together does not mean that they are measuring what was intended; it means only that they are measuring the same things. The external aspect focuses on whether the measure produces results similar to those expected, particularly given previous findings. A common problem in research is whether the findings can be generalized, which depends on the subjects used: at which level of education should "achievement" be assessed, undergraduate courses, courses in the major, and so on (Centra, 1993; in Ory & Ryan, 2001)? Lastly, the consequential aspect deals with the positive and negative, as well as the intended and unintended, outcomes for faculty that depend on their ratings. One should be aware of the

implications of their research when designing a study or measure.

       Bias. Moving past the issues of validity, Theall and Franklin (2001) do not feel that these

are true problems. Rather, they argue that it is the use of the data that is biased. "Even when the data are

technically rigorous, one of the major problems is day-to-day practice: student ratings are often

misinterpreted, misused, and not accompanied by other information that allows users to make

sound decisions" (Theall & Franklin, 2001, p. 46). This misunderstanding presents problems at all levels, from the students filling out the SRs to, more specific to this chapter, the faculty and administrators who are unable to use the resulting data correctly. Theall and

Franklin (2001) suggest that rather than these individuals attempting to refute the validity and

reliability of these measures, perhaps they should try to understand them more fully, as well as

understand how they work. They feel that when this occurs, many of the problems will begin to

dissipate.

        Rating myths. This discourse raises the question of whether students are qualified to rate their

instructors and the instruction received. As the students spend the most time in interaction with

that specific faculty member, they are most likely to correctly evaluate their experiences.

Popularity has been thought to have an effect on ratings; although this possibility cannot be wholly refuted, it also cannot be confirmed at this point. Questioning whether ratings are related to learning is a

curious idea in that there is no real definition of learning available. The research suggests

(Cohen, 1981; in Kulik, 2001) that there appears to be a link between ratings and grades. It is

further illustrated that this could be due to a number of factors, such as student engagement

rather than due to possibilities such as the “easy grader.” Another problem is whether students

are able to make accurate judgments about their experiences while they are still in school.

Research suggests that SRs remain constant for periods up to thirteen years (Marsh, 1992; in

Theall & Franklin, 2001). SRs have been shown to be reliable in that they are "remarkably

consistent” (Theall & Franklin, 2001, p. 50). There appears to be no gender bias in responses in

SRs. Although research indicates that SRs are generally not affected by situational variables, it

is important to note that there will always be variations in studies. Lastly, the notion that

students rate teachers based on expected or given grades does not seem to hold true. As noted

earlier, good grades tend to accompany good ratings, but this is consistent with the learning-


satisfaction relationship. This relationship suggests that good teaching fosters learning, which

encourages student achievement, which increases satisfaction, which in turn creates ratings that

reflect this sequence.

       Suggested guidelines. Theall and Franklin (2001) propose a set of guidelines that are

intended to create good evaluation practice. They are (pp. 52-54):

       1.   Establish the purpose of the evaluation and the uses and users of ratings beforehand.

       2.   Include all stakeholders in decisions about evaluation process and policy.

       3.   Publicly present clear information about the evaluation criteria, process, and procedures.

       4.   Produce reports that can be understood easily and accurately.

       5.   Educate the users of ratings results to avoid misuse and misinterpretation.

       6.   Keep a balance between individual and institutional needs in mind.

       7.   Include resources for improvement and support of teaching and teachers.

       8.   Keep formative evaluation confidential and separate from summative decision making.

       9.   Adhere to rigorous psychometric and measurement principles and practices.

       10. Regularly evaluate the evaluation system.

       11. Establish a legally defensible process and a system for grievances.

       12. Consider the appropriate combination of evaluation data with assessment and institutional

            research information.

Use

       SRs are meant to improve teaching in two important ways: through effects at the institutional level and through effects on the teachers themselves (Kulik, 2001). Institutional effects can be seen in hiring, promotion, tenure, and course assignment decisions. Effects on teachers can be seen in the feedback they receive from those most affected by their teaching style, the students. By using

the summative approach (Scriven, 1983), better assessment may take place to make institutional

reforms or to improve the quality of the current faculty.


                            Part II-Suggestions for New Methodologies

SRs and summative decisions.

       Improving the decision-making process. Bearing in mind the information presented about validity, we must consider this quote cautioning against SRs being used as the only determinant of faculty ratings. "It cannot be emphasized strongly enough that the evaluation

questionnaires of the type we are discussing here measure only the attitudes of students towards

(sic) the class and instructor. They do not measure the amount of learning which has taken

place” (Canadian Association of University Teachers, 1991, p. 1; in Abrami, 2001b). Thus, the

summative approach of using all of the available resources is once again recommended as the

best method for making such important decisions regarding faculty (Abrami, 2001b). By implementing statistical hypothesis testing, it becomes possible to support these recommended summative judgments about faculty promotion. An important

distinction to make is that of norm- versus criterion-referenced evaluations. Norm-referenced

evaluations deal with how the individual faculty compare to an appropriate collection of faculty,

much like a referent peer group. Criterion-referenced evaluations concern how individual

faculty compare to a pre-determined standard of excellence.
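       As a minimal sketch of this distinction (the numbers, the criterion value, and the use of a z-score are my own illustrative assumptions, not a procedure taken from the monograph), a norm-referenced judgment locates an instructor within a peer distribution, while a criterion-referenced judgment compares the instructor with a fixed standard:

```python
# Hypothetical sketch of norm- vs. criterion-referenced use of a mean SR score.
from statistics import mean, stdev

peer_means = [3.6, 3.9, 4.1, 3.4, 4.3, 3.8, 4.0, 3.7]  # mean SRs of a referent peer group
instructor_mean = 4.2                                    # the instructor being evaluated
criterion = 4.0                                          # pre-determined standard of excellence

# Norm-referenced: where does the instructor fall relative to peers?
z = (instructor_mean - mean(peer_means)) / stdev(peer_means)
print(f"Norm-referenced z-score relative to peers: {z:.2f}")

# Criterion-referenced: does the instructor meet the fixed standard?
print(f"Meets criterion of {criterion}: {instructor_mean >= criterion}")
```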

       SR validity estimates in the decision process. As classical measurement theory states, we seek the true score, the hypothetical value that best represents our construct. Because the true score cannot be observed directly, we must use obtained scores, those that we can actually measure. The obtained score is the true score plus an error score (Abrami, 2001b); that is, obtained score = true score + error, so error = obtained score − true score. Thus, an SR score = teaching effectiveness + error. Next, the reliability coefficient is calculated to estimate the relationship between true and obtained scores. The stronger this relationship, the smaller the error variance, and the more certain one can be that the intended construct, rather than some extraneous variable, is being measured.
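       A small simulation can make the obtained = true + error relationship concrete. The sketch below assumes classical test theory and invented values; reliability is estimated here simply as the ratio of true-score variance to obtained-score variance, so as error variance grows, the estimate falls.

```python
# Minimal classical-test-theory sketch: obtained score = true score + error.
# Data are simulated; "teaching effectiveness" values are entirely hypothetical.
import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

random.seed(0)
true_scores = [random.gauss(4.0, 0.5) for _ in range(1000)]  # hypothetical true scores

for error_sd in (0.1, 0.5, 1.0):
    obtained = [t + random.gauss(0.0, error_sd) for t in true_scores]  # obtained = true + error
    reliability = variance(true_scores) / variance(obtained)           # share of true-score variance
    print(f"error SD = {error_sd:.1f} -> estimated reliability = {reliability:.2f}")
```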

        Abrami (2001b) proposes nine recommendations for improving judgments about teaching

effectiveness using SRs (p. 83). A quick summary of these points is provided: 1) report the

average of several global items, 2) combine the results of each faculty member's courses, 3) decide in advance the policy for excluding SR scores, 4) choose between norm- and criterion-

referenced evaluation, 5) follow the steps in statistical hypothesis testing, 6) provide descriptive

and inferential statistics visually, 7) incorporate SR validity estimates into statistical tests and

confidence intervals, and 9) decide whether and to what extent to weigh sources of evidence

other than SRs.
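       A rough sketch of how several of these steps (choosing a criterion, testing a hypothesis, and reporting a confidence interval) might be carried out is given below. The ratings, the criterion value, and the simple one-sample t procedure are my own illustrative assumptions, not Abrami's exact method.

```python
# Hypothetical sketch: test an instructor's mean global SR against a criterion
# and report a 95% confidence interval. Data and criterion are invented.
from statistics import mean, stdev
from math import sqrt

ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4]  # global-item ratings from one course
criterion = 3.5                                  # pre-determined standard of excellence

n = len(ratings)
m, s = mean(ratings), stdev(ratings)
se = s / sqrt(n)
t_stat = (m - criterion) / se   # one-sample t statistic against the criterion
t_crit = 2.201                  # two-sided critical t for df = 11, alpha = .05
ci = (m - t_crit * se, m + t_crit * se)

print(f"mean = {m:.2f}, t = {t_stat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("Exceeds criterion" if ci[0] > criterion else "Does not clearly exceed criterion")
```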

        Theall (2001) points out several issues with Abrami's (2001b) proposals. He questions whether Abrami's approach is feasible on a day-to-day basis. The suggestions were sound, set appropriate limits, and were scholarly and rigorous. However, these methods may be too demanding or time-consuming for untrained evaluators, or even for general use. Beyond this, will this

added level of precision yield better results? In short, is the benefit worth the extra work?

Considering the actual individuals who receive the data, are the data readable, or are the targets

trained well enough to interpret that information? Lastly, if one buys into Abrami’s proposition,

they agree that there “is an established, significant, and meaningful relationship between ratings

and learning and because of this, precision in the presentation of information about ratings

results is very important, necessary, and useful" (Theall, 2001, p. 95). He opens the next paragraph by asking, "we know that we are accurately and precisely measuring something, but is it purely the quality of teaching that we are measuring" (Theall, 2001, p. 95)?


       In an attempt to reconcile these issues, Abrami (2001a) acknowledges that SRs do not provide a perfect estimate of instructor impacts on student learning and other criteria for effective teaching; rather, they provide only general indications. This imprecision may be known to researchers and administrators, but it is not always made clear to the reader or accounted for in practice.

Due to this, we often have problems with summative evaluations and their proper use. Abrami

(2001a) also suggests that these statistical methods are simply recommendations to aid in

eliminating human bias toward being either too strict or too lenient when it comes to SRs.

                                             Evaluation

       This monograph focused heavily on how SRs can be made useful and valid. The chapters, however, read as point-counterpoint arguments. The debate cycled through showing how SRs are useful, having that usefulness challenged on validity grounds, and then offering a retort that those challenges were not true issues, and so on until the end of the review.

       Part I of the monograph consisted of three chapters constructed to summarize the

previous research and evidence for the applicability of SRs. The first chapter put forth the ideas

that SRs were very useful but needed to be used in conjunction with other measures to reach their full potential (Kulik, 2001). Chapter two posited that other issues,

such as construct validity and its sub-categories, needed to be assessed in order to determine if

the SRs were even capable of being useful (Ory & Ryan, 2001). The next chapter finished part I

of the monograph with the biasing factors that have been involved, either intentionally or

incidentally, in SR research (Theall & Franklin, 2001).

       The second half of the monograph contained three chapters that suggested alternate

methodologies to improve the current SRs and their uses. Chapter four focused on the five


issues surrounding SRs concerning summative decisions (Abrami, 2001b). Chapter five

questioned whether improved models are practical on a day-to-day basis (Theall, 2001). The

volume culminated in a rejoinder by Abrami (2001a), in which he argues that although these

recommendations may seem imposing at first, their usefulness and positive implications will

prove beneficial after some training and understanding.

Strengths.

       Many of the arguments put forth by Abrami (primarily 2001b) were very strong. He

appeared to have a clear understanding of the current issues with the literature and put much

effort into developing new methods to reduce or eliminate many of these imposing factors. It seemed that many of his new ideas were criticized, while no alternatives were offered by his contemporaries.

       Another strong point was that many of the authors, although often in disagreement, were

able to agree on a general set of problems with SRs. This set included topics such as validity

issues and general support for the summative approach. It seems that although this area is still

largely debated, there is a general consensus regarding the overall use of SRs.

Weaknesses.

       One of the peculiarities of chapter three concerns the assertion that in the face of a lack of

evidence, unsubstantiated claims are automatically rejected (Theall & Franklin, 2001). Although

this may be an acceptable position, a lack of evidence alone should not be the logical basis for completely rejecting a claim. Scientific rigor applied to topics such as these should warrant further investigation of the concept rather than full and immediate rejection. This is primarily in reference to ratings

based on popularity (Theall & Franklin, 2001, p. 49).


           Although I found this monograph interesting, it seemed to focus on some moot points. Perhaps this is due to the level of training of entry-level professionals, or of those who are not part of a rigorous, research-based program. Even so, the chapters seemed to dwell on points that were made repeatedly, such as validity. Granted, these issues are important, but they seemed to be stated in every chapter and reiterated with nothing more being added.

           Another weakness, although on a slightly different plane, is that of the individuals who

are holding on to those threads of bias that have been largely disproven. In the cases where the

biases have not been disproven, the studies that suggest support for these biases have been shown

to be invalid. At a bare minimum, these biases have had no support since their creation.

                                              Conclusion

           The research on SRs has shown that student ratings agree well with other measures of teaching effectiveness and that SRs are valuable to teachers (Kulik, 2001). It has been illustrated in the monograph and above that the original intention of SRs was to aid administrators in monitoring teaching quality and to help faculty improve their teaching. The current methods provide feedback, but it is often incomplete and unclear. When this occurs, the results no longer serve their purpose.

           We know that SRs have become one of the strongest determinants in faculty evaluation,

although they have many downfalls. For instance, arguments concerning validity issues,

particularly construct validity, and biases play large roles in the ratings faculty receive. The ongoing debate over extraneous variables, for example whether high ratings simply reflect easy grading, will continue in the future. It is interesting to note that many of these issues have been

resolved and disproven, but in true human fashion, we are slow to let go of our intuitions and

beliefs.


       There is still a long way to go to find the “perfect” construct for measuring teacher

effectiveness and student learning. By perfect, I mean finding the constructs that have the least amount of error and thus best assess the concepts of teacher effectiveness and student learning.

However, as we continue to analyze the current methods, there have been great strides in

attempting to find the best way to analyze teachers (SRs, outside ratings, alumni ratings,

summative methods, etc.). Alongside these measures, there is also a small group, including Abrami, who are looking to find new and more accurate methods of analyzing faculty ratings.

Like any other scientific endeavor, the past is clear, the mistakes we have made seem to stand

out, but the future has yet to be discovered.

                                            Application

       These issues are very pertinent to Institutional Research (IR) offices, as practically every office collects and uses SR data every semester. Given this, and the fact that there are very few IR programs, or even individuals who know about IR, knowing how to analyze and use SRs is very important. Consider that these SRs, as illustrated above, are such an imposing force on faculty. Combine this with the knowledge that many professionals, and even the faculty themselves, do not understand the underlying theory behind how to use SRs. On the one hand, it is disturbing that so many individuals' reputations and careers rest on the balance of these

measures. On a positive note, however, the validity and reliability of these scales have proven to

be high and very stable.

       In order to avoid grievous mistakes in the future, this working knowledge of SRs must be improved or current practices changed. By either implementing training programs for SRs or providing more detailed and explanatory results documents, many of the background issues of SRs will be


resolved. Having a better trained staff will also provide other benefits in the IR department as a

whole, not just pertaining to SRs.

       The other problems will persist, however, until we can correct the SRs themselves. Some of the alternative methods seemed useful, but as we have discovered, IR offices like to have continuity. This continuity may be a downfall, as changing the instrument would mean that offices that changed their ratings and those that did not would be unable to compare data. Perhaps a short-term solution is to have a few of the larger and more prominent IR offices use both methods as a trial run to see the potential outcomes. Given the results of this process, the new methods may take hold. In the event that the new methods fail, the traditional method would still be available as a backup plan. If no one is willing to take a chance, no one

will ever know the benefits, or detriments, of these new methods.


                                           References

Abrami, P. C. (2001a). Improving judgments about teaching effectiveness: How to lie without

       statistics. New Directions for Institutional Research, 27 (5), 97-102.

Abrami, P. C. (2001b). Improving judgments about teaching effectiveness using teacher rating

       forms. New Directions for Institutional Research, 27 (5), 59-87.

Abrami, P. C., d’Apollonia, S., & Cohen, P. A. (1990). The validity of student ratings of

       instruction: What we know and what we don’t. Journal of Educational Psychology, 82,

       219-231.

Canadian Association of University Teachers, Academic Freedom and Tenure Committee.

       (1998). Policy on the Use of Anonymous Student Questionnaires in the Evaluation of

       Teaching. Ottawa: Canadian Association of University Teachers.

Centra, J. A. (1993). Reflective Faculty Evaluation: Enhancing Teaching and Determining

       Faculty Effectiveness. San Francisco: Jossey-Bass.

Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis.

       Research in Higher Education, 13, 321-341.

Feldman, K. A. (1989). Instructional effectiveness of college teachers as judged by teachers

       themselves, current and former students, colleagues, administrators, and external

       (neutral) observers. Research in Higher Education, 30, 137-194.

Guthrie, E. R. (1954). The Evaluation of Teaching: A Progress Report. Seattle: University of

       Washington.

Kulik, J. A. (2001). Student ratings: Validity, utility, and controversy. New Directions for

       Institutional Research, 27 (5), 9-25.


Marsh, H. W. (1992). A longitudinal perspective of student evaluations of university teaching:

       Ratings of the same teachers over a thirteen-year period. Paper presented at the 73rd

       Annual Meeting of the American Educational Research Association, San Francisco, Apr.

       1992.

Messick, S. (1989). Validity. In R. L. Linn (ed.), Educational Measurement. (3rd ed.) Old

       Tappan, NJ: Macmillan.

Scriven, M. (1983). “Summative Teacher Evaluation.” In J. Milman (ed.), Handbook of

       Teacher Evaluation. Thousand Oaks, CA: Sage.

Theall, M., Abrami, P. C., & Mets, L. A. (2001). The student ratings debate: Are they valid?

       How can we best use them? New Directions for Institutional Research, 27 (5, Serial No.

       109).

Theall, M. (2001). Can we put precision into practice? Commentary and thoughts engendered

       by Abrami’s “Improving judgments about teaching effectiveness using teacher rating

       forms.” New Directions for Institutional Research, 27 (5), 89-96.

Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or

       a witch hunt in student ratings of instruction? New Directions for Institutional Research,

       27 (5), 45-56.

Ory, J. C., Braskamp, L. A., & Pieper, D. M. (1980). The congruency of student evaluative

       information collected by three methods. Journal of Educational Psychology, 72, 181-

       185.

Ory, J. C., & Ryan, K. (2001). How do student ratings measure up to a new validity framework?

       New Directions for Institutional Research, 27 (5), 27-44.


Overall, J. U., & Marsh, H. W. (1980). Students’ evaluations of instruction: A longitudinal study

       of their stability. Journal of Educational Psychology, 72, 321-325.
