American Accounting Association

Accounting Education News - Late Fall 1998
Faculty Development

Student Ratings of Teaching:
The Research Revisited

The following excerpts come from Bill Cashin’s IDEA Paper No. 32, from the Center for Faculty Evaluation & Development at Kansas State University. Information about obtaining the full text of IDEA Paper No. 32, including 67 references to related articles, can be found at the end of this article.

“Negative attitudes toward student ratings are especially resistant to change, and it seems that faculty and administrators support their belief in student-rating myths with personal and anecdotal evidence, which [for them] outweighs empirically based research evidence.”

(Cohen 1990, 124–125)

There are now more than 1,500 references dealing with research on student evaluations of teaching. This paper is an update of [IDEA Paper No. 20 that summarizes the research from 1971 to 1988]…. This paper will attempt to summarize the conclusions of the major reviews of the student rating literature from Costin et al. (1971) to the present… Interested readers are encouraged to consult the various reviews and their individual references for details. For readers with less time, both Braskamp and Ory (1994) and Centra (1993) have chapters summarizing the student rating research.

The ERIC descriptor for student ratings is “student evaluation of teacher performance.” I suggest that the term “student ratings” is preferable to “student evaluations.” “Evaluation” has a definitive and terminal connotation; it suggests that we have an answer. “Rating” implies that we have data that need to be interpreted. Using “rating” rather than “evaluation” helps to distinguish between the people who provide the information (sources of data) and the people who interpret it in combination with other sources of data (evaluators)….

Writers on faculty evaluation are almost universal in recommending the use of multiple sources of data. No single source of data—including student-rating data—provides sufficient information to make a valid judgment about overall teaching effectiveness….

Multidimensionality
[Student rating forms are multidimensional, i.e., they] measure several different aspects of teaching. Put another way, no single student-rating item, nor set of related items, will be useful for all purposes.

Both Centra (1993) and Braskamp and Ory (1994) identify six factors commonly found in student rating forms:

  1. Course organization and planning
  2. Clarity, communication skills
  3. Teacher-student interaction, rapport
  4. Course difficulty, workload
  5. Grading and examinations
  6. Student self-rated learning

…When interpreting student-rating data, we must distinguish among the various items and their dimensions to ensure that all of the appropriate dimensions are rated. Averaging dissimilar items is not appropriate.

Although there is general agreement that student ratings are multidimensional, and that various dimensions should be used when their purpose is to improve teaching, there is disagreement about how many, or which, dimensions should be used for personnel decisions. In several articles Abrami (1989a) suggested that one or a few global- or summary-type items might provide sufficient student-rating data for personnel decisions. Centra (1993) and Braskamp and Ory (1994) make a similar recommendation. Cashin and Downey (1992) tested this using the IDEA Overall Evaluation measure as the criterion…. Each of three global items—individually—accounted for at least 50 percent of the variance in the criterion measure: overall instructor effectiveness, 54 percent; overall course worth, 60 percent; overall amount learned, 69 percent;…Controlling for the students’ motivation to take the course, the size of the class, or the difficulty of the subject matter, did not add significantly to the amount of variance explained….
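To put these percentages on the same footing as the correlations quoted elsewhere in this article, the short Python sketch below converts each figure into the correlation it implies. It assumes each percentage is the squared correlation between a single global item and the criterion, which is the usual reading of "variance accounted for" by one predictor. The names `variance_explained` and `implied_correlation` are illustrative only and do not come from the IDEA paper.

```python
import math

# Proportion of criterion variance accounted for by each global item,
# as reported in the paragraph above.
variance_explained = {
    "overall instructor effectiveness": 0.54,
    "overall course worth": 0.60,
    "overall amount learned": 0.69,
}


def implied_correlation(r_squared: float) -> float:
    """Return the size of the correlation implied by a proportion of
    variance explained (assumes a single predictor, so r**2 = r_squared)."""
    return math.sqrt(r_squared)


for item, r2 in variance_explained.items():
    print(f"{item}: r is about {implied_correlation(r2):.2f}")
```

Under that assumption, the three global items correspond to correlations of roughly .73, .77, and .83 with the criterion measure.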

Reliability
…For student-rating items, reliability refers most often to consistency or interrater agreement (i.e., within a given class do the students tend to give similar ratings on a given item). Reliability varies depending upon the number of raters …the more raters, the more reliable. For example, with the IDEA system (Sixbury and Cashin 1995a), the median reliabilities (intraclass correlations) for the 38 items are:

For 10 raters  .69
For 20 raters  .83
For 30 raters  .88
For 40 raters  .91

Similar or high reliabilities are typically found with other well-designed forms. As a rule of thumb, I recommend that items with fewer than ten raters (reliabilities below .70) be interpreted with particular caution.
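The way reliability rises with the number of raters can be approximated with the Spearman-Brown formula, a standard psychometric result that the excerpt itself does not mention. The sketch below is an illustration under that assumption, not a description of how the IDEA figures were computed. It starts from the reported median of .69 for 10 raters, backs out the implied single-rater reliability, and projects forward to larger classes. The function names are hypothetical.

```python
def single_rater_reliability(r_n: float, n: int) -> float:
    """Invert Spearman-Brown: reliability of one rater, given the
    reliability r_n of the average of n raters."""
    return r_n / (n - (n - 1) * r_n)


def projected_reliability(r_1: float, k: int) -> float:
    """Spearman-Brown: reliability of the average of k raters, given
    the single-rater reliability r_1."""
    return k * r_1 / (1 + (k - 1) * r_1)


# Start from the median reliability quoted above for 10 raters.
r1 = single_rater_reliability(0.69, 10)

for k in (10, 20, 30, 40):
    print(f"{k} raters: projected reliability {projected_reliability(r1, k):.2f}")
```

The projections for 20, 30, and 40 raters come out at about .82, .87, and .90, each within .01 of the reported medians of .83, .88, and .91.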

Stability is concerned with agreement between raters over time. In general, ratings of the same instructor tend to be similar over time (Braskamp and Ory 1994; Centra 1993). A longitudinal study (Overall and Marsh 1980) compared end-of-course ratings with ratings by the same students years later (at least one year after graduation). The average correlation was .83.

Generalizability is concerned with how confident we can be that our data accurately reflect the instructor’s general teaching effectiveness, not just [effectiveness] for a particular course that term. …[In a study of 1,364 courses, comparing instructors’ ratings when teaching the same and different courses (a correlation of .71 across courses)] Marsh (1982) concluded that the instructor, not the course, is the primary determinant of the student-rating items….

When making personnel decisions, we want to use the data to make judgments about the instructor’s general teaching effectiveness. When considering student ratings (remembering that we need other kinds of information beyond student ratings), the following seem to be reasonable rules of thumb. If the instructor teaches only one course, consistent ratings from two different terms may be sufficient. For most instructors, however, use ratings from a variety of courses, for two or more courses from every term for at least two years, totaling at least five courses. If there are fewer than 15 raters in any of the classes, data from additional classes are recommended.

Validity
In educational measurement, the basic question concerning validity is: does the test measure what it is supposed to measure? For student ratings this translates into: to what extent do student-rating items measure some aspect of teaching effectiveness? Unfortunately, there is no agreed-upon definition of “effective teaching,” nor any single, all-embracing criterion. The best one can do is try various approaches, collecting data that either support or contest the conclusion that student ratings reflect effective teaching.

Approach One—Student Learning
…Other things being equal, the students of more effective teachers should learn more…. In the typical study, different instructors teach different sections of the same course, using the same syllabus and textbook, and most importantly using the same external final exam, i.e., an exam developed by someone other than the instructors. Cohen (1981) and Feldman (1989b) reviewed these studies. Using the students’ grades on the external exam as the measure of student learning, they examined correlations between the exam grade and various student-rating items. [Some] of the average correlations are given below:

Student ratings of         Cohen (1981)   Feldman (1989b)
Achievement or learning    .47            .46
Overall course             .47
Overall instructor         .44
Teacher skill dimension    .50
Understandableness                        .56
Teacher availability                      .36
Encouraging discussion                    .36

Note on interpreting validity correlations: Earlier I suggested as a rule of thumb that reliability correlations of at least .70 (at least 10 raters) were desirable. However, in the social sciences validity correlations above .70 are unusual, especially if studying complex phenomena, such as student learning. As a rule of thumb, I suggest that student-rating validity correlations between .00 and .29, even when statistically significant, are not practically useful. Correlations between .30 and .49 are practically useful. Correlations between .50 and .70 are very useful but not common when studying complex phenomena.

Using this rule of thumb, the [correlations reported above] are generally useful. These relationships tend to support the validity of student ratings because the classes in which the students gave the instructor higher ratings tended to be the classes where the students learned more, i.e., scored higher on the external exam. On the other hand, the correlations are far from perfect, in part because many of the variables that relate to students’ learning will be related to student characteristics (e.g., motivation or ability), not to instructor characteristics.
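The rule of thumb for validity correlations can be written as a small Python function and applied to the averages reported by Cohen (1981) and Feldman (1989b). This is a sketch with two assumptions the excerpt does not state: the bands are applied to the absolute value of the correlation, and anything at or above .50 (including values over .70) is labeled "very useful." The function name `usefulness` and the dictionary layout are illustrative only.

```python
def usefulness(r: float) -> str:
    """Label a validity correlation using the rule-of-thumb bands above.

    Assumptions: the magnitude of r is what matters, and correlations
    above .70 fall in the same band as .50 to .70.
    """
    magnitude = abs(r)
    if magnitude < 0.30:
        return "not practically useful"
    if magnitude < 0.50:
        return "practically useful"
    return "very useful"


# Average correlations with external-exam performance, as reported above.
reported = {
    "achievement or learning (Cohen 1981)": 0.47,
    "achievement or learning (Feldman 1989b)": 0.46,
    "overall course": 0.47,
    "overall instructor": 0.44,
    "teacher skill dimension": 0.50,
    "understandableness": 0.56,
    "teacher availability": 0.36,
    "encouraging discussion": 0.36,
}

for item, r in reported.items():
    print(f"{item}: r = {r:.2f} -> {usefulness(r)}")
```

Every reported average falls in the "practically useful" band or higher, and none falls below .30, which is the basis for calling them generally useful.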

Approach Two—Instructor’s Self Ratings
… In a review of the literature, Feldman (1989a) cites 19 studies that correlated instructors’ self-ratings with student ratings. The average correlation was .29. However, in one study (Marsh et al. 1979) instructors were asked to rate two different courses in order to see if the course the instructor rated higher was also rated higher by the students. The median correlation between the instructors’ self-ratings and the students’ ratings, based on six factor scores, was .49. In a later report (Marsh and Dunkin 1992) using nine factor scores, the median was .45. Such studies provide further support for the validity of student ratings.

Approach Three—The Ratings of Others
If one is willing to grant that the ratings of administrators, colleagues, alumni, and others have some validity—and, excepting alumni, that these ratings are independent of feedback from students—then student ratings share that validity.

Administrators’ Ratings—Student ratings correlate with administrators’ ratings, ranging from .47 to .62 (Kulik and McKeachie 1975), but Feldman (1989a), using global items, found a lower average correlation of .39.

Colleagues’ Ratings—Student ratings correlate with colleagues’ ratings, .48 to .69 (Kulik and McKeachie 1975); Feldman (1989a) found an average of .55. Marsh and Dunkin (1992) question the usefulness of colleagues’ ratings based on classroom visitation because such ratings tend to be unreliable.

Some faculty question whether the students have an appropriate conception of what effective teaching is. In a review of 31 studies, Feldman (1988) found that the students’ view of effective teaching was very similar to the faculty’s view (average correlation equaled .71)….

Alumni Ratings—Student ratings correlate with alumni ratings, .40 to .75 (Overall and Marsh 1980; Braskamp and Ory 1994). Feldman (1989a) found an average correlation of .69. This belies the conventional wisdom that the students will come to appreciate our teaching after they get into the real world as working adults.…

Variables Not Requiring Control
Despite widespread faculty concern, the research has uncovered relatively few variables that correlate with student ratings but are not related to instructional effectiveness. Generally the following variables tend to show little or no relationship to student ratings:

Age and teaching experience—in general, age and years of teaching experience are not correlated with student ratings. However, where small differences have been found, they tend to be negative, i.e., older faculty receive lower ratings (Feldman 1983). [Marsh and Hocevar (1991) conducted a] longitudinal study analyzing student ratings of the same instructors for as long as 13 years. They found no systematic changes over the years.

Gender of instructor—in a review of 14 laboratory or experimental studies, e.g., where students rated descriptions of fictitious teachers, Feldman (1992) found no differences in global ratings in the majority of studies, but in a few studies the male teachers received higher ratings. In a second review of 28 studies of actual ratings of real teachers reporting global ratings, he (Feldman 1993) found a very slight average difference in favor of women teachers (r = .02). However, a few studies raised the question of whether women faculty had to do more of what was being rated to obtain the same ratings as men.

Race—Centra (1993) points out that there have been hardly any studies of the race of the instructor. In a doctoral dissertation using IDEA, Li (1993) found no difference in the global ratings of Asian students compared to American students of their (presumably Caucasian) instructors….

Research productivity—has little correlation with student ratings (Centra 1993). In his review of the literature, Feldman (1987) found the average correlation between research productivity and overall teaching effectiveness items to be .12…a very low correlation [that] suggests that research productivity is indicative neither of good teaching nor bad teaching….

Usefulness of Student Ratings
Many faculty will grant the usefulness for improvement, preferring to rely on students’ open-ended comments. Cohen (1980) performed a meta-analysis of 17 studies of the effect of student-rating feedback on improving teaching. Receiving feedback about student ratings administered during the first half of the term was positively related to improving college teaching as measured by student ratings administered at the end of the term….

Conclusion: if an institution really intends to use student ratings to improve teaching, it needs to provide some kind of consultation [or early feedback] to the instructors.

Conclusion
There are probably more studies of student ratings than of all the other data used to evaluate college teaching combined. Although one can find individual studies that support almost any conclusion, for a number of variables there are enough studies to discern trends. In general, student ratings tend to be statistically reliable, valid, and relatively free from bias or the need for control, probably more so than any other data used for evaluation. Nevertheless, student ratings are only one source of data about teaching and must be used in combination with multiple sources of data if one wishes to make a judgment about all of the components of college teaching. Further, student ratings are data that must be interpreted. We should not confuse a source of data with the evaluators who use student-rating data—in combination with other kinds of data—to make their judgments about an instructor’s teaching effectiveness.

Full Citation for this Resource
Cashin, W. E. (1995). Student ratings of teaching: The research revisited. IDEA Paper No. 32. Manhattan, KS: Center for Faculty Evaluation and Development, Division of Continuing Education, Kansas State University. Full text is available at http://idea.ksu.edu.

The full list of references cited in this excerpt can be found on the AAA web site at AAA-edu.org (click on Publications) or in hardcopy by contacting the office at (941) 921-7747.