Student Ratings of Teaching: The Research Revisited
The
following excerpts come from Bill Cashin's IDEA Paper
No. 32, from the Center for Faculty Evaluation &
Development at Kansas State University. Information about
obtaining the full text of IDEA Paper No. 32, including 67
references to related articles, can be found at the end of
this article.
Negative
attitudes toward student ratings are especially
resistant to change, and it seems that faculty and
administrators support their belief in student-rating
myths with personal and anecdotal evidence, which [for
them] outweighs empirically based research evidence.
(Cohen 1990, 124-125)
There are now
more than 1,500 references dealing with research on
student evaluations of teaching. This paper is an update
of [IDEA Paper No. 20 that summarizes the research from
1971 to 1988]. This paper will attempt to summarize
the conclusions of the major reviews of the student rating
literature from Costin et al. (1971) to the present.
Interested readers are encouraged to consult the various
reviews and their individual references for details. For
readers with less time, both Braskamp and Ory (1994) and
Centra (1993) have chapters summarizing the student rating
research.
The ERIC
descriptor for student ratings is "student evaluation
of teacher performance." I suggest that the term "student
ratings" is preferable to "student evaluations."
"Evaluation" has a definitive and terminal
connotation; it suggests that we have an answer. "Rating"
implies that we have data that need to be interpreted.
Using "rating" rather than "evaluation"
helps to distinguish between the people who provide the
information (sources of data) and the people who interpret
it in combination with other sources of data (evaluators).
Writers on
faculty evaluation are almost universal in recommending
the use of multiple sources of data. No single
source of data, including student-rating data, provides
sufficient information to make a valid judgment about
overall teaching effectiveness.
Multidimensionality
[Student rating forms are multidimensional, i.e., they]
measure several different aspects of teaching. Put another
way, no single student-rating item, nor set of related
items, will be useful for all purposes.
Both Centra
(1993) and Braskamp and Ory (1994) identify six factors
commonly found in student rating forms:
- Course organization and planning
- Clarity, communication skills
- Teacher-student interaction, rapport
- Course difficulty, workload
- Grading and examinations
- Student self-rated learning
When
interpreting student-rating data, we must distinguish
among the various items and their dimensions to ensure
that all of the appropriate dimensions are rated.
Averaging dissimilar items is not appropriate.
Although there
is general agreement that student ratings are
multidimensional, and that various dimensions should be
used when their purpose is to improve teaching,
there is disagreement about how many, or which, dimensions
should be used for personnel decisions. In several
articles Abrami (1989a) suggested that one or a few
global- or summary-type items might provide sufficient
student-rating data for personnel decisions. Centra
(1993) and Braskamp and Ory (1994) make a similar
recommendation. Cashin and Downey (1992) tested this using
the IDEA Overall Evaluation measure as the criterion.
Each of three global items, individually, accounted
for at least 50 percent of the variance in the criterion
measure: overall instructor effectiveness, 54 percent;
overall course worth, 60 percent; overall amount learned,
69 percent. Controlling for the students'
motivation to take the course, the size of the class, or
the difficulty of the subject matter did not add
significantly to the amount of variance explained.
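
Because each global item was examined individually, the share of variance it explains can be read as the square of its correlation with the criterion. The short Python sketch below converts the reported percentages back into implied correlations. It assumes a simple one-predictor relationship, which the excerpt does not state explicitly, and the function name is hypothetical.

```python
import math

# Share of criterion variance explained by each global item, as reported above.
variance_explained = {
    "overall instructor effectiveness": 0.54,
    "overall course worth": 0.60,
    "overall amount learned": 0.69,
}


def implied_correlation(r_squared: float) -> float:
    """Correlation implied by a variance-explained share (assumes one predictor)."""
    return math.sqrt(r_squared)


implied = {item: implied_correlation(share) for item, share in variance_explained.items()}

for item, r in implied.items():
    print(f"{item}: implied r is about {r:.2f}")
```

Under that assumption, each of the three global items correlates above .70 with the criterion measure.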
Reliability
For student-rating items, reliability refers most
often to consistency or interrater agreement
(i.e., within a given class do the students tend to give
similar ratings on a given item). Reliability varies
depending upon the number of raters: the more raters,
the more reliable. For example, with the IDEA system
(Sixbury and Cashin 1995a), the median reliabilities
(intraclass correlations) for the 38 items are:
| For 10 raters | .69 |
| For 20 raters | .83 |
| For 30 raters | .88 |
| For 40 raters | .91 |
Similar or high
reliabilities are typically found with other well-designed
forms. As a rule of thumb, I recommend that items with
fewer than ten raters (reliabilities below .70) be
interpreted with particular caution.
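
The way reliability rises with the number of raters resembles what the Spearman-Brown formula predicts. That formula is a standard psychometric result, but the excerpt does not name it, so applying it here is an assumption made purely for illustration. The sketch below backs out a single-rater reliability from the 10-rater figure and projects it to larger classes; the function names are hypothetical.

```python
def single_rater_reliability(r_k: float, k: int) -> float:
    """Invert Spearman-Brown: reliability of one rater implied by r_k for k raters."""
    return r_k / (k - (k - 1) * r_k)


def projected_reliability(r_1: float, n: int) -> float:
    """Spearman-Brown projection of reliability for n raters."""
    return n * r_1 / (1 + (n - 1) * r_1)


# Median reliabilities reported above for the IDEA items.
reported = {10: 0.69, 20: 0.83, 30: 0.88, 40: 0.91}

# Anchor the projection on the 10-rater value (an arbitrary illustrative choice).
r_1 = single_rater_reliability(reported[10], 10)

for n, r in reported.items():
    print(f"{n} raters: reported {r:.2f}, projected {projected_reliability(r_1, n):.2f}")
```

Anchored this way, the projections stay within about .02 of the reported medians, which is consistent with the statement that more raters yield more reliable ratings, with diminishing gains as class size grows.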
Stability
is concerned with agreement between raters over time.
In general, ratings of the same instructor tend to be
similar over time (Braskamp and Ory 1994; Centra 1993). A
longitudinal study (Overall and Marsh 1980) compared
end-of-course ratings with ratings by the same students
years later (at least one year after graduation). The
average correlation was .83.
Generalizability
is concerned with how confident we can be that our data
accurately reflect the instructor's general
teaching effectiveness, not just [effectiveness] for a
particular course that term.
[In a study of 1,364
courses, comparing instructors' ratings when teaching
the same and different courses (a correlation of .71
across courses)] Marsh (1982) concluded that the
instructor, not the course, is the primary determinant of
the student-rating items.
When making
personnel decisions, we want to use the data to make
judgments about the instructor's general
teaching effectiveness. When considering student ratings
(remembering that we need other kinds of information
beyond student ratings), the following seem to be
reasonable rules of thumb. If the instructor teaches only
one course, consistent ratings from two different terms
may be sufficient. For most instructors, however, use
ratings from a variety of courses, for two or more
courses from every term for at least two years, totaling
at least five courses. If there are fewer than 15 raters
in any of the classes, data from additional classes are
recommended.
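
The rules of thumb above for instructors who teach several courses can be written as a simple checklist. The sketch below is one possible reading, not a procedure given in the source: the record fields, the function name, and the use of distinct calendar years as a stand-in for "at least two years" are all assumptions. It does not cover the single-course case.

```python
from collections import Counter
from typing import NamedTuple


class ClassRating(NamedTuple):
    year: int     # year the class was taught (hypothetical field)
    term: str     # e.g. "fall" or "spring" (hypothetical field)
    course: str
    raters: int   # number of students who completed the form


def check_rating_sample(classes: list[ClassRating]) -> list[str]:
    """Return concerns about the sample; an empty list means the rules of thumb are met."""
    concerns = []
    if len(classes) < 5:
        concerns.append("fewer than five courses in total")
    if len({c.year for c in classes}) < 2:
        concerns.append("ratings cover fewer than two years")
    per_term = Counter((c.year, c.term) for c in classes)
    if any(count < 2 for count in per_term.values()):
        concerns.append("some terms contribute fewer than two courses")
    if any(c.raters < 15 for c in classes):
        concerns.append("some classes have fewer than 15 raters")
    return concerns


sample = [
    ClassRating(1993, "fall", "Course A", 32),
    ClassRating(1993, "fall", "Course B", 18),
    ClassRating(1994, "spring", "Course A", 12),
    ClassRating(1994, "spring", "Course C", 25),
    ClassRating(1994, "fall", "Course B", 20),
]
print(check_rating_sample(sample))
```

For the made-up sample, the checklist flags the term with only one rated course and the class with 12 raters, which is where the text recommends gathering data from additional classes.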
Validity
In educational measurement, the basic question concerning
validity is: does the test measure what it is supposed to
measure? For student ratings this translates into: to what
extent do student-rating items measure some aspect of
teaching effectiveness? Unfortunately, there is not an
agreed-upon definition of "effective teaching"
or any single, all-embracing criterion. The best one can
do is try various approaches, collecting data that either
support or contest the conclusion that student ratings
reflect effective teaching.
Approach One: Student Learning
Other things being equal, the students of more
effective teachers should learn more. In the typical
study, different instructors teach different sections of
the same course, using the same syllabus and textbook, and,
most importantly, using the same external final
exam, i.e., an exam developed by someone other
than the instructors. Cohen (1981) and Feldman (1989b)
reviewed these studies. Using the students' grades on
the external exam as the measure of student learning, they
examined correlations between the exam grade and various
student-rating items. [Some] of the average correlations
are given below:
| Student ratings of | Cohen (1981) | Feldman (1989b) |
| Achievement or learning | .47 | .46 |
| Overall course | .47 | |
| Overall instructor | .44 | |
| Teacher skill dimension | .50 | |
| Understandableness | | .56 |
| Teacher availability | | .36 |
| Encouraging discussion | | .36 |
Note on
interpreting validity correlations: Earlier I
suggested as a rule of thumb that reliability
correlations of at least .70 (at least 10 raters)
were desirable. However, in the social sciences validity
correlations above .70 are unusual, especially if
studying complex phenomena, such as student learning. As a
rule of thumb, I suggest that student-rating validity
correlations between .00 and .29, even when statistically
significant, are not practically useful. Correlations
between .30 and .49 are practically useful. Correlations
between .50 and .70 are very useful but not common when
studying complex phenomena.
Using this rule
of thumb, the [correlations reported above] are generally
useful. These relationships tend to support the validity
of student ratings because the classes in which the
students gave the instructor higher ratings tended to be
the classes where the students learned more, i.e.,
scored higher on the external exam. On the other hand, the
correlations are far from perfect, in part because many of
the variables that relate to students' learning will
be related to student characteristics (e.g.,
motivation or ability), not to instructor characteristics.
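
The rule of thumb for validity correlations can be applied mechanically to the averages reported in the table above. The sketch below is only an illustration: the function name and the cutoff handling (values of .50 and above are all labeled "very useful") are assumptions, and the correlations are copied from the table.

```python
def practical_usefulness(r: float) -> str:
    """Label a validity correlation using the rule of thumb stated in the text."""
    if r < 0.30:
        return "not practically useful"
    if r < 0.50:
        return "practically useful"
    return "very useful"  # the text notes that values above .70 are unusual


# Average correlations between student-rating items and external-exam scores.
cohen_1981 = {
    "achievement or learning": 0.47,
    "overall course": 0.47,
    "overall instructor": 0.44,
    "teacher skill dimension": 0.50,
}
feldman_1989b = {
    "achievement or learning": 0.46,
    "understandableness": 0.56,
    "teacher availability": 0.36,
    "encouraging discussion": 0.36,
}

labels = {
    (source, item): practical_usefulness(r)
    for source, table in (("Cohen 1981", cohen_1981), ("Feldman 1989b", feldman_1989b))
    for item, r in table.items()
}

for (source, item), label in labels.items():
    print(f"{source:14} {item:25} {label}")
```

Every tabled correlation lands in the "practically useful" or "very useful" band, which matches the text's reading that the correlations are generally useful.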
Approach Two: Instructors' Self Ratings
In a review of the literature, Feldman (1989a)
cites 19 studies which correlated instructors' self
ratings with student ratings. The average correlation was
.29. However, in one study (Marsh et al. 1979) instructors
were asked to rate two different courses in order
to see if the course the instructor rated higher was also
rated higher by the students. The median correlation (based
on six factor scores) between the instructors'
self ratings and the students' ratings was .49.
In a later report (Marsh and Dunkin 1992) using nine
factor scores, the median was .45. Such studies provide
further support for the validity of student ratings.
Approach Three: The Ratings of Others
If one is willing to grant that the ratings of
administrators, colleagues, alumni, and others have some
validity (and, excepting alumni, that these ratings
are independent of feedback from students), then
student ratings share that validity.
Administrators' Ratings: Student ratings correlate with
administrators' ratings, ranging from .47 to .62
(Kulik and McKeachie 1975), but Feldman (1989a), using
global items, found a lower average correlation of .39.
Colleagues' Ratings: Student ratings correlate with colleagues'
ratings, .48 to .69 (Kulik and McKeachie 1975); Feldman
(1989a) found an average of .55. Marsh and Dunkin (1992)
question the usefulness of colleagues' ratings based
on classroom visitation because such ratings tend to
be unreliable.
Some faculty
question whether the students have an appropriate
conception of what effective teaching is. In a review of
31 studies, Feldman (1988) found that the students'
view of effective teaching was very similar to the faculty's
view (average correlation equaled .71).
Alumni Ratings: Student ratings correlate with alumni
ratings, .40 to .75 (Overall and Marsh 1980; Braskamp and
Ory 1994). Feldman (1989a) found an average correlation of
.69. This belies the conventional wisdom that the students
will come to appreciate our teaching after they get into
the real world as working adults.
Variables
Not Requiring Control
Despite widespread faculty concern, the research has
uncovered relatively few variables that correlate with
student ratings but are not related to
instructional effectiveness. Generally the following
variables tend to show little or no relationship
to student ratings:
Age and teaching experience: in general, age and years of
teaching experience are not correlated with student
ratings. However, where small differences have been found,
they tend to be negative, i.e., older faculty receive lower
ratings (Feldman 1983). [Marsh and Hocevar (1991)
conducted a] longitudinal study analyzing student ratings
of the same instructors for as long as 13 years. They
found no systematic changes over the years.
Gender of instructor: in a review of 14 laboratory or
experimental studies, e.g., where students rated
descriptions of fictitious teachers, Feldman (1992) found
no differences in global ratings in the majority of
studies, but in a few studies the male teachers received
higher ratings. In a second review of 28 studies of actual
ratings of real teachers reporting global ratings, he
(Feldman 1993) found a very slight average difference in
favor of women teachers (r = .02). However, a few studies
raised the question of whether women faculty had to do
more of what was being rated to obtain the same
ratings as men.
Race: Centra
(1993) points out that there have been hardly any studies
of the race of the instructor. In a doctoral dissertation
using IDEA, Li (1993) found no difference in the global
ratings of Asian students compared to American students of
their (presumably Caucasian) instructors.
Research productivity: this has little correlation with student
ratings (Centra 1993). In his review of the literature,
Feldman (1987) found the average correlation between
research productivity and overall teaching effectiveness
items to be .12, a very low correlation [that]
suggests that research productivity is indicative neither
of good teaching nor bad teaching.
Usefulness
of Student Ratings
Many faculty will grant the usefulness of student ratings for improvement,
preferring to rely on students' open-ended comments.
Cohen (1980) performed a meta-analysis of 17 studies of
the effect of student-rating feedback on improving
teaching. Receiving feedback about student ratings
administered during the first half of the term was positively
related to improving college teaching as measured by
student ratings administered at the end of the term.
In conclusion, if
an institution really intends to use student ratings to
improve teaching, it needs to provide some kind of
consultation [or early feedback] to the instructors.
Conclusion
There are probably more studies of student ratings than
of all the other data used to evaluate college teaching
combined. Although one can find individual studies that
support almost any conclusion, for a number of variables
there are enough studies to discern trends. In general,
student ratings tend to be statistically reliable, valid,
and relatively free from bias or the need for control,
probably more so than any other data used for evaluation.
Nevertheless, student ratings are only one source of data
about teaching and must be used in combination with
multiple sources of data if one wishes to make a judgment
about all of the components of college teaching. Further,
student ratings are data that must be interpreted. We
should not confuse a source of data with the evaluators
who use student-rating data, in combination with other
kinds of data, to make their judgments about an
instructor's teaching effectiveness.
Full
Citation for this Resource
Cashin, W. E. (1995). Student ratings of teaching: The
research revisited. IDEA Paper No. 32. Manhattan, KS:
Center for Faculty Evaluation and Development, Division of
Continuing Education, Kansas State University. Full text
is available at http://idea.ksu.edu.
The
full list of references cited in
this excerpt can be found on the AAA web site at
AAA-edu.org (click on Publications) or in hardcopy by
contacting the office at (941) 921-7747.