Scaled scores in the classroom: Use or misuse?

A friend of mine is a professor at a top-tier university. As at most universities, large or small, top-tier or third-ranked, grade inflation is a problem there. Harvard University, for example, has reported that most of its students have an A average in their classes. The problem is repeating itself all over the country.

At the university where my friend teaches, the administration has decreed that the grades for each class must have a B average, or a 3.0 on a 4.0 scale. This means that, if all the grades are As, Bs, and Cs, there must be an equal number of As and Cs (though this is an oversimplification). The only way to accomplish this is to normalize, or scale, the scores to define the median score as a B. Normalization puts the scores into a frequency pattern that attempts to mimic the traditional bell-shaped curve, with half of the scores above the middle, and half the scores below it.
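
To make the arithmetic concrete, here is a minimal Python sketch. The grade points (A = 4, B = 3, C = 2) follow the 4.0 scale described above; the class sizes are made up purely for illustration.

```python
# Grade points on a 4.0 scale, restricted to the three grades discussed above.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0}

def class_gpa(counts):
    """Return the class average given a dict of letter grade -> number of students."""
    total_students = sum(counts.values())
    total_points = sum(GRADE_POINTS[grade] * n for grade, n in counts.items())
    return total_points / total_students

# Hypothetical classes of 45 students each.
balanced = {"A": 10, "B": 25, "C": 10}   # equal numbers of As and Cs
inflated = {"A": 20, "B": 20, "C": 5}    # more As than Cs

print(class_gpa(balanced))  # 3.0
print(class_gpa(inflated))  # above 3.0 (about 3.33)
```

Each A sits one point above a B and each C one point below, so the average lands on exactly 3.0 only when the As and Cs cancel out. Adding Ds and Fs changes the bookkeeping, which is why the equal-numbers rule is an oversimplification.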

In theory, scaling scores is one way to solve the problem of grade inflation. For it to work, however, several assumptions must hold.

The first assumption is that the students are average, or fall in a normal distribution (bell-shaped curve). In a highly selective school (such as the one where my friend teaches), the students’ abilities are not distributed on a bell-shaped curve. If we were to map the IQ, SAT, or GRE scores for a group of students at such a school, the scores would be bunched at the high end of the scale, with only a thin tail trailing off toward the low end (in statistical terms, skewed to the left). So the problem facing the professor is how to discriminate among students who are all performing exceptionally well.

The next assumption is related to the first one. To scale scores, it is necessary to assume that the students’ performance has a normal distribution (the scores are spread symmetrically around the middle, in something close to a bell-shaped curve). If the students did unusually well, or the teacher’s incredible skill resulted in a fantastic learning environment, so that most of the students demonstrated mastery of the material, then it will be difficult to scale the scores (put them in a normal distribution).
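
A small sketch shows what happens when a rank-based curve is forced onto a class that has mastered the material. Everything here is hypothetical: the 50-question test, the twenty scores, and the rule that the top quarter gets an A and the bottom quarter a C are assumptions chosen for illustration, not any professor’s actual method.

```python
def rank_cutoffs(scores, share=0.25):
    """Return (a_cutoff, c_cutoff) so that roughly the top `share` of the
    class earns an A and the bottom `share` earns a C. Tied scores can make
    the groups slightly larger than `share`."""
    ranked = sorted(scores, reverse=True)
    k = int(len(ranked) * share)
    a_cutoff = ranked[k - 1]             # lowest score that still earns an A
    c_cutoff = ranked[len(ranked) - k]   # highest score that earns a C
    return a_cutoff, c_cutoff

def letter(score, a_cutoff, c_cutoff):
    """Assign a letter grade from the two cutoffs."""
    if score >= a_cutoff:
        return "A"
    if score <= c_cutoff:
        return "C"
    return "B"

# Hypothetical results on a 50-question test: everyone scored 90% or better.
scores = [50, 50, 49, 49, 49, 48, 48, 48, 48, 48,
          47, 47, 47, 47, 47, 46, 46, 46, 45, 45]

a_cutoff, c_cutoff = rank_cutoffs(scores)
grades = [letter(s, a_cutoff, c_cutoff) for s in scores]

print("A cutoff:", a_cutoff, "C cutoff:", c_cutoff)
for g in "ABC":
    print(g, grades.count(g))
```

In this made-up class, a student who answered 46 of 50 questions correctly (92%) gets a C, and only three questions separate that C from an A. The curve produces the required B average, but it does so by magnifying differences that are close to noise.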

The third assumption, that the playing field is level and no student has an unfair advantage, is a bit of a stretch, but I have direct experience with it. I was in a pre-med biology course that was known to be a “screening” course, and the professor was a crusty old man who had assigned seats (all 200 of them) and took attendance every day. And he used the same tests from semester to semester, at a university with an honor code. The only problem was that the honor code didn’t cover reused tests. The professor numbered the tests in red ink, and kept track to make sure he got every test back each test day. And three days after a test, he posted the results, with scaled score cutoffs for each letter grade.

But it was well known that some student fraternal organizations had file cabinets full of old tests. To make matters worse, because the professor’s tests were so old and textbooks had been revised, there were questions on each test that were neither in the book nor covered in class. This meant that fraternity members who had access to the test questions had a clear advantage: because the scores were scaled, correctly answering questions that no one else could answer put them higher on the “curve.”

I talked with the professor about that problem. His response was that it would be cheating to have copies of the test (though it was not, under the honor code), and that he had carefully controlled the test distribution, so no one could possibly have taken a copy from the testing room. I pointed out that it would be easy for each person in a group to memorize, say, five questions, and then pool what they remembered. The really tough part for me involved the questions that were no longer in the text or covered in class. Knowing the answers to these gave “extra credit” to those who were in the know, crucial extra points that gave them an unfair advantage. (When scores are scaled, the difference between an A and a B can be just a couple of questions.)

The professor didn’t budge; to him, it was impossible that his test security had been breached. He noted that it didn’t matter if test questions hadn’t been covered, because each student had the same chance of getting them right. And he dutifully continued to scale the scores, oblivious to what was going on. Luckily, the next semester we had a visiting professor, who was exciting and inventive. His tests were freshly constructed each semester from a pool of questions he had written on index cards, and because the professor was new to us, no one had an unfair advantage.

These three assumptions undergird another, more philosophical, assumption about a school’s mission: scaling scores assumes that students must compete against each other to be graded. In other words, it is assumed that mastery of the material is not an appropriate criterion, even for a methods-based course (such as the ones my friend teaches). From this perspective, scaling of scores is a zero-sum game, in which the number of losers equals the number of winners: the number of Cs equals the number of As. Is this what education is about?

Perhaps scaling scores is useful when it is necessary to “weed out” students who are not well-suited to a particular profession, or when it is helpful to identify those with the best and very-best potential (such as in the hierarchical caste system used in medical schools). But when the vast majority of those in a program are excellent students, or when the playing field is not level, scaling of scores can be counterproductive.
