# What are scaled scores? Uses—and misuses

A friend of mine was taking two certification tests for a well-known technology certification program. He studied for months and passed both of them, though, he noted, “just barely.” I asked to see the printouts.

One word on the certificate caught my eye. Although it was buried in the second sentence, it was the first word I read on the form: “Your scaled score….” While studying for the tests and taking sample exams, my friend had aimed to get 85% of the answers right, because he had read, somewhere in the instructions, that he needed 85% correct in order to pass. In fact, he had to be in the 85th percentile in order to pass. That’s a far different number. While 85% correct gives some idea of the level of mastery of the material, the 85th percentile describes where a score falls relative to the people who initially took the test, the group on which the test was normed.

His certificate showed that the scores were scaled between 100 and 900, with a mean of 500. A score of 700 was necessary to pass, and he got a 720. But a scaled score is one that has been normalized; that is, the number of correct answers (known as the raw score) needed to get a given scaled score is “adjusted” according to the percentage of test-takers who got that number of questions right. So we cannot say he “barely passed,” because we have no way of knowing how many questions he answered correctly compared to the cutoff. I pointed out that he did quite well: his score was better than that of roughly 5 out of 6 people taking the test. In my book, he earned an “A+.”
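The gap between percent correct and percentile can be made concrete with a small sketch. The norming sample below is invented for illustration (20 hypothetical raw scores on a 50-question test), and `percentile_rank` is a helper written for this example, not anything the certification program publishes.

```python
from bisect import bisect_left

def percentile_rank(raw_score, norming_scores):
    """Percentage of the norming sample that scored strictly below raw_score."""
    ordered = sorted(norming_scores)
    below = bisect_left(ordered, raw_score)  # count of scores < raw_score
    return 100.0 * below / len(ordered)

# Hypothetical norming sample: number correct out of 50 for 20 people.
norming_sample = [18, 20, 21, 22, 23, 24, 24, 25, 26, 26,
                  27, 28, 28, 29, 30, 31, 32, 33, 35, 38]

raw = 33                                  # 33 of 50 questions correct
percent_correct = 100.0 * raw / 50        # mastery-style number
rank = percentile_rank(raw, norming_sample)  # position-in-group number

print(f"{percent_correct:.0f}% correct, percentile rank {rank:.0f}")
# prints "66% correct, percentile rank 85"
```

In this made-up sample, someone who answers only 66% of the questions correctly already sits at the 85th percentile, which is why the two numbers cannot be used interchangeably.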

There are two problems here. The first is that the certification agency is not providing a criterion for mastery of the material. The test was validated by giving it to a sample group of test-takers (called a norming sample), and the scores of everyone who takes the test are scaled in relation to that group.

As more people gain mastery of these computer concepts, the bar is going to be raised, arbitrarily, higher and higher. Only about 15% of those who take the test will pass, by definition. This criterion has nothing to do with mastery, only with ensuring that the testing company will have repeat business! If only about 1 in 6 people passes the first time, then most people will have to take the test multiple times to get certified.
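A rough back-of-the-envelope sketch shows what a 15% pass rate implies for repeat attempts. It assumes, purely for illustration, that every attempt is independent and has the same 15% chance of passing; real test-takers learn between attempts, so this likely overstates the number of retakes.

```python
def expected_attempts(pass_rate):
    """Mean number of attempts until the first pass (geometric distribution)."""
    return 1.0 / pass_rate

def chance_of_passing_within(attempts, pass_rate):
    """Probability of at least one pass in the given number of attempts."""
    return 1.0 - (1.0 - pass_rate) ** attempts

pass_rate = 0.15  # assumption: only the top 15% pass on any given attempt

mean_tries = expected_attempts(pass_rate)
within_three = chance_of_passing_within(3, pass_rate)

print(f"Expected attempts: {mean_tries:.1f}")                      # 6.7
print(f"Chance of passing within 3 attempts: {within_three:.0%}")  # 39%
```

Under these simplified assumptions, a typical candidate would need six or seven sittings, and fewer than half would be certified after three.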

The second problem is that a person studying for the test has no idea how much of the material must be learned, that is, what level of mastery is required for certification. A passing score for certification is defined only by the group who served to standardize the test. Students who take the sample tests are given a raw score (percentage of correct answers), but no guidelines as to how close they are to passing the test. And the lay person doesn’t know what “scaled scores” are. Therefore, most people taking the test assume that “85th percentile” means that they have to get 85% of the questions correct to pass. In fact, I asked several technology instructors about this, and they thought it was necessary to get 85% correct to be certified!

What are the assumptions here? One is that mastery of material is defined by the population who standardize the test, and is only indirectly related to the level of understanding of the material. If a group of very knowledgeable people was used to standardize the test scores, the bar would be set high indeed, as only 1 out of 6 people who had that level of knowledge would pass! If a group of people who had poor skills was the norming sample, it would be meaningless to get certified.
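The dependence on the norming sample can be sketched as follows. Both samples are invented, and the cutoff rule (the nearest-rank 85th percentile of the sample's raw scores) is an assumption made for this example; the actual certification program may compute its cutoff differently.

```python
import math

def cutoff_raw_score(norming_scores, percentile=85):
    """Nearest-rank percentile: the raw score at the given percentile of the sample."""
    ordered = sorted(norming_scores)
    rank = math.ceil(len(ordered) * percentile / 100)  # 1-based rank
    return ordered[rank - 1]

# Two hypothetical norming samples (raw scores out of 50).
expert_sample = [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
novice_sample = [10, 12, 14, 15, 17, 18, 20, 22, 25, 30]

candidate_raw = 35  # the same performance, judged against each sample

expert_cutoff = cutoff_raw_score(expert_sample)
novice_cutoff = cutoff_raw_score(novice_sample)

for name, cutoff in [("expert", expert_cutoff), ("novice", novice_cutoff)]:
    verdict = "pass" if candidate_raw >= cutoff else "fail"
    print(f"{name} norming sample: cutoff {cutoff}, candidate would {verdict}")
```

The same 35 out of 50 fails against the expert sample (cutoff 48) and passes comfortably against the novice sample (cutoff 25), even though the candidate's knowledge is identical in both cases.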

The second assumption is that certification is relevant only in relation to others who have taken the test. The third assumption is that no cheating occurs. If, for example, answers to the hardest 10 questions were passed around among a group of trainees, or if one study guide provided those answers and no other did, then some groups would have a definite advantage and, if their scores were included in the norming sample, could throw off the standardization.

Scaled scores are rarely appropriate for demonstrating mastery of the material when the passing criterion is based solely on where a person falls in the distribution. This is especially true when only a few are allowed to pass the test.

On the other hand, medical board scores (e.g., the STEP tests) are also scaled, but the “passing” score allows many more of those taking the tests to “pass.” Medical boards are also used for placement in residencies, so those who get the very highest scores are favored for the “best” residencies. This dual use of score reporting is much more appropriate.

Another example: for one medical recertification board exam, 85% of those taking the test passed, according to the board’s published figures. (This percentage varies from year to year, and is not fixed, but it is very close to 85%.) These tests are taken by doctors who have practiced in their field for years, and are designed to identify doctors who may need additional training to maintain their skills. However, it should be noted that doctors are not eliminated solely based on their percentile (score in relation to others). Rather, these figures vary from year to year with this board, implying that a certain level of mastery is set, and a varying number of examinees don’t meet the standard from year to year.

So the final assumption is one we take for granted: that institutions have a valid reason, a scientific basis, for using scaled scores. For the certification exam mentioned earlier, I’m not so sure.
