You’ve probably heard of score scaling and may have wondered how it works and what purpose it serves. The process of scaling is simpler than you may think, and understanding it further might help you determine whether it can be helpful for your program.
What is Scaling?
For certification examinations, scores are often initially computed on a raw scale (sometimes called the true score scale), where a total score is just the number of items a candidate answered correctly, and each point represents one correct answer. Colloquially, we use the term scaling (or “scaled scoring”) to describe converting raw scores to a different scale, typically for reporting scores to candidates. So, technically, scaled scoring is not a different kind of scoring process; it is a different way of presenting scores.
In the example of scaling in Figure 1, a 150-item examination with a raw cut score of 100 is converted to a scale of 0 to 300 with a scaled cut score of 200. Scaling does not change the length, difficulty, or number of items answered correctly required to pass the examination. Similar to converting from inches to centimeters, scaling just changes the scale on which scores are reported.
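As a rough illustration of the conversion described above, here is a minimal Python sketch. It assumes the conversion is a simple linear one in which both scales start at 0, which is consistent with the numbers quoted (100 out of 150 becomes 200 out of 300); the method behind Figure 1 is not stated here, and the function name `scale_score` is hypothetical.

```python
def scale_score(raw, raw_max=150, scaled_max=300):
    """Linearly convert a raw score to the reporting scale.

    Assumption: both scales start at 0, so the conversion is a
    single multiplication, like converting inches to centimeters.
    """
    if not 0 <= raw <= raw_max:
        raise ValueError("raw score is outside the range of the examination")
    return raw * scaled_max / raw_max


raw_cut = 100
scaled_cut = scale_score(raw_cut)

# Every candidate gets the same pass/fail decision on either scale.
same_decisions = all(
    (raw >= raw_cut) == (scale_score(raw) >= scaled_cut)
    for raw in range(151)
)
print(scaled_cut, same_decisions)
```

Under these assumptions the raw cut score of 100 lands on a scaled cut score of 200, and the pass/fail decision for every possible raw score is unchanged. Only the reported number differs.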
When Is Scaling Useful in Scoring Your Certification Examination?
Any examination can be scaled, but some circumstances may make scaling a more attractive option than reporting raw scores. Here are a few situations where scaling may be preferable:
When the raw cut score changes from form to form. This may happen if two assembled forms could not be made parallel in terms of difficulty, or if equating is conducted after an administration window. By placing all forms of an examination on the same scale, the credentialing organization can provide a consistent means of score interpretation. Furthermore, different cut scores for different forms of an exam can sometimes confuse candidates (e.g., two candidates get the same raw score on different forms, but one candidate passes and one fails). Though outcomes may be psychometrically and statistically justified, they can have the appearance of being unfair; scaling can help mitigate such confusion.
When the credentialing body offers several credentials and examinations. Many credential-granting organizations have more than one credential, and the examination lengths and cut scores often differ across programs. Placing examination scores for all programs on the same scale may present a more cohesive face of the programs to candidates. This may also allow for more consistent explanations of scoring and scaling in the respective candidate handbooks.
When the examination scores are not based on the number of correct answers. This can be the case when employing Item Response Theory (IRT) for scoring. Without getting too much into the weeds, some IRT models provide an estimate of candidate ability on a scale that is not terribly conducive to meaningful candidate interpretation (e.g., -3 logits to +3 logits). In the case of computer adaptive testing (CAT), scaling the ability estimates is a necessity because different candidates receive different items, and often a different number of items, so a count of correct answers is not comparable from one candidate to the next.
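The first and third situations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how any particular program scales its scores. For forms with different raw cut scores, it assumes a two-segment linear conversion that pins each form's raw cut score to the same scaled cut score. For IRT, it assumes a linear transformation of the ability estimate, clipped to the reporting range. The function names, the second form's raw cut score of 97, and the slope and intercept values are all hypothetical.

```python
def scale_piecewise(raw, raw_cut, raw_max, scaled_cut=200, scaled_max=300):
    """Map 0 -> 0, raw_cut -> scaled_cut, and raw_max -> scaled_max.

    Assumption: two straight-line segments that meet at the cut score,
    so every form reports the same scaled cut regardless of its raw cut.
    """
    if not 0 <= raw <= raw_max:
        raise ValueError("raw score is outside the range of the form")
    if raw <= raw_cut:
        return scaled_cut * raw / raw_cut
    above_cut = (raw - raw_cut) / (raw_max - raw_cut)
    return scaled_cut + (scaled_max - scaled_cut) * above_cut


def scale_theta(theta, slope=50.0, intercept=150.0, low=0.0, high=300.0):
    """Linearly rescale an IRT ability estimate (in logits) and clip it.

    Assumption: slope and intercept are illustrative values chosen so
    that -3 logits maps to 0 and +3 logits maps to 300.
    """
    return min(high, max(low, slope * theta + intercept))


# Two hypothetical forms of a 150-item examination with different raw cuts.
form_a_cut, form_b_cut = 100, 97

# The same raw score of 98 fails on form A and passes on form B, but
# both results are judged against the same scaled cut score of 200.
score_a = scale_piecewise(98, form_a_cut, 150)
score_b = scale_piecewise(98, form_b_cut, 150)
print(score_a < 200, score_b >= 200)

# Ability estimates from -3 to +3 logits land on the 0 to 300 scale.
print(scale_theta(-3.0), scale_theta(0.0), scale_theta(3.0))
```

In this sketch the two candidates still receive different outcomes, as they would on the raw scale, but each sees a score reported against a single published cut score of 200 rather than a form-specific raw cut.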
Scaling can be a fairly straightforward conversion of raw scores, and it can help establish consistency in score reporting. In providing that consistency, scaling allows for better management of score interpretation, which may be desirable for credentialing organizations with multiple forms of an exam, multiple programs, or IRT-based scoring.