CAT 2012 Exams: Development and Scoring
About Test Difficulty
The CAT exam is designed to accurately identify top-performing candidates, and to that end it uses a scaled score range of 0 to 450. To separate the top performers reliably, the CAT is, by design, a very difficult exam. As would be expected of such a difficult exam, no candidate is likely to answer 100% of the items correctly or to achieve the top theoretical score. The design nonetheless accomplishes its goal: the top-performing candidates are, indeed, ranked at the top of the list. If the exam were substantially easier, it would be theoretically possible for a candidate to achieve a score of 450; however, an exam constructed to be that easy would not serve the distinct purposes of the IIMs.
Fairness and Equivalency in IIM Exams
A significant number of examination forms are used on behalf of the IIMs to evaluate the large candidate population. With the use of multiple forms comes the need to ensure fairness and equivalency of the examinations used for assessment. A post-equating process is necessary to ensure validity and fairness.
Equating is a psychometric process that adjusts for differences in difficulty so that scores from different test forms are comparable on a common metric and therefore fair to candidates testing across multiple days. The equating process was designed with three phases: exam creation, post-equating, and scaling.
Each form contains a pre-defined number of statistically profiled questions selected from a large bank. These questions form an equating block within a form, which can be used as an anchor to rescale candidates’ scores to the metric of the item bank. This rescaling adjusts for differences in form difficulty, taking into account candidates’ differential performance on the equating block. As a result, candidates’ rescaled scores can be placed and compared on the common metric regardless of which form they take.
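The anchor-based rescaling described above can be sketched with a simple mean-sigma style adjustment. This is an illustrative sketch only: the function, the bank statistics, and all score data are invented for the example and do not represent the actual procedure used for the CAT.

```python
# Illustrative mean-sigma rescaling using an equating block as an anchor.
# Candidates' form scores are placed on the item bank's metric by matching
# the anchor block's mean and standard deviation to their values on the
# bank's metric. All numbers here are invented for illustration.
from statistics import mean, stdev

def mean_sigma_rescale(form_scores, anchor_scores, bank_anchor_mean, bank_anchor_sd):
    """Linearly transform form_scores so the anchor block's mean and SD
    in this candidate group match their values on the bank's metric."""
    a = bank_anchor_sd / stdev(anchor_scores)       # slope
    b = bank_anchor_mean - a * mean(anchor_scores)  # intercept
    return [a * s + b for s in form_scores]

# Candidates' raw scores on one form, and their scores on its anchor block.
form_scores = [52, 61, 70, 45, 66]
anchor_scores = [10, 12, 15, 8, 13]
rescaled = mean_sigma_rescale(form_scores, anchor_scores,
                              bank_anchor_mean=12.0, bank_anchor_sd=2.0)
```

Because the transformation is linear with a positive slope, candidates keep their rank order; only the metric changes.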
This approach supports equating without significantly compromising the security of the items. The second phase of the process is post-equating. In this phase, items are concurrently analyzed and the estimated item parameters (item difficulty and item discrimination) are placed on a common metric. Item Response Theory (IRT), a psychometrically supported statistical model, is utilized in this process. The result is a statistically equated raw score that takes into account the performance of the candidate along with the difficulty of the form administered.
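Since post-equating estimates both item difficulty and item discrimination, the underlying IRT model is plausibly of the two-parameter logistic (2PL) family. The sketch below shows how a 2PL model relates a candidate's ability to the expected raw score on a form; the item parameters are invented, and the source does not specify which IRT model the CAT actually uses.

```python
# Illustrative 2PL IRT model: two parameters per item (discrimination a,
# difficulty b), matching the parameters named in the text. All item
# parameters below are invented for the example.
import math

def p_correct_2pl(theta, a, b):
    """Probability that a candidate with ability theta answers correctly
    an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_raw_score(theta, items):
    """Expected number-correct on a form given its item parameters --
    the quantity that lets forms of different difficulty be compared."""
    return sum(p_correct_2pl(theta, a, b) for a, b in items)

# Two hypothetical forms of unequal difficulty: the same ability yields
# different expected raw scores, which is exactly what equating adjusts for.
easy_form = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0)]
hard_form = [(1.0, 0.5), (1.2, 1.0), (0.8, 1.5)]
theta = 0.0
```

Here a candidate of average ability (theta = 0) has a higher expected raw score on the easier form, so comparing unadjusted raw scores across forms would be unfair; placing parameters on a common metric removes that artifact.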
Once post-equating has produced an equated raw score, the scores are scaled to make them easier for candidates to interpret. Scaling can be done using a linear or non-linear transformation of the original equated number-correct score. Although the number presented to candidates is placed on a common scale for ease of interpretation, the position of candidates in the score distribution does not change.
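A linear scaling of the kind described could look like the following. The slope and intercept are invented for illustration (the actual transformation is not published here); the point of the sketch is the property stated above, that scaling preserves candidates' positions in the score distribution.

```python
# Illustrative linear scaling of equated raw scores onto a 0-450 reporting
# scale. The slope and intercept are hypothetical placeholders.
def scale_score(equated_raw, slope=3.0, intercept=30.0, lo=0.0, hi=450.0):
    """Map an equated raw score to the reporting scale with a linear
    transformation, clipped to the scale's bounds."""
    scaled = slope * equated_raw + intercept
    return max(lo, min(hi, scaled))

equated = [12.4, 55.0, 88.7, 101.3]
scaled = [scale_score(x) for x in equated]
# A monotone transformation: candidates' rank order is unchanged.
```

Any monotone transformation (linear or not) has this rank-preserving property, which is why the reported scale can be chosen purely for interpretability.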
Lastly, once scaled scores are established, the final step in the scoring process is to rank candidates by their performance. A percentile rank is the percentage of scores that fall below a given score. With the total scaled scores arranged in rank order from lowest to highest and divided into 100 equally sized groups, a table mapping total scaled scores to percentile ranks is created. This ranked list allows candidates to be identified from the highest performers at the very top of the list to the lower performers in the middle and lower end of the scale.
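The percentile-rank definition above (the percentage of scores falling below a given score) can be computed directly. This is a minimal sketch with invented scores; real score-to-percentile tables also involve conventions (e.g. handling of ties and rounding) that the text does not specify.

```python
# Illustrative percentile-rank table from scaled scores, using the
# definition in the text: percentage of scores strictly below a score.
from bisect import bisect_left

def percentile_rank(score, all_scores):
    """Percentage of scores in all_scores that fall below `score`."""
    ordered = sorted(all_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)

# Hypothetical scaled scores for ten candidates on the 0-450 scale.
scores = [210, 340, 340, 405, 180, 290, 450, 120, 300, 365]
table = {s: percentile_rank(s, scores) for s in sorted(set(scores))}
```

With this definition the lowest score always maps to percentile 0, and tied candidates (here the two scores of 340) receive the same percentile rank.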
The test development and equating models outlined here have substantial advantages for candidates. First, they confirm with a high level of psychometric rigor that all examination scores are valid, equitable, and fair. Post-equating accounts for any statistical differences in examination difficulty and ensures that all candidates are evaluated on a common scale. Reporting scores on this statistically equivalent scale ensures that the highest-performing candidates are ranked appropriately at the top end of the scale.