Reliability, Validity, Assessment centers, Interdisciplinary, Medicine


This study examined the reliability and validity of scores from a series of four task simulations used to evaluate medical students. The four role-play exercises represented two different cases or scripts, yielding two pairs of exercises that are considered alternate forms. This design made it possible to examine what is essentially the ceiling for the reliability and validity of ratings taken in such role plays. A multitrait-multimethod (MTMM) matrix was computed with exercises as methods and competencies (history taking, clinical skills, and communication) as traits. The results within alternate forms (within cases) were then used as a baseline to evaluate the reliability and validity of scores between the alternate forms (between cases). There was much less of an exercise effect (method variance, monomethod bias) in this study than is typically found in MTMM matrices for performance measurement. However, the convergent validity of the dimensions across exercises was weak both within and between cases. The study also examined the reliability of ratings by training raters who watched video recordings of the same four exercises and then completed the same forms used by the standardized patients. Generalizability analysis was used to compute variance components for case, station, rater, and ratee (medical student), which allowed the computation of reliability estimates for multiple designs. Both the generalizability analysis and the MTMM analysis indicated that rather long examinations (approximately 20 to 40 exercises) would be needed to create reliable examination scores for this population of examinees. Additionally, interjudge agreement was better for the more objective dimensions (history taking, physical examination) than for the more subjective dimension (communication).
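To illustrate how variance components can translate into a required examination length, the following is a minimal sketch of a decision-study calculation. It assumes a simplified design in which a single pooled error variance is averaged over exercises; the study's actual design separates case, station, and rater components. The function names and all numeric values are hypothetical and are not taken from the study's results.

```python
import math


def g_coefficient(var_person: float, var_error: float, n_exercises: int) -> float:
    """Generalizability coefficient for a mean score over n_exercises.

    Simplifying assumption: all error sources are pooled into one
    per-exercise error variance, which shrinks by 1/n when scores
    are averaged over n exercises.
    """
    return var_person / (var_person + var_error / n_exercises)


def exercises_needed(var_person: float, var_error: float, target: float) -> int:
    """Smallest number of exercises whose mean score reaches the target coefficient."""
    # Solve var_person / (var_person + var_error / n) >= target for n.
    n = target * var_error / ((1.0 - target) * var_person)
    # Small tolerance guards against floating-point noise just above an integer.
    return math.ceil(n - 1e-9)


# Hypothetical variance components (illustrative only, not the study's estimates).
VAR_PERSON = 1.0   # true differences among examinees
VAR_ERROR = 8.0    # pooled per-exercise error variance

if __name__ == "__main__":
    n = exercises_needed(VAR_PERSON, VAR_ERROR, target=0.75)
    print(n, round(g_coefficient(VAR_PERSON, VAR_ERROR, n), 3))
```

Under this simplification the calculation is equivalent to the Spearman-Brown formula: when examinee variance is small relative to per-exercise error, the number of exercises needed for a given coefficient grows quickly.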