Graduation Year

2006

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Measurement and Evaluation

Major Professor

John Ferron, Ph.D.

Keywords

Mode effect, Item bias, Computer-based tests, Differential item functioning, Dichotomous scoring

Abstract

When an exam is administered on two platforms simultaneously, such as paper-and-pencil and computer-based testing, individual items may become more or less difficult in the computer-based (CBT) version than in the paper-and-pencil (P&P) version, possibly shifting the overall difficulty of the test (Mazzeo & Harvey, 1988). Using response data from 38,955 examinees across five forms of the National Nurse Aide Assessment Program (NNAAP) administered in both the CBT and P&P modes, three methods of differential item functioning (DIF) detection were applied to detect item DIF across platforms: Mantel-Haenszel (MH), logistic regression (LR), and the one-parameter logistic model (1-PL). The methods were compared to determine whether they detect DIF equally in all items on the NNAAP forms. Data were reported by agreement of methods, that is, whether an item was flagged by multiple DIF methods. A kappa statistic was calculated to provide an index of agreement between paired methods (LR, MH, and 1-PL) based on the inferential tests. Finally, to determine what impact, if any, the DIF items may have on the test as a whole, the test characteristic curves for each test form and examinee group were displayed.

Results indicated that items behaved differently across modes and that examinees' odds of answering an item correctly were influenced by the mode of test administration for several items, ranging from 23% of the items on Forms W and Z (MH) to 38% of the items on Form X (1-PL), with an average of 29%. The test characteristic curves for each test form were examined by examinee group, and it was concluded that the impact of the DIF items at the test level was not consequential. Each of the three methods detected items exhibiting DIF on each test form (ranging from 14 to 23 items). The kappa statistic demonstrated a strong degree of agreement between paired methods of analysis for each test form and each DIF method pairing, with good to excellent agreement in all pairings. Findings indicated that while individual items did exhibit DIF, there was no substantial impact at the test level.
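To make the analytic approach concrete, the sketch below is a minimal, illustrative Python example, not the dissertation's analysis code: it shows a likelihood-ratio logistic-regression DIF screen for one dichotomous item (testing mode effect and mode-by-ability interaction jointly) and a Cohen's kappa indexing agreement between two methods' DIF-flag vectors. All variable names, the simulated data, and the ability proxy are assumptions for illustration only.

# Hedged sketch: LR-based DIF screen for a single dichotomous item, plus
# Cohen's kappa for agreement between two methods' binary flag vectors.
# Simulated data only; not the NNAAP data or the dissertation's code.

import numpy as np
import statsmodels.api as sm
from scipy import stats


def lr_dif_pvalue(item_correct, total_score, group):
    """Likelihood-ratio test comparing an ability-only logistic model with
    one adding test mode (group) and the mode-by-ability interaction;
    returns the chi-square p-value (2 df) for uniform + nonuniform DIF."""
    ability = (total_score - total_score.mean()) / total_score.std()
    base = sm.Logit(item_correct, sm.add_constant(ability)).fit(disp=0)
    X_full = sm.add_constant(np.column_stack([ability, group, ability * group]))
    full = sm.Logit(item_correct, X_full).fit(disp=0)
    lr_chi2 = 2 * (full.llf - base.llf)
    return stats.chi2.sf(lr_chi2, df=2)


def cohens_kappa(flags_a, flags_b):
    """Chance-corrected agreement between two binary DIF-flag vectors."""
    a, b = np.asarray(flags_a), np.asarray(flags_b)
    p_obs = np.mean(a == b)
    p_exp = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    return (p_obs - p_exp) / (1 - p_exp)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    group = rng.integers(0, 2, n)                 # 0 = P&P, 1 = CBT (illustrative)
    theta = rng.normal(size=n)                    # latent ability
    p = 1 / (1 + np.exp(-(theta + 0.6 * group)))  # item made easier on CBT
    item = rng.binomial(1, p)
    total = theta - theta.min()                   # crude ability proxy
    print("LR DIF p-value:", lr_dif_pvalue(item, total, group))
    print("kappa (example flags):", cohens_kappa([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))

In the dissertation, analogous flag vectors from each pair of methods (MH vs. LR, MH vs. 1-PL, LR vs. 1-PL) would feed the kappa calculation for each test form.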
