Chapter 2 CTT and IRT

In this session, we will discuss the main features of Classical Test Theory (CTT) and Item Response Theory (IRT), and the distinctions between them.

The main reason for administering a test is to find out students’ ability levels in the area being assessed. We may also want to know the characteristics of the test and its items, such as the difficulty levels of the items and whether the test works well in separating students into different levels of ability (i.e., the notions of test reliability and item discrimination).

Both CTT and IRT provide the information mentioned above. We will first discuss CTT methods.

2.1 CTT ability measure

Students’ test scores, or percent correct on a test, are used as measures of student ability under CTT.
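As a minimal sketch of this idea, the code below computes raw scores and percent correct for a small scored response matrix. The matrix `toy` and its values are made up for illustration and do not come from any data set used in this chapter.

```r
# Hypothetical scored responses: 4 students (rows) by 5 items (columns),
# 1 = correct, 0 = incorrect. The values are invented for illustration.
toy <- matrix(c(1, 1, 1, 1, 0,
                1, 1, 0, 0, 0,
                1, 0, 1, 1, 1,
                0, 0, 0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("student", 1:4), paste0("item", 1:5)))

raw_score <- rowSums(toy)           # number of items answered correctly
pct_correct <- 100 * rowMeans(toy)  # the same information as a percentage
data.frame(raw_score, pct_correct)
```

Both columns rank the students identically; percent correct simply rescales the raw score by the number of items.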

2.2 CTT item difficulty

Similar to ability measures, the percentage of students who obtained the correct answer on an item is used as the item’s difficulty measure. Note that a higher percentage indicates an easier item, not a harder one.
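The calculation can be sketched as a column mean of a scored response matrix. The matrix `toy` below is hypothetical and serves only to show the arithmetic.

```r
# Hypothetical scored responses: 4 students (rows) by 5 items (columns).
toy <- matrix(c(1, 1, 1, 1, 0,
                1, 1, 0, 0, 0,
                1, 0, 1, 1, 1,
                0, 0, 0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("student", 1:4), paste0("item", 1:5)))

# Proportion of students answering each item correctly
item_p <- colMeans(toy)
item_p

# Items ordered from hardest (lowest proportion) to easiest
sort(item_p)
```

In this toy example item5 has the lowest proportion correct, so it is the hardest item in the CTT sense.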

2.3 CTT test reliability

CTT reliability shows the degree to which we can trust the rankings of students based on the test. It can be interpreted as the correlation between students’ scores on two similar (parallel) tests.
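The parallel-test interpretation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the student abilities, the item difficulties, and the logistic response rule used to generate the data are all invented here, and the helper `simulate_form` is hypothetical rather than part of any package.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n_students <- 2000
n_items <- 30
theta <- rnorm(n_students)                 # assumed student abilities
delta <- seq(-2, 2, length.out = n_items)  # assumed item difficulties

# Simulate one test form: the chance of a correct answer depends on
# ability minus difficulty (an assumed data-generating rule).
simulate_form <- function(theta, delta) {
  p <- plogis(outer(theta, delta, "-"))
  matrix(rbinom(length(p), size = 1, prob = p), nrow = length(theta))
}

form_a <- simulate_form(theta, delta)
form_b <- simulate_form(theta, delta)  # same students, same difficulties, new responses

# Correlation between total scores on the two parallel forms
parallel_r <- cor(rowSums(form_a), rowSums(form_b))
parallel_r
```

In practice students rarely sit two parallel tests, so reliability is usually estimated from a single administration; the simulation only makes the definition concrete.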

2.4 CTT item discrimination (point-biserial)

For each item, the correlation between students’ total scores and their scores on this item is the item discrimination. It is an important statistic for evaluating the quality of an item. For example, if students randomly guess the answer to an item, then the correlation between the scores on this item and the total scores will be close to 0. A high item discrimination shows that higher ability students tended to obtain the correct answer (score of 1) while lower ability students tended to obtain the incorrect answer (score of 0). Consequently, it shows that the item is highly related to the total score, and that the item reflects the construct we are measuring.

It is important to keep the concepts of item difficulty and item discrimination separate: difficulty describes how many students answer an item correctly, while discrimination describes whether the students who answer correctly tend to be the higher ability students.
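The random-guessing example can be checked with a short simulation. The sketch below assumes a logistic response rule for 20 ordinary items and adds one item whose answers are pure guesses; all abilities, difficulties, and the 0.25 guessing rate are made-up values.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible
n_students <- 2000
theta <- rnorm(n_students)                # assumed student abilities
delta <- seq(-1.5, 1.5, length.out = 20)  # assumed item difficulties

# 20 items whose responses depend on ability (assumed logistic rule)
p <- plogis(outer(theta, delta, "-"))
ordinary <- matrix(rbinom(length(p), size = 1, prob = p), nrow = n_students)

# One item answered by guessing: ability plays no role
guess <- rbinom(n_students, size = 1, prob = 0.25)

all_items <- cbind(ordinary, guess)
colnames(all_items) <- c(paste0("item", 1:20), "guess")

# Correlation of each item with the total score on the remaining items
total <- rowSums(all_items)
disc <- apply(all_items, 2, function(x) cor(total - x, x))
round(disc, 2)
```

The guessed item should show a discrimination near 0, while the ordinary items show clearly positive values.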

As CTT item/test statistics are based on total scores on a test, the interpretations of the statistics need to be in the context of the particular test administered. For example, if students took different sets of items from the test, then their total scores are not comparable, so CTT statistics will not work.

2.5 IRT

In this session, we will give a brief description of the concepts of IRT. IRT will be discussed in more detail in later sessions. IRT uses a mathematical model for the probability that a student answers an item correctly. This probability is calculated as a function of the item’s difficulty and the student’s ability measure. The more able a student is, the higher the probability that the student will answer an item correctly. Similarly, the easier an item is, the higher the probability that a student will get the right answer.
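One common way to write such a model, stated here as an assumption since this section does not name a specific model, is the Rasch (one-parameter logistic) form. The symbol \(\theta\) for student ability is introduced here for illustration, and \(\delta\) denotes item difficulty:

\[
P(X = 1 \mid \theta, \delta) = \frac{\exp(\theta - \delta)}{1 + \exp(\theta - \delta)}
\]

Under this form the probability rises as \(\theta\) increases and falls as \(\delta\) increases, which matches the two statements above.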

2.6 IRT ability measures

IRT ability measures are on a scale called the “logit” scale. Logit is short for “log-odds-unit” or “logistic unit.” Unlike test scores, which range between 0 and 100%, logits range from \(-\infty\) to \(\infty\).

Because students’ ability measures are assumed to be unbounded, IRT helps to overcome some of the problems caused by ceiling and floor effects, which distort test scores as measures of ability under CTT.
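To give a feel for the scale, the sketch below converts a few proportions to log-odds. It only illustrates how a bounded proportion maps onto an unbounded scale; it is not how IRT ability estimates are computed, and the proportions are arbitrary.

```r
p <- c(0.001, 0.1, 0.5, 0.9, 0.999)  # proportions strictly between 0 and 1
logit <- log(p / (1 - p))            # log-odds, written out explicitly
round(logit, 2)

# Base R's qlogis() computes the same transformation, and plogis() inverts it
all.equal(logit, qlogis(p))
all.equal(plogis(logit), p)
```

A proportion of 0.5 corresponds to 0 logits, and the logit values grow without bound as the proportion approaches 0 or 1.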

2.7 IRT item difficulty

IRT item difficulty is the ability level at which a student has a probability of 0.5 of answering the item correctly. That is, if an item has a difficulty of \(\delta\), then 50% of the students with ability \(\delta\) are expected to obtain the correct answer on the item. Because IRT item difficulty is defined on the ability scale, we can make direct inferences about the probability that a student will answer a question correctly. This is the most important characteristic of IRT. In contrast, ability measures and item difficulties under CTT are not directly comparable.
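A minimal sketch of this property, assuming a Rasch-type model in which the probability of success is the logistic function of ability minus difficulty. The function name `rasch_prob` and the ability values are hypothetical.

```r
# Probability of a correct answer under an assumed Rasch-type model
rasch_prob <- function(theta, delta) {
  plogis(theta - delta)  # exp(theta - delta) / (1 + exp(theta - delta))
}

delta <- 1  # an item with difficulty 1 logit (made-up value)

p_equal <- rasch_prob(theta = 1, delta = delta)  # ability equals difficulty
p_above <- rasch_prob(theta = 2, delta = delta)  # ability 1 logit above
p_below <- rasch_prob(theta = 0, delta = delta)  # ability 1 logit below

round(c(equal = p_equal, above = p_above, below = p_below), 3)
```

When ability equals difficulty the probability is exactly 0.5; a student one logit above the item has a probability of about 0.73, and a student one logit below about 0.27.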

2.8 IRT test reliability

IRT test reliability is analogous to CTT test reliability, except that it is based on students’ ability measures in logits rather than on test scores. The interpretation of IRT reliability is the same as for CTT.

2.9 IRT item discrimination

IRT item discrimination can be assessed through residual-based item fit statistics, as well as through correlations between student ability measures in logits and item scores. Again, the interpretations of these statistics are very similar to those of CTT.

2.10 IRT not tied to particular tests

IRT ability measures can be obtained even when students take different sets of items of a test, because the ability measures are estimated using the item difficulties of the set of items a student takes.
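The sketch below illustrates the idea with two students who both answer 2 of 4 items correctly, but on item sets of different difficulty. It assumes a Rasch-type model, item difficulties that are already known, and a simple maximum likelihood estimate; the function `estimate_ability` and all numbers are hypothetical, and this is not the estimation routine used by any particular package.

```r
# Maximum likelihood ability estimate given known item difficulties.
# Under a Rasch-type model the estimate is the ability at which the
# expected score equals the observed score.
estimate_ability <- function(responses, delta) {
  stopifnot(length(responses) == length(delta),
            sum(responses) > 0,                  # all-wrong has no finite estimate
            sum(responses) < length(responses))  # all-correct has no finite estimate
  score_gap <- function(theta) sum(responses) - sum(plogis(theta - delta))
  uniroot(score_gap, interval = c(-10, 10), tol = 1e-8)$root
}

easy_items <- c(-2, -1.5, -1, -0.5)  # assumed difficulties of an easy item set
hard_items <- c(0.5, 1, 1.5, 2)      # assumed difficulties of a hard item set

# Both students score 2 out of 4, but on different item sets
theta_easy <- estimate_ability(c(1, 1, 0, 0), easy_items)
theta_hard <- estimate_ability(c(1, 1, 0, 0), hard_items)
round(c(easy_set = theta_easy, hard_set = theta_hard), 2)
```

The two students have the same raw score, yet the student who took the harder items receives the higher ability estimate, because the item difficulties enter the estimation.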

2.11 In summary

When all students take the same test, there is not a great deal of difference between CTT and IRT. But IRT results are more generalisable. Further, IRT allows for skill descriptions of what each student can do, because ability measures and item difficulties are placed on the same scale. We can make statements such as “this student is likely to be able to answer items at these difficulty levels, but not items at these difficulty levels.”

2.12 Practical work 1

Below is a hands-on exercise in using R to carry out CTT test and item analyses.

Type the following in the RStudio editing window. Run each line, one at a time (select the line and press Run).

rm(list=ls())  #remove all objects in the R environment
library(TAM)   #load the TAM package
data(data.sim.rasch) #fetch the data file called data.sim.rasch from the TAM package
head(data.sim.rasch) #show the first few lines of the data set

resp <- data.sim.rasch #shorten the data name

scores <- apply(resp,1,sum) #compute test scores
head(scores) #take a look at the first few lines of scores

itemscores <- apply(resp,2,mean) #compute item means (proportion correct, the CTT item difficulty)

#compute item discrimination for item 1
cor(scores,resp[,1]) #discrimination for item 1
cor(scores-resp[,1],resp[,1]) #better, remove item 1 score from total score

#compute item discrimination for all items
disc <- apply(resp,2,function(x){cor(scores-x,x)})

#Easier to use the CTT package for all the above
library(CTT) #load the CTT package, which provides itemAnalysis()
IA <- itemAnalysis(resp) #store results in an object called IA
IA$itemReport #IA contains many components of the results. Use "$" after IA to show different components

2.13 Practical work 2

The CTT package provides a data set called CTTdata. This data set contains “raw” (unscored) item responses to 20 multiple-choice items. Use the following code to call up the data set and the keys (correct answers) for the items.

library(CTT) #make sure the library CTT is loaded
data(CTTdata) #raw item responses
data(CTTkey) #keys (correct answers) for the items

#Use the CTT function score to score the raw data
resp <- score(CTTdata,CTTkey,output.scored = TRUE)

#resp contains two components: resp$score (total scores) and resp$scored (scored item responses)
#carry out item analysis for the data in "resp$scored"

2.14 Homework

The TAM package provides a data set called data.numeracy. The data set contains both raw responses and scored responses (data.numeracy$raw, data.numeracy$scored). Use the CTT package to analyse both the raw data and the scored data in data.numeracy. The keys for this data set can be set using the following R code:

key <- c(1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1)

The results for both the raw and scored data sets should be the same.