
Clin Ther. Author manuscript; available in PMC 2015 May 5.

PMCID: PMC4096146; NIHMSID: NIHMS594145

Abstract

Introduction

The U.S. Food and Drug Administration’s patient-reported outcome (PRO) guidance document defines content validity as “the extent to which the instrument measures the concept of interest” (FDA, 2009, p. 12). “Construct validity is now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity” (Strauss & Smith, 2009, p. 7). Hence both qualitative and quantitative information are essential in evaluating the validity of measures.

Methods

We review classical test theory and item response theory approaches to evaluating PRO measures, including the frequency of responses to each category of the items in a multi-item scale, the distribution of scale scores, floor and ceiling effects, the relationship between item response options and the total score, and the extent to which the hypothesized "difficulty" (severity) order of items is represented by observed responses.

Conclusion

Classical test theory and item response theory can be useful in providing a quantitative assessment of items and scales during the content validity phase of patient-reported outcome measures. Depending on the particular type of measure and the specific circumstances, either one or both approaches should be considered to help maximize the content validity of PRO measures.

Keywords: Content validity, patient-reported outcomes, classical test theory, item response theory, scale development

Introduction

The publication of the U.S. Food and Drug Administration’s patient-reported outcome (PRO) guidance document1 has generated discussion and debate around the methods used to develop a PRO instrument and to establish the content validity of a PRO instrument. The guidance outlines the information that the FDA will consider when evaluating a PRO measure as a primary or secondary endpoint to support a medical product label claim. The PRO Guidance highlights the importance of establishing evidence of content validity, defined as “the extent to which the instrument measures the concept of interest” (p. 12).1

Content validity is the extent to which an instrument covers the important concepts of the unobservable, or latent attribute (e.g., depression, anxiety, physical functioning, self-esteem), that the instrument purports to measure. It is the degree to which the content of a measurement instrument is an adequate reflection of the construct being measured. Hence, qualitative work with patients is essential to ensure that a PRO instrument captures all of the important aspects of the concept from the patient’s perspective.

Two reports from the International Society of Pharmacoeconomics and Outcomes Research (ISPOR) Good Research Practices Task Force2–3 detail the qualitative methodology and five steps that should be employed to establish content validity of a PRO measure. These five steps cover the following general themes: 1) determine the context of use (e.g., medical product labelling); 2) develop the research protocol for qualitative concept elicitation and analysis; 3) conduct the concept elicitation interviews and focus groups; 4) analyze the qualitative data; and 5) document concept development, elicitation methodology, and results. Essentially, the inclusion of the entire range of relevant issues in the target population embodies adequate content validity of a PRO instrument.

While qualitative data from interviews and focus groups with the targeted patient sample is necessary to develop PRO measures, qualitative data alone is not sufficient to document the content validity of the measure. Along with qualitative methods, quantitative methods are needed to develop PRO measures with good measurement properties. Quantitative data gathered during earlier stages of instrument development can serve at least three purposes: 1) as a barometer to see how well items address the entire continuum of the targeted concept of interest; 2) as a gauge of whether to go forward with psychometric testing; and 3) as a meter to mitigate risk related to phase 3 signal detection and interpretation.

Specifically, quantitative methods can support development of PRO measures by addressing several core questions of content validity. What is the range of item responses relative to the sample (distribution of item responses and their endorsement)? Are the response options used by patients as intended? Does a higher response option imply more of a health problem than a lower response option? What is the distance between response categories in terms of the underlying concept?

Also relevant are the extent to which the instrument reliably assesses the full range of the target population (scale-to-sample targeting), the presence of ceiling or floor effects, and the distribution of the total scores. Does the item order with respect to degree of severity of the disease reflect the hypothesized item order? To what extent do item characteristics relate to how patients rank the items in terms of their importance or bother?

This paper reviews classical test theory and item response theory approaches that can be used in developing PRO measures and in addressing questions such as those raised above. These content-based questions and the two quantitative approaches for addressing them are consistent with construct validity now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity (p. 7).4 The use of quantitative methods early in instrument development is aimed at providing descriptive profiles and exploratory information about the content represented in a draft PRO instrument. Confirmatory psychometric evaluations, occurring at the later stages of instrument development, should be used to provide more definitive information regarding the measurement characteristics of the instrument.

Classical Test Theory

Classical test theory is a traditional quantitative approach to testing the reliability and validity of a scale based on its items. In the context of PRO measures, classical test theory assumes that each observed score (X) on a PRO instrument is a combination of an underlying true score (T) on the concept of interest and unsystematic (i.e., random) error (E). Classical test theory, also known as true score theory, assumes that each person has a true score, T, that would be obtained if there were no errors in measurement. A person’s true score is defined as the expected score over an infinite number of independent administrations of the scale. Scale users never observe a person’s true score, only an observed score, X. The observed score is assumed to equal the true score plus some error: X = T + E.

True scores quantify values on an attribute of interest, defined here as the underlying concept, construct, trait, or ability of interest (the “thing” intended to be measured). As values of the true score increase, responses to items representing the same concept should also increase (i.e., there should be a monotonically increasing relationship between true scores and item scores), assuming that item responses are coded so that higher responses reflect more of the concept.

It is also assumed that random errors (i.e., the difference between a true score and a set of observed scores on the same individual) found in observed scores are normally distributed and, therefore, that the expected value of such random fluctuations (i.e., mean of the distribution of errors over a hypothetical infinite number of administrations on the same subject) is taken to be 0. In addition, random errors are assumed to be uncorrelated with a true score, with no systematic relationship between a person’s true score and whether that person has positive or negative errors.
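As a concrete illustration of these assumptions, the following minimal sketch (Python; the distributions and numbers are invented for illustration and are not from the paper) simulates the decomposition X = T + E, checks that the simulated errors average near zero and are essentially uncorrelated with the true scores, and shows the classical definition of reliability as the ratio of true-score variance to observed-score variance.

```python
import numpy as np

# Minimal simulation of the classical test theory decomposition X = T + E.
# True scores (T) vary across persons; random errors (E) have mean 0 and are
# uncorrelated with T. All values below are illustrative assumptions.
rng = np.random.default_rng(0)

n_persons = 10_000
true_scores = rng.normal(loc=50, scale=10, size=n_persons)   # T
errors = rng.normal(loc=0, scale=5, size=n_persons)          # E, mean 0
observed = true_scores + errors                               # X = T + E

print("mean error (should be near 0):", round(errors.mean(), 3))
print("corr(T, E) (should be near 0):", round(np.corrcoef(true_scores, errors)[0, 1], 3))

# Reliability under this model is var(T) / var(X); with SD(T) = 10 and
# SD(E) = 5 the expected value is 100 / 125 = 0.80.
print("reliability estimate:", round(true_scores.var() / observed.var(), 3))
```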

Descriptive Assessment

In the development of a PRO measure, the means and standard deviations of the items can provide fundamental clues about which items are useful for assessing the concept of interest. Generally, the higher the variability of the item scores and the closer the mean score of the item is to the center of its distribution (i.e., median), the better the item will perform in the target population.

In the special case of dichotomous items, scored "0" for one response and "1" for the other, the proportion of respondents in a sample who select the response choice scored "1" is equivalent to the mean item response. If a particular response option (typically the response choice scored "1") represents an affirmative response or the presence of what is asked, the item is said to be endorsed. Items for which everyone gives the same response are uninformative because they do not differentiate between individuals. In contrast, dichotomous items that yield about an equal number of people (50%) selecting each of the two response options provide the best differentiation between individuals in the sample overall. For items with ordinal response categories, which have more than two categories, an equal or a uniform spread across response categories yields the best differentiation. While ideal, such a uniform spread is typically difficult to obtain (unless the researcher makes it a direct part of the sampling frame during the design stage) as it depends in part on the distribution of the sampled patients, which is outside the full control of the researcher.

Item difficulty, which is taken from educational psychology, may or may not be an apt term in health-care settings. In this paper we equate item difficulty with item severity, and we use the two terms interchangeably. The more suitable and interpretable term depends on the particular health-care application. Item difficulty, or severity, can be expressed on a z-score metric by transforming the proportion endorsed using the following formula: z = ln[p/(1 - p)]/1.7, where z scores come from a standardized normal distribution (with mean of 0, standard deviation of 1), ln represents the natural logarithm, p represents the probability endorsed, and 1.7 is a scaling factor for a normal distribution. The z-scores of the items can be ordered so that, for instance, the items with higher z-scores are considered more difficult relative to the other items. For items with ordinal response categories, adjacent categories can be combined meaningfully to form a binary indicator for the purpose of examining item severity (difficulty).
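The transformation above can be applied directly to observed proportions. The short sketch below (Python, with hypothetical proportions) places a set of items on the z-score metric so they can be rank-ordered; the sign convention for "more difficult" depends on how endorsement is coded, so the values are shown only for ordering items relative to one another.

```python
import numpy as np

# Place items on a z-score metric from their proportions endorsed, using the
# transformation given in the text: z = ln[p / (1 - p)] / 1.7.
# The proportions below are hypothetical illustrations.
def difficulty_z(p):
    return np.log(p / (1.0 - p)) / 1.7

proportions = {"item A": 0.90, "item B": 0.75, "item C": 0.50,
               "item D": 0.25, "item E": 0.10}

for item, p in proportions.items():
    print(f"{item}: proportion endorsed = {p:.2f}, z = {difficulty_z(p):+.2f}")
# The z values put all items on one standardized scale so their relative
# ordering can be compared (the direction follows the coding of p).
```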

Item Discrimination

The more an item discriminates among individuals with different amounts of the underlying concept of interest, the higher the discrimination index.5 The discrimination index can be applied with a binary item response or an ordinal item response made binary (by combining adjacent categories meaningfully to form a binary indicator). The extreme group method can be used to calculate the discrimination index using the following three steps.

Step 1 is to partition respondents who have the highest and lowest overall scores on the overall scale, aggregated across all items, into upper and lower groups. The upper group can be composed of the top x% (e.g., 25%) of scores on the scale, while the lower group can be composed of the bottom x% (e.g., 25%) of scores on the scale.

Step 2 is to examine each item and determine, separately within the upper group and the lower group, the proportion of respondents who endorse the item (or who respond to a particular category or set of adjacent categories of the item).

Step 3 is to subtract the pair of proportions noted in Step 2. The higher this item discrimination index, the more the item discriminates. For example, if 60% of the upper group and 25% of the lower group endorse a particular item in the scale, the item discrimination index for that item would be calculated as (0.60−0.25) = 0.35. It is useful to compare the discrimination indexes for each of the items in the scale, as illustrated in Table 1. In this example, Item 1 provides the best discrimination, Item 2 provides the next best discrimination, and Items 3 and 4 are poor at discriminating.

Table 1

Illustration Example of Using the Item Discrimination Index

Item    Proportion Endorsed for Upper Group    Proportion Endorsed for Lower Group    Item-Discrimination Index
1       0.90                                   0.10                                   0.80
2       0.85                                   0.20                                   0.65
3       0.70                                   0.65                                   0.05
4       0.10                                   0.70                                   −0.60
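The extreme-group calculation in Steps 1 through 3 reduces to a subtraction of two proportions once the upper and lower groups have been formed. The sketch below (Python) reproduces the last column of Table 1 from the proportions in its first two columns.

```python
# Item discrimination index by the extreme-group method (Steps 1-3 above),
# using the proportions endorsed shown in Table 1.
table1 = {
    # item: (proportion endorsed, upper group; proportion endorsed, lower group)
    1: (0.90, 0.10),
    2: (0.85, 0.20),
    3: (0.70, 0.65),
    4: (0.10, 0.70),
}

for item, (p_upper, p_lower) in table1.items():
    index = p_upper - p_lower   # Step 3: difference of the two proportions
    print(f"Item {item}: discrimination index = {index:+.2f}")
# Output reproduces the last column of Table 1: +0.80, +0.65, +0.05, -0.60.
```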

Another indicator of item discrimination is how well an item correlates with the sum of the remaining items on the same scale or domain, or the corrected item-to-scale correlation (“corrected” because the sum or total score does not include that item). It is best to have relatively “large” corrected item-to-scale correlations (e.g., 0.37 and above according to Cohen’s rule of thumb).6 An item with a low corrected item-to-scale correlation indicates that this item is not as closely associated with the scale relative to the rest of the items in the scale.
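A corrected item-to-scale correlation can be computed by correlating each item with the sum of the remaining items. The sketch below (Python) illustrates the calculation on a small simulated data set; the response matrix and its dimensions are invented for illustration.

```python
import numpy as np

# Corrected item-to-scale correlation: correlate each item with the sum of the
# remaining items (the item itself is excluded from the total).
rng = np.random.default_rng(1)
n_persons, n_items = 200, 5
latent = rng.normal(size=n_persons)
# Simulated 5-point ordinal responses driven by a common latent variable.
items = np.clip(np.round(3 + latent[:, None] +
                         rng.normal(scale=1.0, size=(n_persons, n_items))), 1, 5)

total = items.sum(axis=1)
for j in range(n_items):
    rest = total - items[:, j]                      # total score without item j
    r = np.corrcoef(items[:, j], rest)[0, 1]
    print(f"Item {j + 1}: corrected item-total correlation = {r:.2f}")
```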

An item’s response categories can be assessed by analyzing its item response curves, which are produced descriptively in classical test theory by plotting the percentage of subjects choosing each response option on the y-axis against the total score (expressed as raw scores, percentiles, or another metric) on the x-axis. Figure 1 provides an illustration. Item 1 is equally good at discriminating across the continuum of the attribute (the concept of interest). Item 2 discriminates better at the lower end than at the upper end of the attribute. Item 3 discriminates better at the upper end, especially between the 70th and 80th percentiles.

Figure 1. Illustrative Item Response Curves

An item difficulty-by-discrimination graph, shown in Figure 2, depicts how well the items in a scale span the range of difficulty (or severity) along with how well each item represents the concept. The data shown come from 16 items included in a 50-item test administered to five students. (Thirty-three of the items in the test were answered correctly by all five students, while one of the items was answered incorrectly by all five students.) The figure shows the 16 items by sequence number in the test. For example, the third item in the test is shown in the upper left of Figure 2 and is labeled as "3." This item (Which of the following could be a patient-reported measure?) was answered correctly by four of the five students (80%) and had a corrected item-total correlation of −0.63 ("corrected" to remove that item from the total scale score). The reason this item had a negative correlation with the total scale score is that the student with the best test score was the only one of the five who got the question wrong. Items 10 and 27 both were answered correctly by two of the five students (40%) and each had an item-scale correlation of 0.55. The "easiest" or "least severe" items (100% of students got them right) and the "hardest" or "most severe" item (0% of students got it right) are not shown in the figure. The range of difficulty estimates for the other items is limited by the small sample of students, but Figure 2 shows that eight of the items were easier items (80% correct), two items were a little harder (60% correct), three items were even harder (40% correct), and three other items were among the hardest (20% correct).

Figure 2. Difficulty-by-Discrimination Graph

Note: Numbers in the figure represent sequence in which the item was administered in a 50-item test given to five students. Sixteen items are shown in the figure. Thirty-three items that all the students answered correctly and one item that no students answered correctly are not shown.

Dimensionality

To evaluate the extent to which the items measure a hypothesized concept distinctly, the item-scale correlations of items with the scale they are intended to measure (corrected for item overlap with the total scale score) can be compared with correlations of those same items with other scales (either sub-scales within the same PRO measure or scales from different PRO measures). This approach has been referred to as multi-trait scaling analysis and can be implemented, for example, using a SAS macro.7 While some users of the methodology suggest it evaluates item convergent and discriminant validity, we prefer "item convergence within scales" and "item discrimination across scales," because one learns that items sort into different scales ("bins") but the validity of the scales per se is still unknown.

Factor analysis is a statistical procedure that is analogous to multi-trait scaling.8 In exploratory factor analysis, there is uncertainty as to the number of factors being measured; the results of the analysis are used to help identify the number of factors. Exploratory factor analysis is suitable for generating hypotheses about the structure of the data. In addition, it can help in further refining an instrument by revealing which items may be dropped from the PRO instrument because they contribute little to the presumed underlying factors. While exploratory factor analysis explores the patterns in the correlations of items (or variables), confirmatory factor analysis (which is appropriate for later stages of PRO development) tests whether the variances and covariances of the items conform to an anticipated or expected scale structure given in a particular research hypothesis. While factor analysis is not necessarily connected to content validity per se, it can be useful for testing the conceptual framework, that is, whether the items map to the hypothesized underlying factors.

Reliability

Reliability is important in the development of PRO measures, including for content validity. Validity is limited by reliability. If responses are inconsistent (not reliable), it necessarily implies invalidity as well (note that the converse is not true: consistent responses do not necessarily imply valid responses).9

While we are primarily concerned with validity in early scale development, reliability is a necessary property of the scores produced by a PRO instrument and is important to consider in early scale development. Reliability refers to the proportion of variance in a measure that can be ascribed to a common characteristic shared by the individual items, whereas validity refers to whether that characteristic is actually the one intended. Test-retest reliability, which can apply to both single-item scales and multi-item scales, reflects the reproducibility of scale scores upon repeated administrations over a period of time when the respondent’s condition has not changed. To compute test-retest reliability, the kappa statistic can be used for categorical responses and the intraclass correlation coefficient can be used for continuous responses (or responses taken as such).

Further, having multiple items in a scale increases its reliability. For multi-item scales, one of the most common indicators of scale reliability is Cronbach’s coefficient alpha, which is driven by two elements: the correlations between the items and the number of items in the scale. In general, a measure’s reliability equals the proportion of total variance among its items that is due to the latent variable and is thus considered communal or shared variance.

The greater the proportion of shared variation, the more the items have in common and the more consistent they are in reflecting a common true score. The covariance-based formula for coefficient alpha expresses this reliability while adjusting for the number of items: alpha = [k/(k − 1)] × [1 − (sum of the item variances)/(variance of the total score)], where k is the number of items. The corresponding correlation-based (standardized) formula represents coefficient alpha in terms of the average inter-item correlation among all pairs of items, again adjusted for the number of items: alpha = k·r̄/[1 + (k − 1)·r̄], where r̄ is the average inter-item correlation.
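The sketch below (Python, on simulated item scores) computes coefficient alpha both ways, from the item and total-score variances and from the average inter-item correlation; the two values are similar when the item variances are comparable.

```python
import numpy as np

# Cronbach's coefficient alpha computed two ways for a simulated 4-item scale:
# (1) covariance-based: alpha = [k/(k-1)] * (1 - sum(item variances)/var(total))
# (2) correlation-based (standardized): alpha = k*r_bar / (1 + (k-1)*r_bar)
rng = np.random.default_rng(2)
n_persons, k = 300, 4
latent = rng.normal(size=n_persons)
items = latent[:, None] + rng.normal(scale=1.0, size=(n_persons, k))  # simulated scores

item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha_cov = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

corr = np.corrcoef(items, rowvar=False)
r_bar = corr[np.triu_indices(k, 1)].mean()          # average inter-item correlation
alpha_corr = k * r_bar / (1 + (k - 1) * r_bar)

print(f"covariance-based alpha:  {alpha_cov:.3f}")
print(f"correlation-based alpha: {alpha_corr:.3f}")
```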

Sample Size Considerations

In general, different study characteristics affect sample size considerations, such as the research objective, type of statistical test, sampling heterogeneity, statistical power or level of confidence, error rates, and the type of instrument being tested (e.g., the number of items and the number of categories per item). In the quantitative component of content validity, a stage considered exploratory, a reliable set of precise values for all measurement characteristics is not expected and, as such, formal statistical inferences are not recommended. Consequently, sample size adequacy in this early stage, which should emphasize absolute and relative directionality, does not have the same level of importance as it would in later (and especially confirmatory) phases of PRO instrument development, regardless of the methodology employed.

Nevertheless, sample sizes based on classical test theory should be large enough for the descriptive and exploratory pursuit of meaningful estimates from the data. While it is not appropriate to give one number for sample size in all such cases, starting with a sample of 30 to 50 subjects may be reasonable in many circumstances. If no clear trends emerge, adding more subjects may be needed to observe any noticeable patterns. It should be emphasized that an appropriate sample size depends on the situation at hand, such as the number of response categories. An 11-point numeric rating scale, for instance, may not have enough observations in the extreme categories and may require a larger sample size. In addition to increasing the sample size, another way to achieve a more even spread of observations across categories of a scale is to plan for it at the design stage by recruiting individuals who provide sufficient representation across the response categories.

Sample sizes for more rigorous quantitative analyses, at later stages of psychometric testing, should be large enough to meet a desired level of measurement precision or standard error.10 With sample sizes of 100, 200, 300, and 400, the standard errors around a correlation are approximately 0.10, 0.07, 0.06, and 0.05, respectively. Various recommendations have been given for exploratory factor analyses. One recommendation is to have at least five cases per item and a minimum of 300 cases.11 Another rule of thumb is to enlist a sample size of at least 10 times the number of items being analyzed, so a 20-item questionnaire would require at least 200 subjects. It should be stressed, however, that adequate sample size depends directly on the properties of the scale itself, rather than on rules of thumb. A poorly defined factor (e.g., one with too few items) or weakly related items (low factor loadings) may require substantially more individuals to obtain precise estimates.12
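As a quick check of the standard errors quoted above, the snippet below applies the common large-sample approximation that the standard error of a correlation is roughly 1/√n (an approximation assumed here, not a formula stated in the paper).

```python
import math

# Approximate standard error of a correlation under the common large-sample
# rule SE(r) ~ 1 / sqrt(n), used here only to check the quoted values.
for n in (100, 200, 300, 400):
    print(f"n = {n}: SE ~= {1 / math.sqrt(n):.3f}")
# Prints roughly 0.100, 0.071, 0.058, 0.050, in line with the 0.10, 0.07,
# 0.06, and 0.05 cited in the text.
```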

For confirmatory factor analysis, rules of thumb have been offered about the minimum number of subjects per each parameter to be estimated (e.g., at least 10 subjects per parameter). The same caution given about such rules of thumb for exploratory factor analysis also applies to confirmatory factor analysis. If a measure is to be used in a specific subgroup (e.g., Asian-Americans), then a sufficient sample size is needed to represent that subgroup. Statistical power and sample sizes for confirmatory factor analysis are explained in more detail elsewhere.13

In some situations (e.g., when large patient accrual is not feasible or when responses are diverse or heterogeneous in spanning the item categories) a smaller sample size might be considered sufficient. In these situations, analytical methods can include simple descriptive statistics for the items and subscales of a PRO measure (item-level means, standard errors, and counts, as well as correlations between items). Replication of psychometric estimates is needed, either from a sufficiently large and representative sample that can be split into two subsamples for cross-validation or from two separate samples of sufficient size. One sample is used to explore the properties of the scale and the second sample is used to confirm the findings from the first sample. If the results of the two samples are inconsistent, then psychometric estimates from another sample may be required to establish the properties of the measure.

Item Response Theory

Item response theory (IRT) is a collection of measurement models that attempt to explain the connection between observed item responses on a scale and an underlying construct. Specifically, IRT models are mathematical equations describing the association between subjects’ levels on a latent variable and the probability of a particular response to an item, using a non-linear monotonic function.14 As in classical test theory, IRT requires that each item be distinct from the others yet be similar and consistent with them in reflecting all important respects of the underlying attribute or construct. Item parameters in IRT are estimated directly from logistic models rather than from proportions (for difficulty or threshold) and item-scale correlations (for discrimination). IRT models vary in the number of parameters estimated (one-, two-, and three-parameter models) and in whether they handle only dichotomous items or, more generally, polytomous items (see Table 2).

Table 2

Common IRT Models Applied to Patient-Reported Outcomes

Model                          Item Response Format   Model Characteristics
Rasch/1-Parameter Logistic     Dichotomous            Discrimination power equal across all items. Threshold varies across items.
2-Parameter Logistic           Dichotomous            Discrimination and threshold parameters vary across items.
Graded Response                Polytomous             Ordered responses. Discrimination varies across items.
Nominal                        Polytomous             No pre-specified item order. Discrimination varies across items.
Partial Credit (Rasch Model)   Polytomous             Discrimination power constrained to be equal across items.
Rating Scale (Rasch Model)     Polytomous             Discrimination equal across items. Item threshold steps equal across items.
Generalized Partial Credit     Polytomous             Variation of Partial Credit Model with discrimination varying across items.

In the simplest case, item responses are evaluated in terms of a single parameter, difficulty (severity). For a binary item, the difficulty, or severity, parameter indicates the level of the attribute (e.g., the level of physical functioning) at which a respondent has a 50% chance of endorsing the dichotomous item. In this paper, without loss of generality, items with higher levels of difficulty or severity are those that require higher levels of health – for instance, running as opposed to walking for physical functioning. Items and response options can be written in the other direction so that “more difficult” requires “worse health” but here the opposite is assumed.

For a polytomous item, the meaning of the difficulty or severity parameter depends on the model used and represents a set of values for each item. For the graded response model, a type of two-parameter model that allows item discrimination as well as difficulty to vary across items, the difficulty parameter associated with a particular category k of an item reflects the level of the attribute at which patients have a 50% chance of scoring in a category lower than k versus category k or higher. For the partial credit model, a generalization of the one-parameter (Rasch) dichotomous IRT model in which all items have equal discrimination, the difficulty parameter is referred to as a threshold parameter, which reflects the level of the attribute at which a response in either of two adjacent categories is equally likely.

A one-parameter IRT model ("Rasch" model) for dichotomous items can be written as follows: P_i(X = 1 | Θ) = exp(Θ − b_i) / [1 + exp(Θ − b_i)], where P_i(X = 1 | Θ) is the probability that a randomly selected respondent with level Θ (the Greek letter "theta") on the latent trait will endorse item i, and b_i is the item difficulty (severity) parameter. In the one-parameter IRT model, each item is assumed to have the same amount of item discrimination.

In a two-parameter IRT model, an item discrimination parameter is added to the model. A two-parameter model for a dichotomous item can be written as follows:

P_i(X = 1 | Θ) = exp[D·a_i(Θ − b_i)] / {1 + exp[D·a_i(Θ − b_i)]},

where D is a scaling constant (D = 1.7 scales the logistic function to approximate the normal ogive model), a_i is the discrimination parameter, and the other terms remain the same as before. An important feature of the two-parameter model is that the distance between an individual’s trait level and an item’s severity has a greater impact on the probability of endorsing highly discriminating items than on the probability of endorsing less discriminating items. In particular, more discriminating items provide more information (than do less discriminating items), and even more so when a respondent’s level on the latent attribute is close to the item’s severity location.
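The one- and two-parameter logistic functions defined above can be evaluated directly. The sketch below (Python, with hypothetical item parameters) compares a weakly and a highly discriminating item that share the same difficulty, illustrating how the distance between Θ and b matters more for the highly discriminating item.

```python
import numpy as np

# One- and two-parameter logistic IRT models as written above.
def p_1pl(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def p_2pl(theta, a, b, D=1.7):
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

# Hypothetical items with the same difficulty but different discrimination.
theta = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("1PL, b = 0:         ", np.round(p_1pl(theta, b=0.0), 2))
print("2PL, a = 0.5, b = 0:", np.round(p_2pl(theta, a=0.5, b=0.0), 2))
print("2PL, a = 2.0, b = 0:", np.round(p_2pl(theta, a=2.0, b=0.0), 2))
# The highly discriminating item (a = 2.0) rises from near 0 to near 1 over a
# much narrower range of theta around its difficulty b than the a = 0.5 item.
```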

Item Characteristic Curve

The item characteristic curve (ICC) is the fundamental unit in IRT and can be understood as the probability of endorsing an item (for a dichotomous response) or responding to a particular category of an item (for a polytomous response) for individuals with a given level of the attribute. In the latter case, the ICC is sometimes referred to as a category response curve. Depending on the IRT model used, these curves indicate which items (or questions) are more difficult (severe) and which items are better discriminators of the attribute.

For example, if the attribute were mental health, the person with better mental health (here assumed to have higher levels of Θ) would be more likely to respond favorably to an item that assesses better mental health (an item with a higher level of “difficulty” needed to achieve that better state of mental health). If an item were a good discriminator of mental health, the probability of a positive response to this item (representing better mental health) would increase more rapidly as the level of mental health increases (larger slope of the ICC); given higher levels of mental health, the (conditional) probability of a positive response would increase noticeably across these higher levels. The various IRT models, which are variations of logistic (i.e., non-linear) models, are simply different mathematical functions for describing ICCs as the relationship of a person’s level on the attribute and an item’s characteristics (e.g., difficulty, discrimination) with the probability of a specific response on that item measuring the same attribute.

Category Response Curves

In IRT models a function, analogous to an ICC for a dichotomous response, can be plotted for each category of an item with more than two response categories (i.e., polytomous response scale). Such category response curves help in the evaluation of response options for each item by displaying the relative position of each category along the underlying continuum of the concept being measured. The ideal category response curve is characterized by each response category being most likely to be selected for some segment of the underlying continuum of the attribute (the person’s location on the attribute), with different segments corresponding to the hypothesized rank order of the response options in terms of the attribute or concept.

Figure 3 below shows an example category response curve for an item in the PROMIS 4-item general health scale: In general, how would you rate your physical health?15 The item has five response options: poor, fair, good, very good, excellent. The x-axis of Figure 3 shows the estimated physical health score (theta, Θ) depicted on a z-score metric with a more positive score representing better physical health. The y-axis shows the probability of selecting each response option. The figure shows that the response options for the item are monotonically related to physical health as expected and each response option is most likely to be selected at some range of the underlying construct (i.e., physical health).

Figure 3. Category Response Curves for PROMIS Physical Health Summary Score from Graded Response Model

Note: The labels indicate the five response options (poor, fair, good, very good, excellent) for the item. The probability of picking each of the response options by estimated physical health (theta) is based on data from the Patient-Reported Outcomes Measurement Information System (PROMIS®) project: http://www.nihpromis.org/
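For readers who want to see how such category response curves arise computationally, the sketch below (Python) evaluates category probabilities under a graded response model; the discrimination and threshold values are invented for illustration and are not the PROMIS estimates.

```python
import numpy as np

# Category probabilities for a 5-category item under a graded response model:
# cumulative curves P*(X >= k) are logistic in theta, and each category
# probability is the difference of adjacent cumulative curves.
# Parameter values below are invented for illustration (not PROMIS estimates).
a = 2.0                                   # discrimination
b = np.array([-2.0, -0.8, 0.4, 1.6])      # ordered thresholds for categories 2..5

def category_probs(theta):
    cum = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P*(X >= 2), ..., P*(X >= 5)
    cum = np.concatenate(([1.0], cum, [0.0]))       # add P*(X >= 1) = 1 and 0
    return cum[:-1] - cum[1:]                       # P(X = 1), ..., P(X = 5)

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}:", np.round(category_probs(theta), 2))
# At low theta the lowest category ("poor") is most likely; as theta increases,
# successively higher categories become the most likely response.
```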

Item Information

Item information provides an assessment of the precision of measurement of an item for distinguishing among subjects across different levels of the underlying concept or attribute being measured (Θ); higher information implies more precision. Item information depends on the item parameters. For the two-parameter dichotomous logistic model, the item information function I_i(Θ) for item i at a specific value of Θ is equal to I_i(Θ) = a_i² · P_i · (1 − P_i), where P_i is the proportion of people with that level of the attribute who endorse item i. For a dichotomous item, item information reaches its highest value at Θ = b_i, which occurs when the probability of endorsing the ith item is P_i = 0.5. The amount of item information (precision) decreases as the item difficulty differs from the respondent’s attribute level and is lowest at the extremes of the scale (i.e., for those scoring very low or very high on the underlying concept).
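The item information formula above is simple to evaluate. The sketch below (Python, with hypothetical item parameters and following the formula as given in the text) shows that information peaks where Θ equals the item difficulty and declines as Θ moves away from it.

```python
import numpy as np

# Item information for a dichotomous two-parameter item, following the formula
# in the text: I(theta) = a^2 * P(theta) * (1 - P(theta)) on the logistic metric.
# Parameter values are hypothetical.
def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

a, b = 1.5, 0.5
for theta in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"theta = {theta:+.1f}: information = {item_information(theta, a, b):.3f}")
# Information peaks at theta = b (= 0.5 here), where P = 0.5 and I = a^2 * 0.25,
# and it falls off as theta moves away from b.
```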

Item information sums together to form scale information. Figure 4 shows scale information for the PROMIS Physical Health scale. Again, the x-axis shows the estimated physical health score (theta, Θ) depicted on a z-score metric with a more positive score representing better physical health. The left-hand side of the y-axis shows the information and the right-hand side of the y-axis shows the standard error (SE) of measurement. The peak of the curve shows where the physical health measure yields the greatest information about respondents (in the z-score range from −2 to −1). The standard error of measurement is inversely related to, and a mirror image of, information (see below for further discussion of the relationships among information, reliability and the standard error of measurement).

Figure 4. Scale Information Curve for PROMIS Physical Health Score

Note: Information and standard error of measurement are shown for the Patient-Reported Outcomes Measurement Information System (PROMIS®) 4-item physical health scale.

The item information curve is peaked, providing more information and precision, when the a parameter (the item discrimination parameter) is high; when the a parameter is low, the item information curve is flat. An item with a = 1 provides four times as much information as an item with a = 0.5 (as seen by the squared term in the item information function given in the previous paragraph). The value of the a parameter can be negative; however, this results in a monotonically decreasing item response function, implying that people with high amounts of the attribute have a lower probability of responding in categories representing more of the concept than people with lower amounts of the attribute. Such bad items should be weeded out of an item pool, especially when the parameter is estimated with sufficient precision (i.e., based on a sufficient sample size).

Reliability for measures scored on a z-score metric for the person attribute parameter Θ (mean = 0 and SD = 1) is equal to 1 − SE², where SE = 1/√(information) and SE represents the standard deviation associated with a given Θ. So information is directly related to reliability; for example, information of 10 is equivalent to reliability of 0.90. In addition to estimates of information for each item, IRT models yield information for the combination of items such as a total scale score. Information typically varies by location along the underlying continuum of the attribute (i.e., for people who score low, in the middle, and high on the concept).
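These relationships among information, standard error, and reliability can be verified with a few lines of arithmetic, as in the snippet below.

```python
import math

# Relationship among scale information, standard error, and reliability for
# scores on a z-score metric: SE = 1 / sqrt(information), reliability = 1 - SE^2.
for information in (5, 10, 20):
    se = 1.0 / math.sqrt(information)
    reliability = 1.0 - se**2
    print(f"information = {information:>2}: SE = {se:.3f}, reliability = {reliability:.2f}")
# information = 10 gives SE of about 0.316 and reliability 0.90, as noted above.
```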

In Rasch measurement, the person separation index is used as a reliability index, because reliability reflects how accurately or precisely the scores separate or discriminate among persons; it summarizes genuine person separation relative to that separation plus measurement error.9 Measurement error consists of both random error and systematic error and represents the discrepancy between scores obtained and their corresponding true scores. The person separation index is based on the basic definition of reliability from classical test theory, as the ratio of true score variance to observed variance (which equals the true score variance plus the error variance). As noted earlier, the level of measurement error is not uniform across the range of a scale and is generally larger for more extreme scores (low and high scores).

Person-Item Map

It is common to fix the mean of the item difficulties to equal 0. If the PRO measure is easy for the sample of persons, the mean across person attributes will be greater than zero (Θ > 0); if the PRO measure is hard for the sample, the mean of Θ will be less than zero (Θ < 0). Those most comfortable with the Rasch model (one-parameter model) produce person-item (or Wright) maps to show the relationship between item difficulty and person attribute. In principle, these maps can illuminate the extent of item coverage or comprehensiveness, the amount of redundancy, and the range of the attribute in the sample.

If the items have been written based on a construct map (a structured and ordered definition of the underlying attribute, conceived of in advance, in which the items intended to measure the attribute are positioned hierarchically), the item map that follows the construct map can be used as evidence congruent with content validity.16 A construct map is informed by a strong theory of which set of items requires higher levels of the attribute for endorsement.

Figure 5 portrays such a person-item map of a 10-item scale on physical functioning.9 Because of the scale content, the person attribute here is referred to as person ability. With a recall period of the past 4 weeks, each item is pegged to a different physical activity but raises the same question: "In the past 4 weeks, how difficult was it to perform the following activity?" Each item also has the same set of five response options: 1 = extremely difficult, 2 = very difficult, 3 = moderately difficult, 4 = slightly difficult, 5 = not difficult. This example assumes that all activities were attempted by each respondent during the last 4-week recall interval.

Figure 5. Illustration of person-item map on physical functioning.

Note: M = mean of person distribution or item distribution, S = single standard deviation from the person mean or the item mean, T = two standard deviations from the person mean or the item mean. Source: Cappelleri et al.9

Also assume that item difficulty (severity) emanated from the rating scale model, a polytomous Rasch model, where each item has its own difficulty parameter separated from the common set of categorical threshold values across items.17 If the more general partial credit model was fit instead, the mean of the four category threshold parameters for each item could be used to represent an item’s difficulty. If the response option were binary instead of ordinal, a one-parameter (Rasch) binary logistic model could have been fit to obtain the set of item difficulties and attribute values.

At least three points are noteworthy. First, the questionnaire contains more easy items than hard ones, as seven of the 10 items have location (logit) scores on item difficulty (severity) below 0. Second, some items have the same difficulty scores, and not much scale information would be sacrificed if one of the dual items and two of the triplet items were removed. Third, patients tend to cluster at the higher end of the scale (note that the mean location score is about 1 for the ability of persons, which exceeds the fixed mean location of 0 for difficulty of items), indicating that most of these patients would be likely to endorse (or respond favorably to) several of these items. Thus, this group of patients either had a high degree of physical functioning or, consistent with the previous evaluation of the items, there are not enough challenging or more difficult items, such as those on moderate activities (e.g., moving a table, pushing a vacuum cleaner, bowling, or playing golf).

It should be noted that ICCs, which cannot cross in the one-parameter Rasch model, can cross in the two-parameter and three-parameter models when the discrimination parameters among items vary, which can confound the item ordering.9 For two- and three-parameter models, a consequence of this is that Item 1 might be more difficult than Item 2 for low levels of the attribute, whereas Item 2 might be more difficult than Item 1 for high levels of the attribute. In such a case the item ordering will not correspond in the same way as the item difficulty parameters. The Rasch model, which assumes equal item discrimination, corresponds exactly to (and is defined by) the order of item difficulties and hence the order of the items is constant throughout levels of the attribute.
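The crossing of ICCs under unequal discrimination can be seen numerically. The sketch below (Python, with two hypothetical two-parameter items) shows that the item that is "harder" (less likely to be endorsed) at low levels of Θ is not the harder item at high levels of Θ, whereas a Rasch model with equal discriminations cannot produce such a reversal.

```python
import numpy as np

# Two hypothetical two-parameter items whose ICCs cross because their
# discriminations differ: which item appears "harder" depends on theta.
def p_2pl(theta, a, b, D=1.7):
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

item1 = {"a": 2.0, "b": 0.0}    # highly discriminating
item2 = {"a": 0.5, "b": 0.3}    # weakly discriminating

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p1, p2 = p_2pl(theta, **item1), p_2pl(theta, **item2)
    harder = "item 1" if p1 < p2 else "item 2"
    print(f"theta = {theta:+.1f}: P(item 1) = {p1:.2f}, P(item 2) = {p2:.2f} -> harder = {harder}")
# At low theta, item 1 is less likely to be endorsed (appears harder); at high
# theta, item 2 is less likely (appears harder). Equal discriminations (Rasch)
# would keep the ordering constant across theta.
```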

IRT Assumptions

Prior to estimating an IRT model, it is important to evaluate its underlying assumptions. Two of the assumptions (monotonicity and unidimensionality) are often evaluated as part of classical test theory scale evaluation, as noted previously. The assumption of monotonicity, which relates to the assumption of correct model specification, is met if the probability of endorsing each response category increases with the person’s location on the attribute and the categories representing greater levels of the attribute require higher levels of the attribute in order to have a higher probability of being selected.

The assumption of unidimensionality for items in a scale is made so that a person’s level on the underlying construct accounts fully for her responses to the items in the scale; it occurs when the correlation is absent or trivial between any pair of items for fixed or given level of the attribute (also known as local independence). To satisfy this assumption, one can fit a factor-analytic model to the data to determine the extent to which there is sufficient unidimensionality. If the model fits the data well and there are no noteworthy residual correlations (i.e., no such correlations greater than or equal to 0.20), it provides support for the unidimensionality of the items in the scale.

Sample Size

Item response theory models, especially two- and three-parameter models, usually require large samples to obtain accurate and stable parameter estimates, although the one-parameter (Rasch) model may be estimable with more moderate samples. Several factors are involved in sample size estimation and no definitive answer can be given.9

First, the choice of IRT model affects the required sample size. One-parameter (Rasch) models involve the estimation of the fewest parameters and thus smaller sample sizes are needed, relative to two-parameter and three-parameter models, in order to obtain stable parameter estimates on item difficulty and person location.

Second, the type of response options influences the required sample size. In general, as the number of response categories increases, a larger sample size is warranted, as more item parameters must be estimated.

It has been suggested that sample sizes of at least 200 are needed for the one-parameter (Rasch) IRT model for binary items.18 At this sample size, standard errors of item difficulty are in the range of 0.14–0.21 [based on 2/√n < standard error < 3/√n, where n is the sample size].19 Another suggestion is that, for item difficulty (and person measure) calibration to be within one logit of a stable value with 95% confidence, a sample size as small as 30 subjects would suffice in a Rasch model for dichotomous items (a larger sample size is needed for polytomous items).20 To be within one logit of a stable value for a binary item targeted to have a probability of endorsement of 50% means that the true probability of endorsing the item can be as low as 27% and as high as 73%, a wide range. The challenge with Rasch analysis is that, because it requires model-based estimation, it needs a sizeable sample in order to yield stable results. This has led some researchers to conclude that results of Rasch analyses on small sample sizes have the potential to be misleading and that such analyses are therefore not recommended.21 For two-parameter (e.g., graded response) models, a sample size of at least 500 is recommended.18 While at least 500 is ideal, a much smaller sample could still provide useful information depending on the scale’s properties and composition.
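The quoted range of standard errors follows directly from the rule of thumb cited above, as the snippet below shows for several sample sizes.

```python
import math

# Rough bounds on the standard error of Rasch item difficulty estimates,
# following the rule quoted in the text: 2/sqrt(n) < SE < 3/sqrt(n).
for n in (30, 100, 200, 500):
    lower, upper = 2 / math.sqrt(n), 3 / math.sqrt(n)
    print(f"n = {n:>3}: SE roughly between {lower:.2f} and {upper:.2f} logits")
# For n = 200 this gives about 0.14 to 0.21 logits, the range cited above.
```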

In general, the ideal situation is to have adequate representation of respondents for each combination of all possible response patterns across a set of items, something that is rarely achieved. It is important, though, to have at least some people respond to each of the categories of every item to allow the IRT model to be fully estimated.

Third, study purpose can affect the necessary sample size. A large sample size is not needed to obtain a clear and unambiguous picture of response behavior or trends – and therefore a large sample size is not generally needed in the instrument development stage to demonstrate content validity – provided that a heterogeneous sample is obtained that accurately reflects the range of population diversity inherent in item and person responses. If the purpose is to obtain precise measurements on item characteristics and person scores, with stable item and person calibrations, sample sizes in the hundreds are generally required; one recommendation is sample sizes over 500.22

Fourth, the sample distribution of respondents is another important consideration. Ideally, respondents should be spread fairly uniformly over the range of the attribute (construct) being measured. If fewer people are located at the ends of the attribute, items also positioned at the extreme end of the construct will have higher standard errors associated with their parameters.

Fifth, measures with more items may require larger sample sizes. Additional items increase the possibility that the parameters of any one item need a larger sample in order to be adequately estimated.

Finally, if the set of items in a questionnaire has a poor, or merely a modest, relationship with the attribute – which is not unexpected during the content validity stage of instrument development – a larger sample size would be needed as more information is needed to compensate for the smaller size of the relationship. If the relationship of items with the attribute is small, however, the credibility of the scale or at least some of its items should be called into question.

Discussion and Conclusion

Classical test theory and item response theory provide useful methods for assessing content validity during the early development of a PRO measure. Item response theory requires several items so that there is adequate opportunity to have a sufficient range for levels of item difficulty and person attribute. Single-item measures, or too few items, are not suitable for IRT analysis (or, for that matter, for some analyses in classical test theory). For IRT and classical test theory, each item should be distinct from the others yet should be similar and consistent with them in reflecting all important respects of the underlying attribute or construct.23 For example, a high level of cognitive ability (the attribute of interest) implies that the patient also has high levels of items constituting cognitive ability such as vocabulary, problem solving, mathematical ability, and other items indicative of high cognitive ability.

Item response theory can provide information above and beyond classical test theory but estimates from IRT models require an adequate sample size. Sample size considerations for IRT models are not straightforward and depend on several factors, such as the number of items and response categories.9 Small sample sizes should be discouraged when fitting IRT models because their model-based algorithms are not suited for small samples. Rasch models involve the estimation of the fewest parameters and thus smaller sample sizes are needed, relative to two-parameter models, in order to obtain stable parameter estimates on item difficulty and person location. That said, even Rasch models require a large enough sample to achieve reliable results for deciding what items to include and what response categories to revise.21

If a researcher has only a small data set and wants to obtain preliminary information about the content validity of the instrument, then descriptive assessments using classical test theory should be the first step. As the sample size grows during subsequent stages of instrument development, confidence in the numerical estimates from Rasch and other IRT models (as well as those of classical test theory) would also grow. In later stages of PRO development, researchers could strive for a sample of, say, 500 individuals for full psychometric testing. If the construct of interest is well-defined and responses are sufficiently dispersed along the trait continuum, considerably smaller sample sizes may be sufficient.

A Rasch model may be more amenable for the developmental stages of PRO measures than other IRT models, because of its item and person fit indices, person-item map, and smaller sample size requirement. Compared with classical test theory, a Rasch model (and other IRT models) provides the distinct benefit of a person-item map. The visual appeal of this map enriches understanding and interpretation in suggesting to what extent the items cover the targeted range of the underlying scale and whether the items align with the target patient population.

In summary, this paper presents an overview of classical test theory and item response theory in the quantitative assessment of items and scales during the content validity phase of patient-reported outcome measures. Depending on the particular type of measure and the specific circumstances, either approach or both approaches may be useful to help maximize the content validity of PRO measures.

Acknowledgments

The authors gratefully acknowledge comments from Dr. Stephen Coons (Critical Path Institute) on earlier drafts of this paper and also the comprehensive set of comments from two anonymous reviewers, all of which improved the quality of the paper.

Footnotes

Conflict of Interest Statement

JCC is an employee and a shareholder of Pfizer Inc. The opinions expressed here do not reflect the views of Pfizer Inc or any other institution. RDH was supported in part by funding from the Critical Path Institute and grants from AHRQ (2U18 HS016980), NIA (P30AG021684) and NIMHD (2P20MD000182). The Critical Path Institute itself was not involved with the intellectual content of this manuscript, the writing of the manuscript, or the decision to submit the manuscript for publication. JJL declares no conflict of interest.


References

1. Food and Drug Administration (FDA). Guidance for industry. Patient-reported outcome measures: use in medical product development to support labeling claims. Federal Register. 2009;74(235):65132–65133.

2. Patrick DL, Burke LB, Gwaltney CJ, Leidy NK, Martin ML, Molsen E, Ring L. Content validity - establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force Report: Part 1 - eliciting concepts for a new PRO instrument. Value in Health. 2011;14:967–977.

3. Patrick DL, Burke LB, Gwaltney CJ, Leidy NK, Martin ML, Molsen E, Ring L. Content validity - establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force Report: Part 2 - assessing respondent understanding. Value in Health. 2011;14:978–988.

4. Strauss ME, Smith GT. Construct validity: advances in theory and methodology. Annual Review of Clinical Psychology. 2009;5:1–25.

5. Anastasi A, Urbina S. Psychological Testing. 7th ed. Upper Saddle River, New Jersey: Prentice Hall; 1997.

6. Hays RD, Farivar SS, Liu H. Approaches and recommendations for estimating minimally important differences for health-related quality of life measures. Journal of Chronic Obstructive Pulmonary Disease. 2005;2:63–67.

7. Hays RD, Wang E. Multitrait Scaling Program: MULTI. Proceedings of the Seventeenth Annual SAS Users Group International Conference; 1992. pp. 1151–1156.

8. Hays RD, Fayers P. Evaluating multi-item scales. In: Fayers P, Hays RD, editors. Assessing Quality of Life in Clinical Trials: Methods and Practice. 2nd ed. Oxford: Oxford University Press; 2005. pp. 41–53.

9. Cappelleri JC, Zou KH, Bushmakin AG, Alvir JMJ, Alemayehu D, Symonds T. Patient-Reported Outcomes: Measurement, Implementation and Interpretation. Boca Raton, Florida: Chapman & Hall/CRC Press; 2014.

10. Thissen D, Wainer H. Some standard errors in item response theory. Psychometrika. 1982;47:397–412.

11. Tabachnick BG, Fidell LS. Using Multivariate Statistics. 3rd ed. New York: Harper Collins; 1996.

12. MacCallum RC, Widaman KF, Zhang S, Hong S. Sample size in factor analysis. Psychological Methods. 1999;4:84–99.

13. Brown TA. Confirmatory Factor Analysis for Applied Research. New York, NY: The Guilford Press; 2006.

14. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Medical Care. 2000;38(Suppl):II-28–II-42.

15. Hays RD, Bjorner JB, Revicki DA, Spritzer KL, Cella D. Development of physical and mental health summary scores from the Patient-Reported Outcomes Measurement Information System (PROMIS) global items. Quality of Life Research. 2009;18:873–880.

16. Wilson M. Constructing Measures: An Item Response Modeling Approach. Mahwah, New Jersey: Lawrence Erlbaum Associates; 2005.

17. Bond TG, Fox CM. Applying the Rasch Model: Fundamental Measurement in the Human Sciences. 2nd ed. Mahwah, New Jersey: Lawrence Erlbaum Associates; 2007.

18. Reeve B, Fayers P. Applying item response theory modelling for evaluating questionnaire item and scale properties. In: Fayers P, Hays RD, editors. Assessing Quality of Life in Clinical Trials. 2nd ed. New York, NY: Oxford University Press; 2005. pp. 53–73.

19. Wright BD, Stone MH. Best Test Design. Chicago, IL: MESA Press; 1979.

20. Linacre JM. Sample size and item calibration [or person measure] stability. Rasch Measurement Transactions. 1994;7(4):328. Available from: http://www.rasch.org/rmt/rmt74m.htm.

21. Chen WH, Lenderking W, Jin Y, Wyrwich KW, Gelhorn H, Revicki DA. Is Rasch model analysis applicable in small sample size pilot studies for assessing item characteristics? An example using PROMIS pain behaviour item bank data. Quality of Life Research. 2013 Aug; Epub ahead of print.

22. Embretson SE, Reise SP. Item Response Theory for Psychologists. Mahwah, New Jersey: Lawrence Erlbaum Associates; 2000.

23. Fayers PM, Machin D. Quality of Life: The Assessment, Analysis and Interpretation of Patient-reported Outcomes. 2nd ed. Chichester, England: John Wiley & Sons Ltd; 2007.
