Short and Precise Patient Self-Assessment of Heart Failure Symptoms Using a Computerized Adaptive Test
Background—Assessment of dyspnea, fatigue, and physical disability is fundamental to the monitoring of patients with heart failure (HF). A plethora of patient-reported measures exist, but most are too burdensome or imprecise to be useful in clinical practice. New techniques used for computer adaptive tests (CATs) may be able to address these problems. The purpose of this study was to build a CAT for patients with HF.
Methods and Results—Item banks of 74 queries (“items”) were developed to assess self-reported physical disability, fatigue, and dyspnea. All queries were administered to 658 adults with HF to build 3 item banks. The resulting HF-CAT was administered to 100 ambulatory patients with HF (New York Heart Association I, 11%; II, 53%; III and IV, 36%). In addition, the physical function and vitality domains of the SF-36 Health Survey questionnaire, an established shortness-of-breath scale, and the Minnesota Living with Heart Failure Questionnaire were applied. The HF-CAT assessment took 3:09±1:52 minutes to complete and score. All HF-CAT scales demonstrated good construct validity through high correlations with the corresponding SF-36 Health Survey physical function (r=−0.87), vitality (r=−0.85), and shortness-of-breath (r=0.84) scales. Simulation studies showed a more precise measurement of all HF-CAT scales over a larger range than comparable static tools. The HF-CAT scales identified significant differences between patients classified by the New York Heart Association symptom criteria, similar to the Minnesota Living with Heart Failure Questionnaire.
Conclusions—A new CAT for patients with HF was built using modern psychometric methods. Initial results demonstrate its potential to increase the feasibility and precision of patient self-assessments of symptoms of HF with minimized respondent burden.
Clinical Trial Registration—URL: http://www.projectreporter.nih.gov. Unique identifier: 1R43HL083622-01.
The cardinal manifestations of heart failure (HF) are dyspnea and fatigue, limited tolerance of physical activity, fluid retention, pulmonary congestion, and peripheral edema. Therefore, HF is a clinical diagnosis that is largely based on physical examination and a careful history of typical subjective symptoms in the presence of cardiac dysfunction.1 A patient-centered measurement approach is particularly important in HF, to provide clinicians with tools that help them monitor the syndrome, compare improvements under different forms of therapy, and identify risk of deterioration. The New York Heart Association (NYHA) classification has been used for this purpose, but it has been criticized for its questionable reliability2,3 and is rarely used outside clinical studies or specialized units.
Generally, patient self-assessments have been the more reliable assessments of subjective symptoms, which is one reason for a growing interest in subjective health status measures from the scientific community, clinical practitioners, and industry.4,5 Self-assessed symptoms are used to predict declines in health status of patients with HF,6 total expenses for HF care,7 hospitalization, or even mortality.8,9 Their widespread use has been recommended to increase quality of care,10 and 30% of all new drug developments use patient-reported outcomes (PROs) as their primary or coprimary end point.11
However, with traditional methods, a comprehensive and reliable “static” measure is likely to be long and time-consuming to administer and score. If questionnaire data need to be analyzed manually, assessments become cost prohibitive for use in routine clinical practice, and individual patient reports cannot be provided in a timely manner. Short forms limit the respondent burden but often show more ceiling or floor effects and lack the precision required at the individual patient level.12,13 Measurement precision to guide individual decision making must be substantially higher than for group comparisons, because true change must be separated from measurement error for every single assessment.13 For example, if a CI of 95% is required, a traditional tool with good psychometric properties for group comparisons (eg, Cronbach α=0.80) would only allow for interpretation of score differences of almost 1 SD when used for an individual.14 Moreover, classic psychometric methods cannot be used to determine the measurement precision for an individual measurement. As a result, none of the existing tools has become a standard measure in clinical practice.15,16 Enhancing the precision, accessibility, and interpretability of PRO measures could make HF management more efficient and effective in meeting patient care needs.
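The arithmetic behind the "almost 1 SD" figure can be sketched as follows. This is one plausible reading, assuming the classical test theory standard error of measurement, SEM = SD × sqrt(1 − reliability), and a 95% interval of ±1.96 SEM around a single observed score. The function name is hypothetical and not taken from the study.

```python
import math


def individual_ci_half_width(reliability: float, sd: float = 1.0, z: float = 1.96) -> float:
    """Half-width of the confidence interval around one observed score.

    Assumes the classical test theory relation SEM = SD * sqrt(1 - reliability).
    With sd=1.0 the result is expressed in SD units.
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return z * sem


# Reliability of 0.80, as in the example in the text (result in SD units).
half_width = individual_ci_half_width(0.80)
print(f"95% CI half-width: {half_width:.2f} SD")
```

Under these assumptions the half-width comes to roughly 0.88 SD, which agrees with the statement that only score differences of almost 1 SD can be interpreted for an individual.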
In the present study, we applied computerized adaptive testing (CAT) methods, a measurement technology17 that is used widely in educational testing.18 We aimed to build a system that will allow routine, comprehensive assessment of pathognomonic symptoms. The use of CAT techniques also promises to provide more precise measures, with fewer items, and an effective resolution to the classic conflict between practicality and precision faced by traditional measurement methods.12 The CATs tailor each assessment to the individual's status on what is being measured, applying only items that are most appropriate for her or his current health status. Responses to each CAT item direct the choice of the next CAT item toward the one that is most informative for this particular assessment. A patient indicating higher levels of disability within the first questions would subsequently be asked only questions that target this level of ability. Omitting uninformative items that are not relevant for a given functional limitation focuses the assessment, decreases the respondent burden, and increases the measurement precision achievable with a given number of items.
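The adaptive logic described above can be illustrated with a minimal sketch. It assumes a dichotomous two-parameter logistic (2PL) model, maximum-information item selection, and expected a posteriori (EAP) scoring on a standard normal prior. The HF-CAT items are polytomous and the study's actual model and software are not described at this level, so every function name, parameter value, and default here is hypothetical.

```python
import math


def prob_endorse(theta, a, b):
    """2PL probability of endorsing an item (a: slope, b: location)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at a given theta."""
    p = prob_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


def estimate_theta(responses, grid):
    """EAP score and posterior SD, using a standard normal prior on a grid."""
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior
        for (a, b), x in responses:
            p = prob_endorse(t, a, b)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answer, se_target=0.33, max_items=7):
    """Administer items until the SE target or the item limit is reached.

    bank: dict mapping item name -> (slope, location); answer: callable that
    returns 1 (endorsed) or 0 for an item name. se_target is on the theta
    metric (SD = 1), an assumed analogue of SE < 3.3 on a metric with SD = 10.
    """
    grid = [i / 10.0 for i in range(-40, 41)]
    remaining = dict(bank)
    responses, used = [], []
    theta, se = 0.0, 1.0  # start at the prior mean and prior SD
    while remaining and len(used) < max_items and se >= se_target:
        # Select the unused item that is most informative at the current estimate.
        name = max(remaining, key=lambda k: item_information(theta, *remaining[k]))
        params = remaining.pop(name)
        responses.append((params, answer(name)))
        used.append(name)
        theta, se = estimate_theta(responses, grid)
    return theta, se, used


# Hypothetical bank: item names and parameters are made up for illustration.
demo_bank = {
    "light_housework": (1.8, -1.0),
    "run_errands": (2.2, 0.0),
    "walk_one_block": (2.0, 1.0),
    "walk_room_to_room": (2.0, 2.0),
}
score, error, administered = run_cat(demo_bank, lambda name: 1)
print(administered, round(score, 2), round(error, 2))
```

Because each response shifts the provisional estimate, a respondent who endorses the first limitation is routed toward items located at higher levels of impairment, which is the tailoring behavior the paragraph describes.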
The CATs select the items out of a larger item bank representing the entire range of the construct being measured. Most of the item banks are built on the principles of the Item-Response Theory (IRT). The National Institutes of Health are intensively promoting the use of these methods to develop a comprehensive Patient-Reported Outcomes Measurement Information System (PROMIS) as part of their roadmap initiatives (http://nihroadmap.nih.gov/). The authors of this article are part of the PROMIS initiative, which aims to provide a standard assessment for generic health status measures in the near future.19
The goal of this study was to develop CATs for dyspnea, fatigue, and physical function for the assessment of patients with HF and to evaluate their acceptability, precision, and validity.
Development of the Items
After review of the relevant literature, we developed a set of 74 patient questions (items) covering the 3 primary physical impairments commonly reported by patients with HF: physical function/disability (24 items), dyspnea (30 items), and vitality/fatigue (20 items). The queries were designed to be short enough to fit on a portable telephone screen for home assessments (Figure 1). Items were selected to represent the entire continuum of each aspect of HF from no to severe impairment. All 3 item banks have been scored in the direction that higher scores indicate more impairment (ie, physical disability, fatigue, and dyspnea).
The item bank development was performed separately for each of the 3 domains of physical function, dyspnea, and fatigue, following the same procedures as described in previous studies.20,21 After the item banks were developed, we used them as a basis for a CAT. A new software solution was developed to work on a Personal Digital Assistant. The CAT logic can be set to stop after the measurement reaches a particular precision or after a maximum of items is administered. For this study phase, the CAT was set to assess each of the 3 different domains with an SE <3.3 (corresponding to a reliability of Cronbach α >0.90 for samples with an SD of 10) or a maximum of 7 items per scale.
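The stopping rule and the link between SE and reliability can be sketched as below. The conversion reliability = 1 − (SE/SD)² is a common convention and is assumed here; the source does not state the formula the authors used. All function names are hypothetical.

```python
import math


def reliability_from_se(se: float, sd: float = 10.0) -> float:
    """Reliability implied by a standard error on a metric with the given SD."""
    return 1.0 - (se / sd) ** 2


def se_for_reliability(reliability: float, sd: float = 10.0) -> float:
    """Standard error needed to reach a target reliability."""
    return sd * math.sqrt(1.0 - reliability)


def should_stop(se: float, items_given: int, se_target: float = 3.3, max_items: int = 7) -> bool:
    """Stop when the score is precise enough or the item limit is reached."""
    return se < se_target or items_given >= max_items


print(round(reliability_from_se(3.3), 3))
print(round(se_for_reliability(0.90), 2))
print(should_stop(se=3.1, items_given=4), should_stop(se=4.0, items_given=7))
```

Under this convention an SE of 3.3 on a metric with an SD of 10 corresponds to a reliability of about 0.89, and a reliability of exactly 0.90 corresponds to an SE of about 3.16, so the threshold quoted in the text is best read as approximate.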
The data for the CAT item bank development sample were collected via the Internet from English-speaking adults with HF. All respondents were recruited by YouGov. YouGov uses a method called sample matching for the selection of study samples from pools of opt-in respondents.22 Sample matching starts with an enumeration of the target population. For patient recruitments, the target population is all adults with similar sociodemographic characteristics, such as patients with a particular condition, as enumerated in consumer databases (eg, maintained by Acxiom, Experian, and InfoUSA). Then, a random sample is drawn from the target population. Finally, for each member of the target sample, a matching member of the Internet pool of opt-in respondents is selected, resulting in a “matched sample.” Matching was based on age, sex, and race. The resulting matched sample has similar characteristics to the target population and will have similar properties to a true random sample. For this study, 14 028 adults were approached until the target number of patients with HF was enrolled. All newly developed items were administered randomly.
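The matching step can be illustrated with a simplified sketch. It assumes exact matching on sex and race and nearest-neighbor matching on age, without replacement. The vendor's actual matching algorithm is not described in the text, so this is an assumption for illustration only, and the record fields are hypothetical.

```python
def match_sample(target_sample, panel, exact_keys=("sex", "race"), age_key="age"):
    """For each target member, pick the closest unused opt-in panel member.

    Simplified assumption: exact match on exact_keys, then the smallest
    absolute age difference. Each panel member is used at most once.
    """
    available = list(panel)
    matched = []
    for person in target_sample:
        candidates = [
            p for p in available
            if all(p[k] == person[k] for k in exact_keys)
        ]
        if not candidates:
            continue  # no eligible panel member for this target member
        best = min(candidates, key=lambda p: abs(p[age_key] - person[age_key]))
        available.remove(best)
        matched.append(best)
    return matched


# Hypothetical records, not study data.
target = [{"age": 60, "sex": "F", "race": "A"}]
panel = [
    {"age": 30, "sex": "F", "race": "A"},
    {"age": 58, "sex": "F", "race": "A"},
    {"age": 59, "sex": "M", "race": "A"},
]
print(match_sample(target, panel))
```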
The same data collection method and vendor were used for many similar projects, including a National Institutes of Health roadmap initiative for the development of generic PRO tools (http://www.nihpromis.org). To ensure a sufficient distribution of responses for the item parameter estimation, we used a quota of one third of patients each with minor, medium, and severe impairment based on 1 screening question describing the level of impairment analogous to the NYHA classification (I, II, and ≥III).
To help ensure the quality of the data, we excluded respondents if (a) their average answering time per item was <5 seconds, (b) they did not indicate that they had HF and 1 underlying cause for HF, (c) they did not indicate that the HF diagnosis was given by a physician, (d) their last visit to a physician was >6 months ago, or (e) their current medication did not include at least 1 drug used for the treatment of HF (diuretics, angiotensin-converting enzyme inhibitors or angiotensin II receptor blockers, β-blockers, or digoxin).
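The five exclusion criteria can be expressed as a simple screening function. The record layout and field names below are hypothetical, and criterion (b) is read here as requiring at least 1 reported underlying cause, which is an assumption.

```python
from dataclasses import dataclass

# Drug classes named in the text; the string labels are hypothetical.
HF_DRUG_CLASSES = {"diuretic", "ace_inhibitor", "arb", "beta_blocker", "digoxin"}


@dataclass
class Respondent:
    mean_seconds_per_item: float
    reports_hf: bool
    hf_causes: list
    diagnosed_by_physician: bool
    months_since_last_visit: float
    medication_classes: set


def exclusion_reasons(r: Respondent) -> list:
    """Return the letters of all exclusion criteria that apply (empty = keep)."""
    reasons = []
    if r.mean_seconds_per_item < 5:
        reasons.append("a")  # answered implausibly fast
    if not (r.reports_hf and len(r.hf_causes) >= 1):
        reasons.append("b")  # no HF or no underlying cause reported
    if not r.diagnosed_by_physician:
        reasons.append("c")  # diagnosis not given by a physician
    if r.months_since_last_visit > 6:
        reasons.append("d")  # last physician visit too long ago
    if not (r.medication_classes & HF_DRUG_CLASSES):
        reasons.append("e")  # no HF medication reported
    return reasons


kept = Respondent(12.0, True, ["coronary heart disease"], True, 2, {"diuretic"})
print(exclusion_reasons(kept))
```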
To examine the characteristics of the HF-CAT, different simulation studies were conducted, as previously described.20,23 These analyses are based on the real data provided for all items in the bank by the patients in the online survey. Only small subsets of those item responses are used to estimate the patient score for the CAT simulation (in IRT terms, called “θ score”). The quality of the items in the bank defines the precision of the score at different ranges. The “test information curve” identifies floor and ceiling effects and if the measurement range of the tool fits to the symptoms of the sample. To illustrate this for the HF-CAT, the precision of the score estimate was plotted as a function of the patient scores.20
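How item quality translates into score precision at different levels of the construct can be sketched as follows. The example assumes a dichotomous 2PL model, in which the standard error at a given θ is the inverse square root of the summed item information. The study's items are polytomous, so the model and all parameter values here are illustrative assumptions.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a 2PL item (a: slope, b: location) at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def standard_error_curve(items, thetas):
    """Return (theta, SE) pairs, where SE = 1 / sqrt(test information)."""
    curve = []
    for theta in thetas:
        info = sum(item_information(theta, a, b) for a, b in items)
        curve.append((theta, 1.0 / math.sqrt(info)))
    return curve


# Hypothetical item parameters spread over the measurement range.
bank = [(2.0, -1.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
for theta, se in standard_error_curve(bank, [-3.0, -1.5, 0.0, 1.5, 3.0]):
    print(f"theta={theta:+.1f}  SE={se:.2f}")
```

In such a curve, regions where the SE rises sharply mark the floor and ceiling of the bank, that is, levels of symptoms for which no well-targeted items are available.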
To evaluate the construct validity of the HF-CAT, items from the following established tools were also included in the data collection: the SF-36 Health Survey scales for physical functioning and vitality,24 4 items from the Medicare Health Outcomes Survey to assess shortness of breath,25 and the Minnesota Living with Heart Failure Questionnaire26 (MLHFQ, 21 items) as a legacy tool for measuring HF as indicated by patients' perceptions of its overall effects on their lives.
A separate sample of 100 consecutive participants was recruited for the validity test conducted at the HF clinic of the Montefiore Medical Center, Bronx, NY. The clinic was selected because it usually does not use PRO assessments and predominantly serves a low-income diverse population. We considered this environment as particularly challenging to test a new technology, assuming relatively low health literacy levels. In addition, we believed that an evaluation of psychometric properties would be more relevant in a less educated sample, because the validity of the IRT assumptions has been already evaluated in the developmental sample, which was affluent and well educated (Table 1). Patients with previously diagnosed HF were invited to participate in the study. Consenting participants were asked to complete the actual HF-CAT on a hand-held computer (personal digital assistant [PDA]) and a series of paper-and-pencil assessments, including sociodemographic questions, the MLHFQ, and a survey evaluation of the experience with the HF-CAT. All participants completed both instruments. Participants were randomly assigned to 1 of 2 groups within a crossover design, in which the order of presentation of the HF-CAT assessment and the MLHFQ was counterbalanced. Patients were placed in the waiting area and asked to follow the standard instructions provided for each measure.
Medical information, including NYHA class, was extracted from the medical files. The NYHA class is determined routinely for all patients at every visit at the Montefiore Medical Center Heart Failure Clinic based on the clinical assessment of the treating physician. The NYHA class was determined without knowledge of the results of patient self-assessments. Patients gave written informed consent and received a $25 incentive for their participation in the study.
After applying the inclusion and exclusion criteria, the final item bank development sample consisted of 658 participants, aged 60±13 years (49% female), who had experienced HF for 8.8±7.9 years (Table 1). Patients reported the following conditions in addition to their HF: 43%, coronary heart disease; 42%, previous heart attacks; 18%, cardiomyopathy; 14%, valvular heart disease; 5.2%, rheumatic fever; 60%, hypertension; 31%, arrhythmias; and 40%, diabetes. Alcohol abuse was reported by 5.9% of patients.
The Montefiore Medical Center clinical sample (n=100) was predominantly male (62%), with a mean age of 58 years. The sample was diverse, including mostly African American patients and many Hispanics. One third of the population had a comparatively low household income. The severity of their HF symptoms, assessed by the NYHA classification, was 11% in class I, 53% in class II, and 36% in class III or IV.
Item Banks Development
In the final calibrated item banks, there were 21 items assessing physical disability, 20 items assessing fatigue, and 29 items in the dyspnea bank with satisfactory item fit (Table 2). The most informative items (ie, those with a high discrimination parameter, or “slope”) were the item asking about the ability to run errands, an item referring to a feeling of being “worn out,” and the item asking whether the patient becomes short of breath when walking from one room to another.
The precision of every score estimate can be displayed as a function of the level of function or the severity of the symptoms. The results of the simulation studies showed that a highly precise score (comparable to an internal consistency of α>0.90) can be estimated with 5 items for each domain over a range of nearly 3 SDs (Figure 2, left).
The concordance between the results of the CATs and the entire item bank was good for all of the constructs, as illustrated by the extremely high correlations (r=0.95–0.97), showing that the 5-item CAT can essentially capture the information provided by the entire bank. As expected, there were high correlations between the simulated CAT scale scores and the corresponding SF-36 Health Survey's physical function (r=–0.87) and vitality (r=–0.84) scales, as well as the static shortness of breath measurement (r=0.83). Compared with all legacy tools, the HF-CAT provides a more precise measurement over a larger measurement range (Figure 2, right). For physical disability, a measurement precision similar to that of the SF-36 physical function scale can be achieved with half the number of items (Figure 2, top left).
On average, 4 to 5 items were administered for the assessment of physical disability, fatigue, and dyspnea to achieve the predefined level of precision (Table 3). The average time for administration of the entire HF-CAT with all 3 domains was 3 minutes (3±2 minutes).
We used the MLHFQ to help evaluate the constructs of the HF-CAT and the NYHA class to evaluate its discriminative validity (Table 3). The mean MLHFQ score of the sample was 38±25, and the mean scores of the HF-CAT were 59.6±8.4 for physical disability, 52.6±8.5 for fatigue, and 54.8±13.3 for dyspnea. There were no order effects for any measure. The HF-CAT scales for physical disability, fatigue, and dyspnea correlated significantly with the MLHFQ total score (r=0.71, r=0.63, and r=0.68, respectively).
A general linear model was used to evaluate the ability of the HF-CAT scales to statistically differentiate patients with different levels of symptom severity, as measured by the clinician's NYHA classification (Table 3). The main effects for all the measures were significant, with similar discriminative ability (η², F values) for the HF-CAT physical disability and dyspnea scales and the MLHFQ scale.
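For a single grouping factor such as NYHA class, the F value and eta squared can be computed as in the sketch below. It assumes a one-way design without covariates, and the scores are invented for illustration rather than taken from Table 3.

```python
def one_way_anova(groups):
    """Return (F, eta_squared) for a list of groups of scores."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variability explained by group membership vs. variability within groups.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, eta_squared


# Hypothetical scale scores for three severity groups (not study data).
nyha_groups = [[48, 52, 50], [55, 57, 59], [63, 66, 61]]
f_value, eta_squared = one_way_anova(nyha_groups)
print(f"F={f_value:.1f}, eta squared={eta_squared:.2f}")
```

Eta squared is the share of total score variance attributable to the grouping factor, which makes it suitable for comparing the discriminative ability of scales that differ in length and metric.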
Because this study took place in a low-income, less educated, minority population, we were particularly interested in the subjective user experience with a computer assessment. Of the patients, 98% found the HF-CAT assessment overall very easy or easy, 100% thought it was very easy or easy to follow the instructions, and 95% said it was very easy or easy to read the questions on the screen. In addition, of the patients, 98% judged the time for the assessment as “just right,” and 90% considered the questions as relevant; 98% were willing to use the device again on the next visit.
For the first time, to our knowledge, we applied computerized adaptive testing methods to develop and evaluate an ultrashort assessment system for patients with HF (HF-CAT) in clinical practice. The tool allows routine, comprehensive assessment of 3 primary problems that are commonly experienced by patients with HF. If the emotional or social impact of the disease is of additional interest, further tools (eg, from the PROMIS) need to be added for a comprehensive coverage of the health-related quality-of-life construct.
The feasibility of the HF-CAT in its PDA version was evaluated in a low-income, low-educated, minority population in the Bronx. The HF-CAT is a practical and well-accepted tool. Nevertheless, it was tested under study conditions, and participants might have been biased receiving an incentive for their participation. To our knowledge, only 1 report about the acceptance of CATs within clinical practice settings is available. A similar CAT, also being displayed on a PDA, has been in routine clinical use since 2004. Patients answering this CAT also report a high acceptability. Almost all of the 423 consecutive patients considered the handling as easy and believed that the use of the PDA made sense.27
Several other studies report about the reception of CATs under study conditions. Most patients in a feasibility test of a pain CAT found the CAT application to be useful, relevant, of appropriate length, and easy to complete.28 Similarly, most respondents in a feasibility study of an asthma impact CAT found it easy to complete and of appropriate length.29 The results of a feasibility test of a diabetes CAT gave somewhat mixed results. Although both English- and Spanish-speaking participants agreed that a paper-and-pencil assessment was more burdensome than a CAT, the Spanish-speaking participants preferred the paper tool and were more willing to complete a paper tool in the future.30
One important contribution of the CAT technology will be to reduce the respondent burden without compromising the precision and validity of the assessment, by tailoring each assessment to the patient's condition. This advantage was previously demonstrated in a simulation study of the Activities of Daily Living CAT, which found that the CAT provided similar results to a static version while reducing the number of items administered by 50%.31 Results from other studies indicate that scores similar to those obtained with full-length item banks (ranging in length from 18–585 items) can be achieved through much shorter CATs when measuring functional status,32–34 mental health status,21,27,35,36 or the impact of conditions, such as headache,23,37 diabetes,30 chronic pain,28 and asthma.29 Most actual CAT applications used between 5 and 7 items to measure 1 construct. The present HF-CAT applied between 4 and 5 items per scale, and the average total time for the entire assessment and scoring was 3 minutes (ie, 1 minute per scale, which could be applied individually). The assessment time of the MLHFQ electronically measured in a previous study was 4±2 minutes,38 and time to administer the Kansas City Cardiomyopathy Questionnaire, another common tool for the assessment of patients with HF, is reported to be 4 to 6 minutes without scoring.39
In summary, the HF-CAT provides a precise measure over a large measurement range with minimal respondent burden. As far as it is known today, it seems that CATs offer an effective resolution to the classic conflict between practicality and precision faced by traditional measurement technology.12
Studies of CAT applications in diseases, such as depression27,35 or headache,40 have shown that their measurement advantages can transfer to increased validity in identifying differences between groups known to differ in clinical characteristics, compared with static tools. The 3 scales of the HF-CAT also discriminated between groups of patients of different NYHA classification as well as a legacy tool measuring the impact of HF that uses 4 times more items. These initial results show that the HF-CAT has the potential to provide a valid, highly relevant assessment of patients with HF.
For the assessment of patients with HF, we believe it is important to assess the health status of the patient at the point of care and at the patient's home. Because many elderly patients do not have access to the Internet or are not familiar with its use, one way to do so is the use of a “smart phone” and/or interactive voice recognition. Most established tools include items that are not suitable to be used over the telephone. The IRT methods allow using much simpler items over the telephone and more comprehensive items at the physician's office, and scoring both assessments on the same measurement metric. This allows having a smart phone administer the HF-CAT at the patient's home and having the same patient answering the more comprehensive PROMIS-CAT on a tablet PC at the physician's office. The IRT-based measurements of health outcomes are independent of the particular items being administered and of the test administrator. The same value for the same domain yields the same interpretation, whereas results from different traditional tools cannot be compared directly, making serial health status monitoring less practicable.
Despite many encouraging findings with recent CAT developments, several issues still need to be addressed. Within this study, we have only used outpatients to evaluate the HF-CAT, which limits the generalizability to less severely disabled patients. However, one of the most relevant advantages of CATs is that they can essentially eliminate floor and ceiling effects by applying items tailored to the test taker. Our simulation studies have shown that the current item bank covers >3 SDs above the population mean, which is where a hospitalized population of patients with HF usually scores.
We did not evaluate the test-retest reliability for the HF-CAT. Similarly, we have not used the HF-CAT in an intervention study to test its responsiveness to treatments. However, several studies have reported on the ability of other CATs to detect change. For example, in a telephone study of 540 patients with headache, a CAT for headache impact was more responsive to self-evaluated changes of headache impact than a corresponding 54-item bank.23 In a longitudinal, prospective cohort study of 94 patients discharged from inpatient rehabilitation, the CAT version of the Activity Measure for Post-Acute Care was comparable in responsiveness to the 66-item static version.41 Similarly, in a series of articles, Hart et al33,34 report on the results of validation studies of condition-specific CATs, using large data sets from patients receiving rehabilitation services across multiple US clinics.
In summary, we have developed a promising method to measure patient-reported dyspnea, fatigue, and physical function for use in the care of patients with HF. This new measure is part of a rapidly growing number of new assessment tools using the advantages of item response theory and computerized adaptive test techniques,16,19,42 with some of them being used in clinical practice already.27,43 However, whether these encouraging improvements in measurement will transfer to improved care and ultimately health of patients with HF warrants further studies.
Sources of Funding
This study was supported in part by grant 1 R43 HL083622-01 from the National Institutes of Health/National Heart, Lung, and Blood Institute (Dr Rose).
The HF-CAT software was developed by QualityMetric Inc. Dr. J. Bjørner is an employee of this company.
- Received June 28, 2011.
- Accepted March 20, 2012.
- © 2012 American Heart Association, Inc.
- Hunt SA, Baker DW, Chin MH, Cinquegrani MP, Feldman AM, Francis GS, Ganiats TG, Goldstein S, Gregoratos G, Jessup ML, Noble RJ, Packer M, Silver MA, Stevenson LW, Gibbons RJ, Antman EM, Alpert JS, Faxon DP, Fuster V, Jacobs AK, Hiratzka LF, Russell RO, Smith SC Jr.
- Goldman L, Hashimoto B, Cook EF, Loscalzo A
- Lett HS, Blumenthal JA, Babyak MA, Sherwood A, Strauman T, Robins C, Newman MF
- Burke L
- Wainer H, Dorans NJ, Eignor D, Flaugher R, Green BF, Mislevy RJ, Steinberg L, Thissen D
- Rubin DB
- Ware JE Jr., Dewey J
- National Committee for Quality Assurance. Specifications for the Medicare Health Outcomes Survey: HEDIS. 6. Washington, DC: National Committee for Quality Assurance; 2004.
- Rector T, Cohn J
- Hart DL, Werneke MW, Wang YC, Stratford PW, Mioduski JE
- Haley SM, Fragala-Pinkham M, Ni P
Patient-reported outcome measures can assist clinicians in monitoring the effectiveness of their treatment of patients with heart failure (HF), comparing improvements under different forms of therapy, and identifying risk of deterioration. There are several questionnaires available to measure typical HF symptoms. However, established questionnaires are often too long or too imprecise for individual decision making. New computer adaptive tests (CATs) promise to provide more precise measures, with fewer items, and an effective resolution to the classic conflict between practicality and precision faced by traditional measurement methods. The CATs tailor each assessment to the individual's status on what is being measured, applying only items that are most appropriate for her or his current health status. We have developed a CAT for the assessment of 3 typical HF symptoms (HF-CAT): dyspnea, fatigue, and physical function. In-clinic tests confirmed the expected CAT advantages, including shorter surveys by eliminating questions not relevant to each patient, equal or better precision over a wide range of scores, and surveys that did not require a test administrator. In summary, we have developed a promising method to measure patient-reported symptoms for use in the care of patients with HF. This new measure is part of a rapidly growing number of new assessment tools using the advantages of item response theory and computerized adaptive test techniques.