Statistical Concepts Series |
1 From the Department of Research, American College of Radiology, 1891 Preston White Dr, Reston, VA 20191 (J.H.S.); Riley Hospital for Children, Indiana University Medical Center, Indianapolis (K.E.A.); and Department of Diagnostic Radiology, Yale University, New Haven, Conn (J.H.S.). Received August 10, 2003; revision requested August 19; revision received and accepted August 21. Address correspondence to J.H.S. (e-mail: jonathans@acr.org).
| ABSTRACT |
|---|
© RSNA, 2004
Index terms: Cancer screening; Efficacy study; Radiology and radiologists, outcomes studies; Technology assessment
| INTRODUCTION |
|---|
The goal of this article is to describe some of the rationale and the methods of technology assessment as applied to radiology. For any health care intervention, including diagnostic imaging tests, the ultimate questions are, "How much does this do to improve the health of people?" and "How much does it cost for that gain in health?" We need such an understanding of the radiology services we provide to advocate for our patients and to use our resources efficiently and effectively.
| OUTCOMES |
|---|
The most productive way to think about this gap between diagnostic accuracy and outcomes, and about the inclusion of relevant outcomes in the evaluation of diagnostic tests, is to use the six-level "hierarchy of efficacy" developed by Fryback and Thornbury (5,6) (Table). They point out that efficacy at any level in their hierarchy is necessary, but not sufficient, for efficacy at the next higher-numbered level. In their scheme, diagnostic accuracy is at level 2, and patient and societal outcomes are at levels 5 and 6, respectively. Thus, there may be "many a slip between cup and lip," that is, between diagnostic accuracy of an imaging test on the one hand and improved health and adequate cost-effectiveness on the other.
As the Table shows, there are multiple measures that can be used to quantify the efficacy of a diagnostic imaging test at any of the six levels. Hence, evaluations of imaging tests can involve a variety of measures. Thinking in terms of the hierarchy is also helpful for identification of the level(s) at which information should be obtained in an evaluation of a diagnostic imaging test. Experience, as well as reflection, has taught some lessons. The most important of these include:
1. Because higher-level efficacy is possible only if lower-level efficacy exists, it is often useful to measure efficacy at relatively low-numbered levels.
2. In particular, in the development of a test, it is helpful to measure aspects of technical efficacy (level 1), such as sharpness, noise level, and ability to visualize the anatomic structures of interest. An important aspect of test development consists of finding the technical parameters (voltage, section thickness, etc) that give the best diagnostic accuracy; these measures of technical efficacy are often key results in that process.
3. Diagnostic accuracy (level 2) is the highest level of efficacy that is characteristic of the test alone. For example, the sensitivity and specificity of a test are not dependent on what other diagnostic information is available, unlike level 3 (diagnosis). Also, the methodology and statistics used in measurement of diagnostic accuracy are relatively fully developed. Therefore, measurement of diagnostic accuracy is usually worthwhile.
4. Above diagnostic accuracy, effect on treatment (level 4), an "intermediate outcome," is relatively attractive to measure. It can be measured fairly easily and reliably in a prospective study, and it is closer in the hierarchy to the ultimate criteria, effect on patient health (level 5) and cost-effectiveness (level 6).
5. Effect on patient health (level 5) is usually observable only after a substantial delay, especially for chronic illnesses, such as cardiovascular disease and cancer, which are currently the predominant causes of mortality in the United States. Also, it is the end result of a multistep process of health care. Because diagnostic tests occur near the beginning of the process, and some random variation enters into the results at every step, the effect of a diagnostic test on final outcomes is usually difficult to observe without an inordinate number of patients. For example, the current principal randomized controlled trial of computed tomographic (CT) screening for lung cancer requires some 50,000 patients and is expected to take 8 years and cost $200 million (7). Thus, effects on patient health (level 5) and cost-effectiveness (level 6) are uncommon as end points in experimental studies on the evaluation of diagnostic tests.
| CLINICAL DECISION ANALYSIS AND COST-EFFECTIVENESS ANALYSIS |
|---|
Cost-effectiveness analysis recognizes that the results of care are rarely 0% and 100% outcomes but rather are probabilistic (14). It involves the creation of algorithms, usually displayed as decision trees, as shown in Figure 1, which incorporate probabilities of events and, often, the valuations (usually called "utilities") of possible outcomes of these events. Individual or population-based preferences for certain outcomes and treatments are factored into these utilities.
Defining the Problem
For any cost-effectiveness analysis, one of the most difficult tasks is defining the appropriate research question. The issues to address in defining the problem are the population reference case, strategies, time horizon, perspective, and efficacy (outcome) measures. The reference case is a description of the patient population the cost-effectiveness analysis is intended to cover. For example, the reference case for the cost-effectiveness analysis in Figure 1 consists of persons with acute abdominal pain seen in the emergency department.
The issue of strategies is, what are the care strategies that we should compare? Too many strategies may be confusing to compare. Too few may leave an analysis open to the suspicion that it has missed possibly superior strategies. The decision tree in Figure 1 compares costs and outcomes of a clinical examination versus an imaging test for the diagnosis of acute appendicitis; in a fuller model, ultrasonography (US) and CT might be considered separate imaging strategies. In general, cost-effectiveness analysis and decision analysis address whether a new diagnostic test or treatment strategy should replace the current standard of care, in which case the current standard and the proposed new approach are the strategies to include. Alternatively, often the issue is which of a series of tests or treatments is best, and these then become the strategies to include.
The time horizon for which the cost-effectiveness analysis model is used to evaluate costs, benefits, and risks of each strategy must be stated and explained. Sometimes, the time horizon may be limited because of incomplete data, but this creates a bias against strategies with long-term benefits.
Finally, cost-effectiveness analysis allows costs to be counted from different perspectives. The perspective might be that of a third-party payer, in which case only insurance payments count as costs, or that of society, in which case all monetary costs, including those paid by the patient, count, and so (at least in some analyses) do nonmonetary costs, such as travel and waiting time involved in obtaining care.
Building the Cost-Effectiveness Analysis Model
Cost-effectiveness analysis is usually based on a decision tree, a visual representation of the research question (Fig 1). These decision trees are created and analyzed with readily available computer software, such as DATA (TreeAge Software, Williamstown, Mass). The tree incorporates the choices, probabilities of events occurring, outcomes, and utilities for each strategy being considered. Each branch of the tree must have a probability assigned to it, and each path in the tree must have a cost and outcome assigned. Data typically come from direct studies of varying quality, from expert opinion (which is usually unavoidable because some needed data values cannot be obtained in any other way), and from some less directly relevant literature. For example, in Figure 1, the probability of a positive test result may be selected from published literature and added to the decision tree under the branch labeled "Positive Test/Surgery." Costs are frequently not ascertained directly, but rather are estimated by using proxies such as Medicare reimbursement rates or the charge and/or cost data of a hospital. Building the decision tree requires experience and judgment.
The complexity of cost-effectiveness analysis sometimes makes it difficult to understand and therefore undervalued (14,15). One way to improve understanding and allow readers to judge for themselves the value of a cost-effectiveness analysis model is to be explicit about the assumptions of the model. Many assumptions are needed simply because of limited data available to answer the research question.
Analyzing the Cost-Effectiveness Analysis Model
Once the model has been created, analysis should include baseline analysis of cost and effectiveness, followed by sensitivity analysis. The average cost and effectiveness for each strategy, considering all the outcomes to which it might lead, are computed simultaneously. We calculate these averages by weighting the cost and utility at the end of each branch by the probability of reaching it and summing for each strategy, moving from right to left in the tree. In cost-effectiveness analysis decision trees such as that in Figure 1, the costs and utilities for each outcome are placed at the right end of each branch.
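The right-to-left averaging can be sketched in a few lines of Python. This is a minimal illustration and not the model behind Figure 1: the tree layout, the function name `roll_back`, and every probability, cost, and utility below are hypothetical values chosen only to show the arithmetic.

```python
# Minimal sketch of "folding back" a decision tree for ONE strategy.
# All probabilities, costs (dollars), and utilities are hypothetical.

def leaf(cost, utility):
    """Terminal outcome with its assigned cost and utility."""
    return {"cost": cost, "utility": utility}

def roll_back(node):
    """Return (expected cost, expected utility), working from the leaves to the root."""
    if "branches" not in node:  # leaf: values are given directly
        return node["cost"], node["utility"]
    total_p = sum(p for p, _ in node["branches"])
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("branch probabilities at a chance node must sum to 1")
    cost = 0.0
    utility = 0.0
    for p, child in node["branches"]:
        child_cost, child_utility = roll_back(child)
        cost += p * child_cost
        utility += p * child_utility
    return cost, utility

# Hypothetical "imaging test" strategy for suspected appendicitis.
imaging = {"branches": [
    (0.30, {"branches": [                 # positive test, surgery
        (0.90, leaf(9000.0, 0.98)),       # appendicitis confirmed
        (0.10, leaf(9000.0, 0.97)),       # no appendicitis found
    ]}),
    (0.70, {"branches": [                 # negative test, no surgery
        (0.95, leaf(1000.0, 1.00)),       # truly no appendicitis
        (0.05, leaf(15000.0, 0.90)),      # missed appendicitis
    ]}),
]}

expected_cost, expected_utility = roll_back(imaging)
print(round(expected_cost, 2), round(expected_utility, 4))
```

The same function would be applied to each competing strategy, giving one average cost and one average effectiveness per strategy for the comparison that follows.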
Possible results when comparing two strategies include the following: One strategy is less expensive and more effective than another, one strategy is more expensive and less effective, one strategy is less expensive but less effective, and one strategy is more expensive but more effective. The choice in the first two situations is clear, and the better strategy is called "dominant." The final two situations involve trade-offs in cost versus effectiveness, however. In these situations, one compares strategies by using the incremental cost-effectiveness ratio, which allows evaluation of the ratio of increase in cost to increase in effectiveness. What maximal incremental cost-effectiveness ratio is acceptable is open to debate, but for the United States, $50,000 to $100,000 per year of life in perfect health (usually called a "quality-adjusted life-year") is commonly recommended as a maximum.
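A short sketch may make the four cases concrete. The function name `compare` and the cost and quality-adjusted life-year (QALY) figures are hypothetical; ties on cost or effectiveness, which the four cases above do not cover, are reported separately.

```python
# Sketch: classify a comparison of strategy B against strategy A and,
# for a trade-off, compute the incremental cost-effectiveness ratio (ICER).
# Costs are in dollars and effectiveness in QALYs; all inputs are hypothetical.

def compare(cost_a, eff_a, cost_b, eff_b):
    d_cost = cost_b - cost_a
    d_eff = eff_b - eff_a
    if d_cost < 0 and d_eff > 0:
        return "B dominant", None       # B is cheaper and more effective
    if d_cost > 0 and d_eff < 0:
        return "A dominant", None       # B is costlier and less effective
    if d_cost == 0 or d_eff == 0:
        return "tie on cost or effectiveness", None
    # Remaining cases: both differences have the same sign (a trade-off).
    return "trade-off", d_cost / d_eff  # dollars per QALY gained

label, icer = compare(cost_a=3000.0, eff_a=10.00, cost_b=4000.0, eff_b=10.02)
print(label, None if icer is None else round(icer))
```

With these invented inputs, the new strategy costs $1,000 more and yields 0.02 more QALYs, so its ratio falls at the lower end of the commonly recommended maximum range.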
Almost all payers in the United States state that they consider only effectiveness, not cost. Implicitly, then, they accept an indefinitely high incremental cost-effectiveness ratio: it does not matter how much more expensive a strategy is, as long as it is the least bit more effective or the public demands it intensely.
The final task in cost-effectiveness analysis is sensitivity analysis. Sensitivity analysis consists of changing "parameter values" (numerical values, such as probabilities, costs, and valuation of outcomes) in the model to find out what effect they have on the conclusions. A model should be tested in this way for "robustness," or strength of its conclusions with regard to changes in its assumptions and uncertainty in the parameters taken from the literature or expert opinion. If a small change in the value of a parameter leads to a change in the preferred strategy of the model, then the conclusion is said to be sensitive to that parameter, and the conclusion is weak. Sensitivity analysis may persuade doubtful readers of the soundness of the conclusions of the model by showing that the researchers were thorough and unbiased and the conclusions are not sensitive to the assumptions or parameters the readers question. Often, however, sensitivity analysis will show that conclusions are not robust. Alternatively, another cost-effectiveness analysis, conducted by different researchers by using different assumptions and parameters (which is really a form of sensitivity analysis), will reach different conclusions. While discouraging, a similar situation is not uncommon with experimental studies (such as clinical research), with one study having findings different from another. Also, identification of the parameters and assumptions to which the results are sensitive can be very helpful, because it tells researchers what needs to be investigated further through experimental studies to reach reliable conclusions.
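A one-way sensitivity analysis can be sketched as follows. To keep the example short, the model below is deliberately simplified: it compares two strategies on expected utility alone and ignores cost, and all names and parameter values (prevalence, utilities, the sensitivity and specificity of each strategy) are assumptions made for illustration.

```python
# Sketch of a one-way sensitivity analysis on a toy two-strategy model.
# Simplification: strategies are ranked by expected utility only (no costs).
# Every number here is hypothetical.

UTILITIES = {"tp": 0.98, "fn": 0.90, "tn": 1.00, "fp": 0.97}
PREVALENCE = 0.30

def expected_utility(prevalence, sensitivity, specificity, u):
    """Average utility over diseased and nondiseased patients."""
    diseased = sensitivity * u["tp"] + (1 - sensitivity) * u["fn"]
    healthy = specificity * u["tn"] + (1 - specificity) * u["fp"]
    return prevalence * diseased + (1 - prevalence) * healthy

def preferred(imaging_sensitivity):
    """Preferred strategy when only the imaging sensitivity is varied."""
    clinical = expected_utility(PREVALENCE, 0.80, 0.80, UTILITIES)
    imaging = expected_utility(PREVALENCE, imaging_sensitivity, 0.85, UTILITIES)
    return "imaging" if imaging > clinical else "clinical examination"

# Vary one parameter across a plausible range and record the conclusion.
for percent in range(60, 100, 5):
    s = percent / 100
    print(f"imaging sensitivity {s:.2f}: prefer {preferred(s)}")
```

In this toy model the preferred strategy changes when the imaging sensitivity passes a value between 0.75 and 0.80, so a conclusion drawn from a point estimate near that range would be sensitive to the parameter, whereas a conclusion drawn from an estimate well above it would be robust.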
| CHARACTERISTICS OF HIGH-QUALITY EXPERIMENTAL STUDIES |
|---|
The most important considerations follow. We focus on studies of diagnostic accuracy, since these are most common and constitute the principal focus of radiologists, but most of what is said applies to experimental studies of other levels of the hierarchy of efficacy.
Patient Characteristics
Patients in a study should be like those in whom a test will be applied in practice. Often, in initial studies, a test is applied predominantly to very sick patients or completely healthy individuals. This "spectrum bias" exaggerates the real-world ability of the test to distinguish disease from health because intermediate cases that are less than totally clear cut are eliminated. As a result, initial reports on a new test are often overly optimistic. On the other hand, such spectrum bias can be useful in initial studies to ascertain if a test has any possible promise and to help establish the operating parameters at which the test works best.
Number of Cases
The number of cases included in studies should be adequate. Almost always, the smaller the number of cases, the larger the minimum difference that can reliably be observed. Before a study is begun, a statistician should be asked to perform a power calculation to ascertain the number of cases required to detect, with desired reliability, the minimum difference regarded as clinically important. Often, the number of cases included in actual studies is inadequate (22). Such studies are referred to as "underpowered" and can lead to errors.
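As a rough illustration of what a power calculation involves, the sketch below uses the standard normal-approximation formula for comparing two proportions in two independent groups of equal size. It is only one of several possible formulas (a paired design would need a different one), the function name `cases_per_group` and the example values are assumptions, and it is no substitute for consulting a statistician.

```python
# Sketch: cases per group needed to detect a difference between two
# proportions (for example, two sensitivities) with a two-sided test.
# Normal approximation, independent groups of equal size; illustrative only.
from math import ceil
from statistics import NormalDist

def cases_per_group(p1, p2, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical sensitivities of an established test and a new test.
print(cases_per_group(0.80, 0.90))  # larger difference, fewer cases
print(cases_per_group(0.80, 0.85))  # smaller difference, many more cases
```

Halving the difference to be detected roughly quadruples the number of cases required, which is why the minimum clinically important difference has to be settled before the study begins.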
Design Considerations
Prospective studies are almost always preferable to retrospective studies. "Well begun is half done" carries a corollary that "poorly begun is hard to salvage." In a retrospective study, one has to work from someone else's design and data collection, and these are typically far from optimal for one's own purposes.
The temptation to include in the research everything that might be studied should be resisted, lest the study collapse from its own complexity.
Often, the purpose of a study is to compare two diagnostic tests, for example, to compare a proposed new test with an established one. In this situation, unless data on patient health outcomes and cost must be directly obtained, an optimal design consists of applying both tests to all study patients, with interpretation of each test performed while blinded to the results of the other. In contrast, the common practice of using "historical controls" to represent the performance of the established test is usually a poor choice. The patient population in the historical control may be different, and the execution of the historical series may not meet standards of current best practice.
Reference Standard
The reference standard (sometimes less formally called the "gold standard") needs to be chosen carefully. While a perfect reference standard (one with 100% accuracy) often cannot be attained, it is important to do as well as possible. Methodologists routinely warn (4,22,24) that a reference standard that is dependent, even in part, on the test(s) being evaluated involves circular reasoning, and they say it is therefore seriously deficient, but they note that such standards are nonetheless not infrequently used.
Timing
Timing is important because diagnostic imaging is a field that is changing relatively rapidly. There is little point in undertaking a large-scale study when a new technique is in the initial developmental stage and is changing particularly rapidly; results will be obsolete before they are published. On the other hand, it is not wise to wait until a technique is fully mature because, by then, it will often be widely disseminated, making the study too late for its results to readily influence general clinical practice. Use of techniques that lead to rapid completion of a study, such as gathering data from multiple sites, is highly desirable because imaging evolves relatively rapidly.
Efficacy and Effectiveness
Most evaluations of diagnostic tests (and of any other medical care) are studies of efficacy, which is defined as results obtained under ideal conditions, such as those of a careful research project. Initially, efficacy is important to ascertain, but ultimately, one would want to know effectiveness, which is defined as results obtained in ordinary practice. Effectiveness is usually poorer than efficacy. For example, studies in individual academic institutions (that is, efficacy studies) showed that abdominal CT for patients suspected of having appendicitis significantly reduced the perforation rate and unnecessary surgery rate (25,26), but a study of essentially all hospital discharges in Washington state (that is, an effectiveness study) showed no improvement in either rate between 1987 and 1998, a period when laparoscopy and cross-sectional imaging techniques, including CT, became widely available (27). The systematization necessary for an organized study tends to preclude observation of effectiveness: the study protocol ensures uniform application of the test with its parameters set at optimal levels, and people are generally more careful and consistent and do better when they know their activity is being observed (this is called the Hawthorne effect).
Figure 2 lists some additional important considerations for high-quality studies. Sunshine and McNeil (16) discuss the above considerations and those in Figure 2 in more detail.
| SCREENING |
|---|
The Test
Because the prevalence of disease in a screening population is very low (for example, approximately one-half percent in screening mammography), a screening test must be highly specific. Otherwise, false-positive findings will greatly outnumber true-positive findings (even at the relatively high 90%–95% specificity rate for mammography, ie, a 5%–10% recall rate, false-positive findings outnumber true-positive findings by 10–20 to 1), and the cost and morbidity of working up patients with false-positive findings will outweigh the gains from early detection in those with true-positive findings. Similarly, the cost and morbidity of the screening test itself (which apply to every patient screened) must be relatively low; otherwise, they will outweigh the gains of screening, which can occur only for the very small percentage of patients with true-positive findings.
In contrast, sensitivity can be modest. For example, screening mammography has an approximate 75% sensitivity, yet it allows us to identify three of every four possible breast cancers that could be detected if the test were perfectly (100%) sensitive. These requirements for a screening test can be somewhat eased if a high-risk population is identified, because the proportion of true-positive findings will increase. Note that while a screening test optimally has high specificity and may only need modest sensitivity, an optimal diagnostic test for symptomatic patients should have a high sensitivity, but the specificity may be modest.
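The false-positive arithmetic for a low-prevalence screening population can be checked with a few lines of Python. The prevalence, sensitivity, and specificity values are the approximate figures quoted in this section; the function name `screening_counts` and the cohort size of 10,000 are illustrative assumptions.

```python
# Sketch: expected true-positive and false-positive counts in a screened cohort.
# Prevalence 0.5%, sensitivity 75%, and specificity 90%-95% follow the text;
# the cohort size is arbitrary.

def screening_counts(n, prevalence, sensitivity, specificity):
    diseased = n * prevalence
    healthy = n - diseased
    true_pos = diseased * sensitivity
    false_pos = healthy * (1 - specificity)
    return true_pos, false_pos

for specificity in (0.95, 0.90):
    tp, fp = screening_counts(10_000, 0.005, 0.75, specificity)
    ppv = tp / (tp + fp)  # share of positive findings that are true
    print(f"specificity {specificity:.2f}: {fp / tp:.1f} false-positive "
          f"findings per true-positive finding, PPV {ppv:.1%}")
```

With a sensitivity of 75%, this gives roughly 13 to 27 false-positive findings per true-positive finding; setting the sensitivity to 1 in the same function reproduces the 10–20 to 1 range quoted above. Raising the prevalence, as in a high-risk population, lowers the ratio accordingly.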
Treatment
Oddly, the available treatment must be intermediate in efficacy. If treatment is fully efficacious (more specifically, if treatment of symptomatic patients is as efficacious as, and no more costly than, the presymptomatic treatment made possible by screening), then nothing is to be gained by identifying disease before it becomes symptomatic. Conversely, if treatment is completely inefficacious (that is, there is no useful treatment for even presymptomatic disease), there is also no possible gain from screening. Screening can only be beneficial if treatment of presymptomatic disease is more efficacious than treatment of symptomatic disease (29–31). (However, some hold that screening for untreatable genetic diseases and other untreatable diseases can be reasonable because parents can alter reproductive behavior and patients can gain more time to prepare for the consequences of disease.) Given these requirements regarding treatment effectiveness for screening to be sensible, new developments in treatment (for example, the introduction of pharmaceuticals such as donepezil hydrochloride [Aricept; Eisai America, New York, NY] that slow the previously unalterable rate of progression of Alzheimer disease) can completely alter the relevance of screening.
Evaluation of Screening
In general, the efficacy of treatment of presymptomatic disease relative to that of symptomatic disease is not known, although this is a critical issue for screening, as indicated in the previous paragraph. The reason for the lack of knowledge is as follows: if screening has not been done previously, relative efficacy simply is not known because presymptomatic cases have not been identified and treated. On the other hand, if the issue is introduction of a more sensitive screening test, one does not know the efficacy of treating the additional, presumably less advanced cases the new test detects. Partly for this reason, evaluation of screening generally has to consist of a randomized controlled trial in which (a) the intervention consists of the test and the treatment in combination and (b) the end point studied is the death rate, morbidity, or other adverse outcome(s) from the disease being screened for in the intervention population compared with the rates in the control population.
Biases
Three well-known biases (30,32,33) also generally necessitate this randomized controlled trial study design for evaluation of screening tests and generally preclude the use of other end points, such as 5-year survival from time of diagnosis. These three biases should be understood by all radiologists.
"Lead-time bias" refers to the fact that screening will allow detection of disease earlier in its natural history than will waiting for symptoms, so any measurement from time of diagnosis will be biased in favor of screening, regardless of the effectiveness of treatment. Consider an oversimplified example: For lung cancer, 5-year survival from diagnosis is currently 10%20%. Assume that CT screening advances diagnosis by 5
years, but treatment has absolutely no value. Then 5-year survival would nonetheless increase to essentially 100% with screening. In short, survival time in a screened group will incorrectly appear to be better than that in a nonscreened group.
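The oversimplified lung cancer example can be reproduced numerically. In the sketch below, the ten survival times are invented so that 5-year survival from symptomatic diagnosis is 10%; screening is assumed to move the date of diagnosis 5 years earlier while leaving every date of death unchanged.

```python
# Sketch of lead-time bias: diagnosis is advanced, dates of death are not.
# The survival times (years from symptomatic diagnosis to death) are hypothetical.

def five_year_survival(years_from_diagnosis_to_death):
    """Fraction of patients still alive 5 years after diagnosis."""
    alive = sum(1 for t in years_from_diagnosis_to_death if t > 5)
    return alive / len(years_from_diagnosis_to_death)

symptomatic = [0.5, 1, 1, 2, 2, 3, 3, 4, 4.5, 7]
lead_time = 5  # years by which screening advances diagnosis (assumed)

# Same patients, same dates of death; only the clock starts earlier.
screened = [t + lead_time for t in symptomatic]

print(five_year_survival(symptomatic))  # 0.1
print(five_year_survival(screened))     # 1.0
```

No patient lives a day longer in the screened scenario, yet 5-year survival from diagnosis rises from 10% to 100%.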
"Overdiagnosis bias" or "pseudodisease" (29,31) refers to the fact that applying a diagnostic test to asymptomatic individuals will identify "positive cases" that will never become clinically manifest in a persons lifetime. Prostate cancer provides a striking example. It is the most common nonskin malignancy in men in the United States, affecting 10% of them, but careful histopathologic examination at autopsy shows microscopic prostate cancers in nearly 50% of men over the age of 75 years (34). If an imaging test as sensitive as histologic examination at autopsy were developed, but early detection had absolutely no effect on outcomes, the percentage of "cases" showing adverse outcomes would nonetheless decrease by four-fifthsbut only because four-fifths of the "cases" never would have shown any effects of the disease in the absence of screening and treatment. The general point is that, because of overdiagnosis bias, any study of the outcome of cases identified with a screening test will be biased toward screening, for many of the cases identified with screening would never have had any adverse outcomes, even in the absence of treatment. Incidentally, the morbidity and cost of treating such cases is one of the negative consequences of screening.
"Length bias" can be thought of as an attenuated form of pseudodisease. It arises because cases of a disease vary in aggressiveness, with the faster-progressing cases typically also having a natural history with greater morbidity and mortality. Cases detected with screening are typically disproportionately indolent. This is because slow-progressing cases remain longer in the presymptomatic phase in which they are detectable only with screening and do not manifest symptoms. Thus, a test that helps identify asymptomatic cases disproportionately uncovers indolent cases, as Figure 3 shows. Hence, cases detected with screening disproportionately have a relatively favorable prognosis, regardless of the effectiveness of treatment. Thus, any study of outcomes in cases detected with screening (vs those detected when symptoms occur) will be biased toward screening.
|
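Length bias can also be illustrated with simple arithmetic. The sketch assumes a steady state in which the number of presymptomatic cases present at any moment is proportional to incidence multiplied by the time spent in the detectable presymptomatic phase; the two disease types, their incidences, and their phase durations are hypothetical.

```python
# Sketch of length bias: a single screen samples presymptomatic cases in
# proportion to incidence x duration of the detectable presymptomatic phase.
# Steady state is assumed; all numbers are hypothetical.

def screen_detected_share(incidence, presymptomatic_years):
    weights = {k: incidence[k] * presymptomatic_years[k] for k in incidence}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

incidence = {"aggressive": 50, "indolent": 50}         # new cases per year
presymptomatic_years = {"aggressive": 1, "indolent": 4}

print(screen_detected_share(incidence, presymptomatic_years))
```

Although the two types arise equally often in this example, four-fifths of the cases found at screening are indolent, so cases detected with screening would show a more favorable prognosis even if treatment had no effect.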
The percentage reduction in the risk of an adverse effect from the disease being screened for, called "relative risk reduction," is a common measure of the benefit of screening, but this measure needs to be set in context (35). For example, if screening reduces an individual's risk of dying of a particular disease over the next decade from 1.0% to 0.4%, that is a 60% decrease in relative risk, but only 0.6 of a percentage point increase in the probability of surviving the decade.
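The contrast between relative and absolute risk reduction in this example can be computed directly. The sketch also derives the number needed to screen (the reciprocal of the absolute risk reduction), a commonly used summary that the text above does not itself mention; the function name `risk_reduction` is an illustrative choice.

```python
# Sketch: relative vs absolute risk reduction for the example in the text
# (10-year risk of death from the disease falls from 1.0% to 0.4%).

def risk_reduction(risk_without, risk_with):
    absolute = risk_without - risk_with
    relative = absolute / risk_without
    number_needed_to_screen = 1 / absolute  # people screened per death averted
    return relative, absolute, number_needed_to_screen

rrr, arr, nns = risk_reduction(0.010, 0.004)
print(f"relative risk reduction: {rrr:.0%}")
print(f"absolute risk reduction: {arr * 100:.1f} percentage points")
print(f"number needed to screen: about {round(nns)}")
```

Under these figures, about 167 people would have to be screened for a decade to avert one death from the disease, which puts the 60% relative reduction in context.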
In conclusion, for any health care intervention, including diagnostic imaging tests, the ultimate questions are, "How much does this do to improve the health of people?" and "How much does it cost for that gain in health?" By using the methods described in this article, we have the ability to answer these questions as we assess the remarkable imaging technologies available today.
| REFERENCES |
|---|