Editorials
1 From the Department of Radiology, University of Pittsburgh, Imaging Research, Rm 223 FARP Building, 3362 Fifth Ave, Pittsburgh, PA 15213-3180. Received May 5, 2007; revision requested June 20; revision received June 28; final version accepted August 23. Supported in part by grant EB003503 (to the University of Pittsburgh) from the National Institute for Biomedical Imaging and Bioengineering, National Institutes of Health. Address correspondence to the author (e-mail: gurd{at}upmc.edu).
Many years ago, when technology developments and practice decisions were made by manufacturers and the radiology community, these were heavily dependent upon the "wow factor," which was based primarily on the subjective assessments of but a few "authorities" or "thought leaders" in the field; only limited science had been included in the deliberations. Since then, we have become more sophisticated and scientifically minded in our decision-making process. Objective retrospective and prospective studies are now routinely used for technology and practice assessments (1–3). Business decisions and clinical practice preferences are frequently made on the basis of the results of these studies. The regulatory process has also relied heavily on objectively ascertained data (1,4,5). Unfortunately, because of cost and complexity, inferences (ie, conclusions) are often based solely (or primarily) on the results of one "pivotal" study, possibly ignoring the fact that this very study may not be representative of actual clinical practice (6). Worse yet, the study results for the current practice, which serve as the baseline or reference performance level for comparison with the results of the new technology or practice being evaluated, may not reflect a generally accepted performance level. In addition, discussion of the measured reference performance level in these studies is often brief and is frequently omitted altogether.
With increasing experience in assessing value, particularly in terms of demonstrated increases in efficiencies, effectiveness, and diagnostic accuracy or decreases in costs (or a combination thereof), the role of appropriate interpretation of scientific observations has changed dramatically. It is our responsibility to convey inferences that are generalizable and likely to withstand the test of time in a rapidly changing environment. Furthermore, because of the possible implications of inferences we make, we should do our best to address issues that were not as important (or pertinent) in the past. This editorial will address but one important issue related to inferences made as a result of pivotal studies that I believe has been largely ignored: namely, the performance level of the reference (often termed baseline or current) technology or practice.
Several of the more basic concepts used in this paper have been recently described in detail by Wagner et al (7) in an outstanding review article on systems evaluations and performance assessment of diagnostic systems. In the present editorial, the term operating point is defined as the sensitivity and specificity level of an observer rating a set of cases as either positive or negative (a binary decision) for the abnormality (or task) in question. This point represents a level of performance for the observer under the conditions in which the cases were read, and it depends on different factors, including but not limited to the case set, the conditions, the "aggressiveness" (threshold for positive or negative) under which the set was read, and the actual proficiency of the observer. A good example of operating points marked in the domain (space) of sensitivity and 1 – specificity for different observers reading the same set of cases is provided by Wagner et al (7) in figure 1. Generally, it can be assumed that the operating point of an observer would be but one point on the individual's or group's performance curve. Performance curve is defined here as the estimated curve of performance for an individual observer, a group, or a technology (eg, computer-aided detection [CAD] alone) in the sensitivity and specificity domain, representing what should be the different sensitivity and specificity levels under different aggressiveness conditions. For example, a receiver operating characteristic (ROC) curve represents a performance curve, but there are other types of performance curves as well (eg, free-response ROC and location ROC).
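The definition of an operating point given above can be illustrated with a minimal sketch. The rating scale, the case set, and the function name `operating_point` are hypothetical and are not taken from any of the cited studies; the sketch only shows how the same reader ratings yield different sensitivity and specificity pairs as the threshold ("aggressiveness") changes.

```python
from typing import Sequence, Tuple


def operating_point(
    truth: Sequence[bool], ratings: Sequence[int], threshold: int
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when ratings >= threshold are called positive.

    truth[i] is True if case i is actually abnormal; ratings[i] is the
    reader's suspicion score for case i (hypothetical 1-5 scale).
    """
    if len(truth) != len(ratings):
        raise ValueError("truth and ratings must have the same length")
    calls = [r >= threshold for r in ratings]
    tp = sum(1 for t, c in zip(truth, calls) if t and c)
    fn = sum(1 for t, c in zip(truth, calls) if t and not c)
    tn = sum(1 for t, c in zip(truth, calls) if not t and not c)
    fp = sum(1 for t, c in zip(truth, calls) if not t and c)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one abnormal and one normal case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical case set: four abnormal cases followed by four normal cases.
truth = [True, True, True, True, False, False, False, False]
ratings = [5, 4, 3, 2, 3, 2, 1, 1]

# A lower threshold means a more aggressive reader: sensitivity rises
# while specificity falls, tracing out points on one performance curve.
for threshold in (4, 3, 2):
    sens, spec = operating_point(truth, ratings, threshold)
    print(f"threshold {threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Each threshold gives one operating point; a single study that records only binary decisions observes just one of them.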
We have substantially improved our understanding of performance curves and how these can be used to generate relevant summary indices (eg, area under the ROC curve [AUC]). However, performance as measured in a single study often yields only one operating point (ie, one sensitivity and specificity pair) for each test or observer, or only a short, experimentally measured segment of the complete performance curve (2). This operating point, or multiple experimentally determined operating points, is often used for estimating a complete performance curve (eg, by means of curve fitting), but this approach could have important limitations. As related to this editorial, when computing an operating point for a current practice (eg, diagnostic performance of screen-film mammography for mass detection) and comparing it with that of a modified practice, the actual point on the performance curve representing the baseline practice and the slope of the estimated (fitted) curve at the measured operating point are important factors that could easily affect study conclusions. While the concept of an operating point is discussed here in somewhat general terms, in many instances it is actually applicable to an individual case read by many observers, a reader interpreting a set of examinations, or pooled reading data from a group of observers interpreting a given set of cases. This concept is often applied to subsets of cases or readers as well (eg, all examinations of women with dense breast tissue) (2).
Several studies have demonstrated that the addition of CAD to mammographic screening practices generally results in an improvement in cancer detection at the cost of an increase in recall rate (considered false-positives for this purpose) (8–10). However, the higher the level of baseline performance (ie, without CAD), the more difficult it is to demonstrate a clinically relevant improvement (ie, with CAD). This may seem trivial, but the fact is that different studies show widely varying levels of baseline performance.
It is often assumed that baseline performance in the specific study performed for hypothesis testing is low enough to enable one to show an improvement reasonably easily. This is not always the case. For example, the recent study published in the New England Journal of Medicine by members of the Breast Cancer Surveillance Consortium (11) had relatively high baseline sensitivity (80%) for screen-film mammography alone (without CAD). This resulted in a lower likelihood that significant, clinically relevant sensitivity improvements could be demonstrated with CAD. The opposite can also be true. A relatively low sensitivity level for the baseline practice makes it inherently easier for a new practice to compete. Examples include the Digital Mammographic Imaging Screening Trial (DMIST) (sensitivity of 66% for the whole population and 51% for women under 50 years of age) (2), which compared screen-film mammography and full-field digital mammography, and the magnetic resonance (MR) imaging screening trial, which compared screen-film mammography (sensitivity 40%) and MR imaging in women with a familial or genetic predisposition (12). In DMIST, baseline sensitivity for screen-film mammography was quite low, as noted by the authors. The reasons for this low performance may be several, but the ultimate effect is the same. Competing with a less-than-optimal baseline performance level may lead to conclusions that are not likely to be sustained, at least not on the same order of magnitude, in the future. While the examples presented here are related to recent work in breast imaging, this issue is clearly relevant to other imaging and nonimaging technology and practice assessments.
I intentionally use the term clinically relevant differences, since variability (eg, between readers) decreases rapidly as one approaches a perfect performance level (ie, it equals zero at 100% accuracy). Hence, statistical significance, which is determined by dividing the measured performance difference by the appropriately computed standard error, may be more easily demonstrated for very small differences between technologies or practices that result in extremely high performance levels. However, the clinical relevance of these differences has to be carefully considered with an appropriate perspective.
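The shrinking variability near perfect performance can be sketched numerically. The example below is only an illustration under stated assumptions: it uses simple binomial sampling variability for a proportion (not the between-reader variability of a multireader study), treats the two arms as independent samples of equal size, and uses hypothetical sensitivities and sample sizes. The function names are illustrative.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an observed proportion p based on n cases (binomial assumption)."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("need 0 <= p <= 1 and n > 0")
    return math.sqrt(p * (1.0 - p) / n)


def z_for_difference(p_ref: float, p_new: float, n: int) -> float:
    """Difference in proportions divided by its standard error.

    Assumes two independent samples of n cases each (an assumption of this
    sketch; paired reader studies require a different variance estimate).
    """
    se = math.sqrt(binomial_se(p_ref, n) ** 2 + binomial_se(p_new, n) ** 2)
    if se == 0.0:
        raise ValueError("standard error is zero; the ratio is undefined")
    return (p_new - p_ref) / se


# The same hypothetical 2-percentage-point gain at two baseline levels.
n_cases = 1000
z_mid = z_for_difference(0.50, 0.52, n_cases)
z_high = z_for_difference(0.97, 0.99, n_cases)
print(f"baseline 50%: z = {z_mid:.2f}")
print(f"baseline 97%: z = {z_high:.2f}")
```

Under these assumptions the identical absolute gain produces a much larger test statistic at the high baseline, which is why statistical significance alone says little about clinical relevance there.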
The slope of the estimated performance curve typically changes continually along the performance curve (from very high slope at very low sensitivity levels to virtually zero slope at 100% sensitivity). Hence, the slope at the measured operating point is an important determinant of whether the new (competing) practice is likely to demonstrate an improvement in performance. For CAD use, the estimated ratio of the improvement in sensitivity to the corresponding increase in recall rate ranges in several studies between 1.0 and 1.5. Namely, several studies show approximately 1.0%–1.5% increases in sensitivity for every 1% increase in recall rate (8–10). If the baseline practice has a comparable slope at the measured operating point, the use of CAD will most likely be similar to moving "along" the same performance curve, and no significant improvement in terms of an overall performance measure, such as AUC, is likely to be demonstrated. If the slope for the new practice is substantially lower than that of the baseline practice at that point, the performance curve for the new practice (eg, use of CAD) will more likely show a decrease in performance (Figure). The opposite is true, too; namely, if the slope of the baseline performance curve is lower than that of the new practice at the experimentally measured operating point, an improvement is more likely to be demonstrated (Figure).
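The slope argument can be made concrete with a small sketch. It assumes a conventional binormal ROC model, TPF = Φ(a + b·Φ⁻¹(FPF)), as the baseline performance curve and treats recall rate as the false-positive fraction; the parameters `a` and `b`, the operating points, and the function names are hypothetical choices for illustration and are not estimates from any cited study.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def binormal_tpf(fpf: float, a: float, b: float) -> float:
    """Sensitivity on a binormal ROC curve at false-positive fraction fpf (0 < fpf < 1)."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))


def binormal_slope(fpf: float, a: float, b: float) -> float:
    """Slope d(TPF)/d(FPF) of the binormal ROC curve at fpf (0 < fpf < 1)."""
    z = _STD_NORMAL.inv_cdf(fpf)
    return b * _STD_NORMAL.pdf(a + b * z) / _STD_NORMAL.pdf(z)


def gap_from_baseline(fpf: float, d_fpf: float, ratio: float, a: float, b: float) -> float:
    """Vertical gap between the new operating point and the baseline curve.

    The new practice is assumed to move the operating point from
    (fpf, TPF) to (fpf + d_fpf, TPF + ratio * d_fpf). A positive gap
    means the new point lies above the baseline curve.
    """
    new_tpf = binormal_tpf(fpf, a, b) + ratio * d_fpf
    return new_tpf - binormal_tpf(fpf + d_fpf, a, b)


# Hypothetical baseline curve and a sensitivity-to-recall ratio of 1.25,
# in the middle of the 1.0-1.5 range quoted in the text.
a, b, ratio = 1.5, 1.0, 1.25
for fpf in (0.10, 0.30):
    slope = binormal_slope(fpf, a, b)
    gap = gap_from_baseline(fpf, 0.01, ratio, a, b)
    verdict = "above" if gap > 0 else "below"
    print(f"FPF={fpf:.2f}: baseline slope={slope:.2f}, new point is {verdict} the curve")
```

With these assumed parameters, the same incremental ratio places the new operating point below the baseline curve where the baseline slope is steep and above it where the baseline slope is shallow, which mirrors the reasoning in the paragraph above.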
Several of the studies mentioned in this editorial are probably among the best executed and described multisite, multireader experiments in this field (2,11,12), but I continue to wonder about the following regarding the Breast Cancer Surveillance Consortium and DMIST studies: If we imagine the baseline performance levels of these two studies for screen-film imaging alone being interchanged, would full-field digital mammography look as good compared with screen-film mammography in the DMIST study? Probably not. Similarly, would CAD use look as bad when compared with screen-film mammography alone in the Breast Cancer Surveillance Consortium study? Again, probably not. Both comments are related to the measured baseline performance levels for screen-film imaging alone, regardless of the competing modalities in the studies.
We need to seriously consider many issues when we make practice decisions based on the results of pivotal studies. One of the most important issues is the reference performance level being compared with that of the new technology or practice. A range of acceptable ("reasonable") baseline performance levels for the reference practice should be defined (and agreed on) a priori. When this level is not achieved in a study (because it is either substantially higher or lower), one should wonder why. Since double-blind, multisite, prospective studies are impractical and virtually impossible to execute in our field, the best we can hope for in most instances in this type of experimental environment is to compare the best the current practice offers with the best the new practice can offer, recognizing that neither is likely to be actually achieved in the clinical environment.
Unfortunately, in the diagnostic imaging area, even very strictly designed studies such as randomized controlled trials remain primarily laboratory studies. Little is known about the "laboratory effect," if any, and how to measure it under different experimental conditions and environments or how to account for it in inferences made as a result of laboratory studies. All we can hope for is that if there is an effect, it will be similar for the reference and the competing practice; hence, on a relative scale, inferences based on differences in performance levels (or lack thereof) in the laboratory will remain valid in the actual clinical practice.
These multivariable (eg, sites, cases, readers, and modalities) studies are often extremely complicated to execute and are certainly difficult to interpret with the appropriate perspective. When a study shows a substantially lower or higher than expected and generally accepted performance for the reference practice, the possibility that the study results are affected by outliers (eg, sites, readers, or cases) or that the whole study constitutes an outlier should be carefully evaluated and discussed.
There are many other issues related to this general topic. However, I have chosen to focus here on but one, the baseline performance level.