Radiology
DOI: 10.1148/radiol.2471070822
(Radiology 2008;247:8-11.)
© RSNA, 2008


Editorials

Imaging Technology and Practice Assessment Studies: Importance of the Baseline or Reference Performance Level¹

David Gur, ScD

¹ From the Department of Radiology, University of Pittsburgh, Imaging Research, Rm 223 FARP Building, 3362 Fifth Ave, Pittsburgh, PA 15213-3180. Received May 5, 2007; revision requested June 20; revision received June 28; final version accepted August 23. Supported in part by grant EB003503 (to the University of Pittsburgh) from the National Institute for Biomedical Imaging and Bioengineering, National Institutes of Health. Address correspondence to the author (e-mail: gurd@upmc.edu).

Many years ago, technology developments and practice decisions made by manufacturers and the radiology community depended heavily on the "wow factor," which was based primarily on the subjective assessments of a few "authorities" or "thought leaders" in the field; only limited science was included in the deliberations. Since then, we have become more sophisticated and scientifically minded in our decision-making process. Objective retrospective and prospective studies are now routinely used for technology and practice assessments (1–3). Business decisions and clinical practice preferences are frequently made on the basis of the results of these studies. The regulatory process has also relied heavily on objectively ascertained data (1,4,5). Unfortunately, because of cost and complexity, inferences (ie, conclusions) are often based solely (or primarily) on the results of one "pivotal" study, possibly ignoring the fact that this very study may not be representative of actual clinical practice (6). Worse yet, the study results of the current practice, which serve as the baseline or reference performance level for comparison with the results of the new technology or practice being evaluated, may not reflect a generally accepted performance level. In addition, discussion of the measured reference performance level in these studies is often brief and frequently omitted altogether.

With increasing experience in assessing value, particularly in terms of demonstrated increases in efficiencies, effectiveness, and diagnostic accuracy or decreases in costs (or a combination thereof), the role of appropriate interpretation of scientific observations has changed dramatically. It is our responsibility to convey inferences that are generalizable and likely to withstand the test of time in a rapidly changing environment. Furthermore, because of the possible implications of inferences we make, we should do our best to address issues that were not as important (or pertinent) in the past. This editorial will address but one important issue related to inferences made as a result of pivotal studies that I believe has been largely ignored: namely, the performance level of the reference (often termed baseline or current) technology or practice.

Several of the more basic concepts used in this paper have been recently described in detail by Wagner et al (7) in an outstanding review article on systems evaluations and performance assessment of diagnostic systems. In the present editorial, the term operating point is defined as the sensitivity and specificity level of an observer rating a set of cases as either positive or negative (a binary decision) for the abnormality (or task) in question. This point represents a level of performance for the observer under the conditions in which the cases were read, and it depends on different factors, including but not limited to the case set, the conditions, the "aggressiveness" (threshold for positive or negative) under which the set was read, and the actual proficiency of the observer. A good example of operating points marked in the domain (space) of sensitivity and 1 – specificity for different observers reading the same set of cases is provided by Wagner et al (7) in figure 1. Generally, it can be assumed that the operating point of an observer would be but one point on the individual's or group's performance curve. Performance curve is defined here as the estimated curve of performance for an individual observer, a group, or a technology (eg, computer-aided detection [CAD] alone) in the sensitivity and specificity domain, representing what should be the different sensitivity and specificity levels under different aggressiveness conditions. For example, a receiver operating characteristic (ROC) curve represents a performance curve, but there are other types of performance curves as well (eg, free-response ROC and location ROC).
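The definition of an operating point given above can be made concrete with a minimal Python sketch. The function name `operating_point` and the ten-case reading below are hypothetical and serve only as illustration; they are not data from any of the cited studies.

```python
from typing import Sequence, Tuple


def operating_point(truth: Sequence[bool],
                    called_positive: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for one observer's binary calls."""
    if len(truth) != len(called_positive):
        raise ValueError("truth and called_positive must have equal length")
    pairs = list(zip(truth, called_positive))
    tp = sum(1 for t, c in pairs if t and c)          # abnormal, called positive
    fn = sum(1 for t, c in pairs if t and not c)      # abnormal, called negative
    tn = sum(1 for t, c in pairs if not t and not c)  # normal, called negative
    fp = sum(1 for t, c in pairs if not t and c)      # normal, called positive
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one abnormal and one normal case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical case set: 4 abnormal cases followed by 6 normal cases.
truth = [True] * 4 + [False] * 6
calls = [True, True, True, False, True, False, False, False, False, False]
sens, spec = operating_point(truth, calls)
# The observer's operating point in ROC space is (1 - spec, sens).
```

A more "aggressive" reading of the same cases would turn some negative calls into positive ones, which generally raises sensitivity, lowers specificity, and moves the point along the observer's performance curve.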

We have substantially improved our understanding of performance curves and how these can be used to generate relevant summary indices (eg, area under the ROC curve [AUC]). However, actual performance as measured in a single study often yields only one operating point (eg, one sensitivity and specificity pair) for each test or observer, or only a short, experimentally measured segment of the complete performance curve (2). This operating point, or multiple experimentally determined operating points, is often used for characterizing (eg, estimating by means of curve fitting) a complete performance curve, but this approach could have important limitations. As related to this editorial, when computing an operating point for a current practice (eg, diagnostic performance of screen-film mammography for mass detection) and comparing it to a modified practice, the actual point on the performance curve representing the baseline practice, as well as the slope of the estimated (fitted) curve at the measured operating point, are important factors that could easily affect study conclusions. While the concept of an operating point is discussed here in somewhat general terms, in many instances it is actually applicable to the individual case read by many observers, a reader interpreting a set of examinations, or pooled reading data from a group of observers interpreting a given set of cases. This concept is often applied to subsets of cases or readers, as well (eg, all examinations of women with dense breast tissue) (2).
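One way to see the limitation of extrapolating a complete curve from a single operating point is to compute the AUC that the same point implies under two different curve shapes. The sketch below is illustrative only: the operating point (80% sensitivity, 90% specificity) is hypothetical, and both curve models (straight lines through the point, and an equal-variance binormal curve) are assumptions chosen for simplicity, not the methods used in the cited studies.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def auc_trapezoid(sens: float, spec: float) -> float:
    """AUC of the two-segment curve (0,0) -> (1 - spec, sens) -> (1,1)."""
    return (sens + spec) / 2.0


def auc_equal_variance_binormal(sens: float, spec: float) -> float:
    """AUC of the equal-variance binormal curve through the same point."""
    # Separation d' between the normal and abnormal score distributions.
    d_prime = _STD_NORMAL.inv_cdf(sens) - _STD_NORMAL.inv_cdf(1.0 - spec)
    return _STD_NORMAL.cdf(d_prime / sqrt(2.0))


# Hypothetical operating point, for illustration only.
sens, spec = 0.80, 0.90
auc_a = auc_trapezoid(sens, spec)
auc_b = auc_equal_variance_binormal(sens, spec)
# The two estimates differ although the measured point is identical.
```

The gap between the two values comes entirely from the assumed curve shape, which is why the slope of the fitted curve at the measured point matters for the conclusions drawn.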

Several studies have demonstrated that the addition of CAD to mammographic screening practices generally results in an improvement in cancer detection at the cost of an increase in recall rate (considered false-positives for this purpose) (8–10). However, the higher the level of baseline performance (ie, without CAD), the more difficult it is to demonstrate a clinically relevant improvement (ie, with CAD). This may seem trivial, but the fact is that different studies show widely varying levels of baseline performance.

It is often assumed that baseline performance in the specific study performed for hypothesis testing is low enough to enable one to show an improvement reasonably easily. This is not always the case. For example, the recent study published in the New England Journal of Medicine by members of the Breast Cancer Surveillance Consortium (11) had relatively high baseline sensitivity (80%) for screen-film mammography alone (without CAD). This resulted in a lower likelihood that significant, clinically relevant sensitivity improvements could be demonstrated with CAD. The opposite can also be true. A relatively low sensitivity level for the baseline practice makes it inherently easier for a new practice to compete, as shown for example in the Digital Mammographic Imaging Screening Trial (DMIST) study (sensitivity of 66% for the whole population and 51% for women under 50 years of age) (2) comparing screen-film mammography and full-field digital mammography or, in the magnetic resonance (MR) imaging screening trial, comparing screen-film mammography (sensitivity 40%) and MR imaging in women with a familial or genetic predisposition (12). In the DMIST study, baseline sensitivity for screen-film mammography was quite low, as noted by the authors. The reasons for this low performance may be several, but the ultimate effect is the same. Competing with a less-than-optimal baseline performance level may lead to conclusions that are not likely to be sustained, at least not on the same order of magnitude, in the future. While the examples presented here are related to recent work in breast imaging, this issue is clearly relevant to other imaging and nonimaging technology and practice assessments.

I intentionally use the term clinically relevant differences, since variability (eg, between readers) decreases rapidly as one approaches a perfect performance level (ie, equal to zero at 100% accuracy). Hence, statistical significance, which is determined by measuring performance differences divided by the appropriate computed variance, may be more easily demonstrated for very small differences between technologies or practices that result in extremely high performance levels. However, the clinical relevance of these differences has to be carefully considered with an appropriate perspective.
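The shrinking variability near perfect performance can be illustrated with a deliberately simplified sketch. It assumes two independent samples of `n` cases and the binomial variance p(1 - p)/n for each measured proportion, and it divides the difference by its standard error; multireader studies use more elaborate variance models, and all numbers below are hypothetical.

```python
from math import sqrt


def proportion_variance(p: float, n: int) -> float:
    """Binomial variance of a proportion p estimated from n cases."""
    return p * (1.0 - p) / n


def z_statistic(p_ref: float, p_new: float, n: int) -> float:
    """Difference in proportions divided by its (unpooled) standard error."""
    se = sqrt(proportion_variance(p_ref, n) + proportion_variance(p_new, n))
    return (p_new - p_ref) / se


# The same 2-percentage-point difference at two hypothetical performance levels.
z_mid = z_statistic(0.60, 0.62, 1000)   # moderate performance level
z_high = z_statistic(0.97, 0.99, 1000)  # near-perfect performance level
```

Under these assumptions `z_high` exceeds the conventional 1.96 threshold while `z_mid` does not, even though the absolute difference is identical, which is why clinical relevance has to be judged separately from statistical significance.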

The slope of the estimated performance curve typically changes continually along the performance curve (from very high slope at very low sensitivity levels to virtually zero slope at 100% sensitivity). Hence, the slope at the measured operating point is an important determinant of whether the new (competing) practice is likely to demonstrate an improvement in performance. For CAD use, the estimated ratio of improvements in sensitivity to the corresponding increases in recall rate ranges in several studies between 1.0 and 1.5. Namely, several studies show approximately 1.0%–1.5% increases in sensitivity for every 1% increase in recall rate (8–10). If the baseline practice has a comparable slope at the measured operating point, the use of CAD will most likely be similar to moving "along" the same performance curve, and no significant improvement in terms of an overall performance measure, such as AUC, is likely to be demonstrated. If the slope for the new practice is substantially lower than the slope of the baseline practice at that point, the performance curve for the new practice (eg, use of CAD) will more likely show a decrease in performance (Figure). The opposite is true, too; namely, if the slope of the baseline performance curve is lower than that of the new practice at the experimentally measured operating point, an improvement is more likely to be demonstrated (Figure).
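The slope argument can be sketched numerically. The example below assumes an equal-variance binormal baseline curve with a hypothetical separation `d_prime = 1.5`, treats recall rate as the false-positive fraction, and applies the same hypothetical change (1.25% gain in sensitivity per 1% gain in recall rate) at two operating points. The function names and the tolerance `tol` are illustrative choices, not part of any cited method.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def baseline_tpf(fpf: float, d_prime: float) -> float:
    """Sensitivity on the assumed equal-variance binormal baseline curve."""
    return _STD_NORMAL.cdf(d_prime + _STD_NORMAL.inv_cdf(fpf))


def local_slope(fpf: float, d_prime: float) -> float:
    """Slope d(TPF)/d(FPF) of the baseline curve at the given FPF."""
    z = _STD_NORMAL.inv_cdf(fpf)
    return _STD_NORMAL.pdf(d_prime + z) / _STD_NORMAL.pdf(z)


def classify_change(fpf: float, d_prime: float, delta_fpf: float,
                    change_slope: float, tol: float = 1e-3) -> str:
    """Compare the new operating point with the baseline curve."""
    new_fpf = fpf + delta_fpf
    new_tpf = baseline_tpf(fpf, d_prime) + change_slope * delta_fpf
    gap = new_tpf - baseline_tpf(new_fpf, d_prime)
    if gap > tol:
        return "higher curve"
    if gap < -tol:
        return "lower curve"
    return "along baseline curve"


# Same hypothetical change applied at a steep and at a flat part of the curve.
at_steep = classify_change(fpf=0.05, d_prime=1.5, delta_fpf=0.01, change_slope=1.25)
at_flat = classify_change(fpf=0.40, d_prime=1.5, delta_fpf=0.01, change_slope=1.25)
```

In this sketch the identical change lands on a lower curve where the baseline slope is steep and on a higher curve where the baseline slope is shallow, although sensitivity increases in both cases.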


Figure 1
ROC plot serves as demonstration of a diagnostic system performance curve. Two operating points, (1) and (2), are shown. At each, two arrows represent possible performance changes (dashed curves) due to technology or practice change, with all four resulting in increased sensitivity. Note that the paired arrows (high or low slopes) have equal slopes at both operating points. At (1), the change in practice with the higher slope moves it along the baseline performance curve, while the one with the lower slope moves the practice to a lower performance curve; yet, both increase sensitivity. At (2), the practice with the higher slope moves it to a higher curve while the one with the lower slope moves it along the baseline performance curve. These seemingly different conclusions are the direct result of the reference curve slopes at operating points of interest. Note that it is actually unlikely that changes in technology or practices will result in the same slope at different operating points, and the figure is provided solely for illustration purposes.

 
The underlying assumption we generally make is that a complete performance curve rather than a single operating point describes the performance of a diagnostic system. Therefore, one could argue, for example, that training of radiologists to simply change their threshold (eg, be more or less conservative or aggressive as to what they recall) could move them along this performance curve. Hence, when a new practice is considered (eg, use of CAD), one could argue that it may not be needed because appropriate training to be more aggressive could result in a similar or, in some cases, even better change in performance. The fact is that we do not know how to train observers to change their decision threshold in a consistent manner. Our experience has been that even when observers are successfully trained to change their decision threshold for a while, over time they tend to migrate back to their underlying or "natural" operating point. This operating point is often different for each practicing radiologist. Frequently, it is unclear how to move radiologists to another point on their own performance curves, let alone to a point on a higher performance curve, by means other than through the use of new technologies or diagnostic aids (eg, CAD). Therefore, when we are interested in improving sensitivity, for example, we may ultimately decide to implement the new practice (eg, use of CAD) even though it may represent a less than theoretically optimal way of achieving this objective. The new practice may indeed result in increasing sensitivity despite the fact that it does not always increase (and sometimes decreases) the curve-dependent summary index (eg, AUC or positive predictive value) (8,10,11).

Several of the studies mentioned in this editorial are probably among the best executed and described multisite, multireader experiments in this field (2,11,12), but I continue to wonder about the following regarding the Breast Cancer Surveillance Consortium and DMIST studies: If we can imagine the baseline performance levels of these two studies for screen-film imaging alone being interchanged, would full-field digital mammography look as good compared with screen-film mammography in the DMIST study? Probably not. Similarly, would CAD use look as bad when compared with screen-film mammography alone in the Breast Cancer Surveillance Consortium study? Again, probably not. Both comments are related to the measured baseline performance levels for screen-film imaging alone, regardless of the competing modalities in the studies.

We need to seriously consider many issues when we make practice decisions based on the results of pivotal studies. One of the most important issues is the reference performance level being compared with that of the new technology or practice. A range of acceptable ("reasonable") baseline performance levels for the reference practice should be defined (and agreed on) a priori. When the measured level in a study falls outside this range (either substantially higher or lower), one should ask why. Since double-blind, multisite, prospective studies are impractical and virtually impossible to execute in our field, the best we can hope for in most instances in this type of experimental environment is to compare the best the current practice offers with the best the new practice can offer, recognizing that neither is likely to be actually achieved in the clinical environment.

Unfortunately, in the diagnostic imaging area, even very strictly designed studies such as randomized controlled trials remain primarily laboratory studies. Little is known about the "laboratory effect," if any, and how to measure it under different experimental conditions and environments or how to account for it in inferences made as a result of laboratory studies. All we can hope for is that if there is an effect, it will be similar for the reference and the competing practice; hence, on a relative scale, inferences based on differences in performance levels (or lack thereof) in the laboratory will remain valid in the actual clinical practice.

These multivariable (eg, sites, cases, readers, and modalities) studies are often extremely complicated to execute and are certainly difficult to interpret with the appropriate perspective. When a study shows a substantially lower or higher than expected and generally accepted performance for the reference practice, the possibility that the study results are affected by outliers (eg, sites, readers, or cases) or that the whole study constitutes an outlier should be carefully evaluated and discussed.

There are many other issues related to this general topic. However, I have chosen to focus here on but one, the baseline performance level.


References

  1. Warren Burhenne LJ, Wood SA, D'Orsi CJ, et al. Potential contribution of computer-aided detection to the sensitivity of screening mammography. Radiology 2000;215(2):554–562. [Published correction appears in Radiology 2000;216(1):306.]
  2. Pisano ED, Gatsonis C, Hendrick E, et al. Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med 2005;353(17):1773–1783.
  3. Lehman CD, Gatsonis C, Kuhl CK, et al. MRI evaluation of the contralateral breast in women with recently diagnosed breast cancer. N Engl J Med 2007;356(13):1295–1303.
  4. Center for Devices and Radiological Health. CDRH consumer information: new device approval—ImageChecker CT CAD Software System (model LN-1000)-P030012. U.S. Food and Drug Administration. http://www.fda.gov/cdrh/mda/docs/p030012.html. Published July 8, 2004. Updated August 19, 2004. Accessed June 25, 2007.
  5. Alpert S. U.S. Food and Drug Administration. Approval letter for M1000 ImageChecker for mammography. U.S. Food and Drug Administration. http://www.fda.gov/cdrh/pdf/p970058.pdf. Published June 26, 1998. Accessed June 25, 2007.
  6. Gur D, Sumkin JH, Rockette HE, et al. Changes in breast cancer detection and mammography recall rates after the introduction of a computer-aided detection system. J Natl Cancer Inst 2004;96(3):185–190.
  7. Wagner RF, Metz CE, Campbell G. Assessment of medical imaging systems and computer aids: a tutorial review. Acad Radiol 2007;14(6):723–748.
  8. Birdwell RL, Bandodkar P, Ikeda DM. Computer-aided detection with screening mammography in a university hospital setting. Radiology 2005;236(2):451–457.
  9. Freer TW, Ulissey MJ. Screening mammography with computer-aided detection: prospective study of 12,860 patients in a community breast center. Radiology 2001;220(3):781–786.
  10. Dean JC, Ilvento CC. Improved cancer detection using computer-aided detection with diagnostic and screening mammography: prospective study of 104 cancers. AJR Am J Roentgenol 2006;187(1):20–28.
  11. Fenton JJ, Taplin SH, Carney PA, et al. Influence of computer-aided detection on performance of screening mammography. N Engl J Med 2007;356(14):1399–1409.
  12. Kriege M, Brekelmans CT, Boetes C, et al. Efficacy of MRI and mammography for breast-cancer screening in women with a familial or genetic predisposition. N Engl J Med 2004;351(5):427–437.



