Published online before print February 7, 2008, 10.1148/radiol.2471071321
(Radiology 2008;247:12-15.)
© RSNA, 2008
Comparing Areas under Receiver Operating Characteristic Curves: Potential Impact of the "Last" Experimentally Measured Operating Point¹
David Gur, ScD,
Andriy I. Bandos, PhD, and
Howard E. Rockette, PhD
1 From the Department of Radiology (D.G.) and Department of Biostatistics, Graduate School of Public Health (A.I.B., H.E.R.), University of Pittsburgh, Imaging Research, FARP Building, 3362 Fifth Ave, Pittsburgh, PA 15213. Received July 26, 2007; revision requested September 10; revision received September 24; final version accepted October 25. Supported in part by grants EB001694, EB002106, and EB003503 (to the University of Pittsburgh) from the National Institute for Biomedical Imaging and Bioengineering, National Institutes of Health.
Address correspondence to D.G. (e-mail: gurd{at}upmc.edu).
ABSTRACT
A specific issue related to the selection of the analytic tool used when comparing the estimated performance of systems within the receiver operating characteristic (ROC) paradigm is reviewed. This issue is the possible effect of the last experimentally ascertained ROC data point, that is, the point with the highest true-positive and false-positive fractions. An example is presented in which the choice of analysis approach changes the study conclusion: the difference is not statistically significant with parametric analysis but is significant with nonparametric analysis. This is followed by recommendations that should help prevent misinterpretation of results.
INTRODUCTION
Currently, observer performance studies are routinely performed for the assessment and comparison of technologies and practices, and the area under the receiver operating characteristic (ROC) curve (AUC) is the most frequently used summary index when comparing different modalities (1–5). In the medical imaging field, conclusions resulting from important pivotal studies are often based on the assessment of differences between areas under estimated ROC curves that include substantial portions near which no experimental data lie. Hence, in most instances, it would be preferable to assess differences between partial AUCs (6,7). Unfortunately, the large variability frequently associated with these studies would necessitate extremely large sample sizes to enable demonstration of significant differences between partial AUCs, making this approach impractical in many situations. As a result, we continue to do the best we can under the circumstances: namely, estimate the differences between the AUCs and assess the statistical significance of these differences, if any exist.
A variety of parametric and nonparametric approaches have been developed to make statistical inferences based on AUCs in experiments that are performed under the ROC paradigm (8–17). An implicit assumption in all of these approaches is that even if the estimated AUCs are not precise in absolute terms, the comparison of two or more AUCs on a relative scale is frequently valid regardless of the analytic approach used to estimate the individual ROC curves being compared. All of the methods employed to date extrapolate (or, more precisely, interpolate, depending on how one considers the "sensitivity = 1, specificity = 0" point) from the last experimentally ascertained ROC point (termed here an operating point) to the upper-right-hand point of the ROC domain (ie, the point at which sensitivity and 1 – specificity both equal 1, or the 1,1 coordinate). The parametric and the nonparametric approaches achieve this quite differently, and the specific methodological approach depends on the actual analytic tool being used (14,18,19). Fundamentally, however, the nonparametric approach, which is solely data driven (ie, no underlying assumption is made regarding the distribution of the data), extends the ROC curve linearly from the last experimental point to the 1,1 point in the ROC domain.
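To illustrate the linear extension used by the nonparametric approach, the following minimal Python sketch computes a trapezoidal AUC from a list of operating points. The function name `empirical_auc` and the operating points are hypothetical and chosen only for illustration; they are not the data shown in the Figure, and the sketch is not the implementation used by any of the cited tools.

```python
def empirical_auc(points):
    """Trapezoidal area under an empirical ROC curve.

    `points` is a list of (fpf, tpf) operating points. The trivial
    points (0, 0) and (1, 1) are added here, so the segment from the
    last measured point to (1, 1) is a straight line.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical operating points (assumption, not study data).
# Reader B shares reader A's points but has one additional point
# farther to the right in the ROC domain.
reader_a = [(0.05, 0.40), (0.10, 0.55)]
reader_b = [(0.05, 0.40), (0.10, 0.55), (0.30, 0.80)]

auc_a = empirical_auc(reader_a)
auc_b = empirical_auc(reader_b)
print(f"AUC A = {auc_a:.4f}, AUC B = {auc_b:.4f}")
```

In this hypothetical case the two readers agree at every shared operating point, yet the empirical AUCs differ solely because the last measured point of reader B lies farther along the false-positive axis.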
In a recent publication (20), a comparison between two paired ROC curves showed the differences to be not statistically significant when the parametric approach was used; however, when the nonparametric approach was used, the differences were statistically significant. This motivated an investigation into the possible reasons for such a discrepancy, in that there may be a fundamental problem not adequately recognized previously that may have caused it.
The question posed in this article is whether the last experimentally ascertained operating point could affect study conclusions and, if so, how. To address this question, an example of two ROC data sets, with the experimental points associated with each, is presented. This example is based on a sample of 348 actually negative and 181 actually positive cases interpreted by two readers with a paired design (Figure).

Four ROC curves generated by parametric and nonparametric approaches by using a data set of two readers interpreting the same cases. AUCs (± standard deviation) for the parametric approach are 0.81 ± 0.065 and 0.83 ± 0.034 for readers 1 and 2, respectively (P = .75). AUCs corresponding to the nonparametric approach are 0.68 ± 0.025 and 0.74 ± 0.024 for readers 1 and 2, respectively (P = .003).
THE PROBLEM
Consider an ROC experiment in which the two curves being compared are similar under the parametric model, but the experimental data in one mode extend beyond the last experimental point of the other (Figure). In the Figure, the parametric approach (ROCKIT; C. E. Metz, B. A. Herman, C. A. Roe, University of Chicago, Ill; http://xray.bsd.uchicago.edu/krl/KRL_ROC/software_index6.htm) was used to fit the data sets of the two readers interpreting the same set of 529 cases during an ROC-type study. Note that in many instances, whether the actual parametrically fitted curves are the same or are only similar in overall shape would not affect the primary issue we are addressing. As can be seen from the two types of curves, the parametric approach extends the curve to the 1,1 point in a totally different manner than does the nonparametric approach. The parametric approach extrapolates (actually interpolates) the performance curve between the last experimentally ascertained operating point and the 1,1 point (namely, the upper right corner of the ROC domain) under the assumption that the adopted parametric model provides a good approximation throughout the domain. Specifically, the estimated parameters of the curve are assumed to be valid beyond the region in which experimental data are truly available. On the other hand, the nonparametric approach assumes random decision making beyond the region for which experimental data are available. As a result, the computed AUC clearly depends on the approach taken and, most important to this discussion, on the last actually measured operating point. The fact is that there is nothing in the experimental design itself that would force the experimentally ascertained points to have the same specificity, the same sensitivity, or both.
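The contrast between the two extensions can be sketched numerically. The example below assumes the conventional binormal model, in which the true-positive fraction is Φ(a + b·Φ⁻¹(FPF)) and the full AUC is Φ(a/√(1 + b²)); the parameter values, the location of the last operating point, and the function names are hypothetical and are not taken from the ROCKIT fits in the Figure.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def binormal_tpf(fpf, a, b):
    """TPF on a binormal ROC curve: Phi(a + b * Phi^-1(FPF)), for b > 0."""
    if fpf <= 0.0:
        return 0.0
    if fpf >= 1.0:
        return 1.0
    return _STD.cdf(a + b * _STD.inv_cdf(fpf))


def binormal_auc(a, b):
    """Closed-form area under the full binormal curve."""
    return _STD.cdf(a / (1.0 + b * b) ** 0.5)


def area_beyond(last_fpf, a, b, steps=2000):
    """Area between last_fpf and 1 under (fitted curve, straight line)."""
    width = (1.0 - last_fpf) / steps
    xs = [last_fpf + i * width for i in range(steps + 1)]
    ys = [binormal_tpf(x, a, b) for x in xs]
    # Trapezoidal integration of the fitted curve beyond the last point.
    curve = sum((ys[i] + ys[i + 1]) / 2.0 * width for i in range(steps))
    # Straight line from the last operating point to the 1,1 point.
    chord = (1.0 - last_fpf) * (ys[0] + 1.0) / 2.0
    return curve, chord


# Hypothetical parameters and last operating point (assumptions).
a, b = 1.2, 1.0
curve_area, chord_area = area_beyond(0.10, a, b)
print(f"full binormal AUC       = {binormal_auc(a, b):.3f}")
print(f"area beyond FPF = 0.10: curve {curve_area:.3f}, line {chord_area:.3f}")
```

For a concave fitted curve the area contributed by the parametric extension exceeds that of the straight-line extension, and the gap grows as the last measured point moves toward the left side of the ROC domain.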
Reviews of our own studies suggest that these points do not always coincide across the readers or modes being compared. For example, the Figure shows a very small difference in the parametric estimates of the AUCs between the two readers (AUC difference between readers 2 and 1 is 0.02 [0.83 – 0.81], which is not significant [P = .75; ROCKIT, version 0.9B]). At the same time, the nonparametric estimates of the AUCs are 0.68 and 0.74 for readers 1 and 2, respectively, a statistically significant difference (P = .003) according to the approach of DeLong et al (15). More important, perhaps, the difference between the two curves is quite visible on the plot. In some situations, it can be easily envisioned that the actual performance estimates as a result of one type of analysis (eg, parametric) are the reverse of those of the other type (eg, nonparametric), and this is more likely to occur in the case of crossing curves in a parametric model. This could lead to totally different (and sometimes opposing) study conclusions.
In the article by Glueck et al (20), figures 1 and 2 are a clear demonstration that this could occur even when the data from different modes are not represented by very similar curves. For example, the two curves related to screen-film mammography alone and to the combined mode (screen-film, digital, or both) yielded AUCs of 0.83 and 0.89, respectively (actual difference, 0.056; P = .07), with the parametric model (table 2 in reference 20). Visually, these ROC curves are reasonably similar. The difference in AUC increased to 0.073 (difference between 0.78 and 0.85) with the nonparametric model (table 3 in reference 20), which was significant (P = .008). Visually, these curves look notably different as well. The reason we mention the results of Glueck et al is that in their study, the use of the nonparametric method indeed had a significantly different outcome than did the parametric method (ie, reject vs fail to reject the hypothesis of equality of AUCs).
DISCUSSION
We have shown that in the case of nonparametric analyses of differences in AUCs, the last experimentally ascertained operating point can significantly affect AUC estimates in a manner that could result in different study conclusions, as compared with AUC estimates derived when using the parametric approach. It is very important that investigators plot the actual experimentally ascertained data and view these data together with the ROC fitted plots before drawing conclusions based on one type of analysis or the other. A parametric fitted curve presented without the actually ascertained points may be difficult to interpret (2). If a partial AUC is of interest, the investigators should make sure that experimentally ascertained data points cover the range of the false-positive fraction of interest to prevent these possible effects, as well. Some of these issues have been discussed in a recent review article by Wagner et al (21). As statisticians frequently consider the use of nonparametric approaches to analyzing ROC-type experiments, investigators should be cognizant of the possible effect described here.
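The recommendation regarding partial AUCs can be expressed as a simple check made before the partial area is computed. The sketch below is a minimal illustration under our own assumptions: the function names `covers_range` and `partial_auc` and the operating points are hypothetical, and the partial area is the unnormalized trapezoidal area over the false-positive range from 0 to `fpf_max`.

```python
def covers_range(points, fpf_max):
    """True if at least one measured point lies at or beyond fpf_max."""
    return max(fpf for fpf, _ in points) >= fpf_max


def partial_auc(points, fpf_max):
    """Trapezoidal area under the empirical ROC curve for FPF in [0, fpf_max]."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= fpf_max:
            break
        if x1 > fpf_max:
            # Truncate the segment at fpf_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (fpf_max - x0) / (x1 - x0)
            x1 = fpf_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical operating points (assumption, not study data).
reader = [(0.05, 0.40), (0.10, 0.55)]

for limit in (0.10, 0.20):
    status = "covered" if covers_range(reader, limit) else "NOT covered by data"
    print(f"FPF <= {limit:.2f}: partial AUC = {partial_auc(reader, limit):.5f} ({status})")
```

When the range of interest is not covered, part of the reported partial area comes from the straight-line extension toward the 1,1 point and not from measured data, which is the situation the recommendation is meant to flag.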
Note that this article addresses issues related to ROC methodology and the use of the AUC as a summary index under the ROC paradigm, but some of the concepts described here may also be similarly relevant to experiments performed under the free-response ROC, or FROC, paradigm. However, that topic is beyond the scope of this article. In addition, the issue discussed here is frequently not a problem in assessment of the performance of technology alone (eg, computer-aided detection) because the experimental data generated by computer-aided detection systems are frequently well distributed along the complete false-positive fraction axis (22,23).
We believe that, in general, when the last operating points are not the same in terms of actually measured sensitivity, the higher the slope of the performance curve at the last experimentally ascertained data points (ie, at a low false-positive rate) the larger the likelihood that differences in AUCs will differ between a nonparametric analysis and a parametric analysis. This may or may not result in different conclusions for the study. It is truly not clear which of the two approaches is the one that should routinely be used. Since it is not known how the performance curve truly behaves beyond the last experimental operating point, arguments in support of either approach could be made. However, it can reasonably be argued that the parametric approach is more likely to represent how the curve should behave in the extrapolated region, given that the estimated performance curve does not exhibit a "hook." By hook we mean that the fitted curve is concave upward in the considered region. When a hook occurs, it is unlikely to represent the actual behavior of human observers.
With the binormal ROC model, whenever the variance for the negative and positive cases in the set is not the same, which is very frequently the case, curve hooking can occur, as can be seen in published performance (1). This problem has been addressed to a large extent in newer parametric models (18,19), but whenever the experimental data are predominantly in the low false-positive region (left side of the ROC domain) it is harder to judge the quality and "actual performance plausibility" of the complete curve. In addition, the price of a "proper-looking" extrapolation could be high sensitivity to the binormal model assumptions. It has been demonstrated that even in the case of "binormal-looking" curves that are formally not binormal ROC curves, inferences made on the basis of the binormal ROC model can be highly sensitive to the actual location of the last experimentally ascertained operating point (24).
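One visible symptom of hooking in the conventional binormal model is that the fitted curve crosses the chance diagonal whenever the slope parameter b differs from 1. Assuming the usual parameterization TPF = Φ(a + b·Φ⁻¹(FPF)), the crossing occurs where Φ⁻¹(FPF) = a/(1 – b). The sketch below locates that point; the parameter values and function names are hypothetical, and crossing the diagonal is used here only as an indicator of a hook, not as the definition given above.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def binormal_tpf(fpf, a, b):
    """TPF on a binormal ROC curve for 0 < fpf < 1."""
    return _STD.cdf(a + b * _STD.inv_cdf(fpf))


def chance_line_crossing(a, b):
    """FPF at which the binormal curve crosses TPF = FPF (None if b == 1)."""
    if b == 1:
        return None  # equal variances: the curve never crosses the diagonal
    # Solve a + b*z = z for the normal deviate z, then map back to FPF.
    z = a / (1.0 - b)
    return _STD.cdf(z)


# Hypothetical binormal parameters with unequal variances (assumption).
a, b = 1.5, 0.5
crossing = chance_line_crossing(a, b)
print(f"curve crosses the chance line at FPF = {crossing:.5f}")

# Beyond the crossing the fitted curve lies below the diagonal.
probe = 0.9995
print(f"TPF at FPF = {probe}: {binormal_tpf(probe, a, b):.5f}")
```

With these hypothetical values the hook is confined to the extreme upper right corner, a region where experimental operating points are rarely available to confirm or refute the fitted shape.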
In summary, the issue discussed here is more likely to be important when the parametrically estimated AUC is very high (eg, >0.9) and when observers tend to be more decisive in their observations (25). We believe that when most experimental points are indeed in the low false-positive region, one has to plot and carefully assess the quality of the curves. If the last experimentally ascertained operating points for the modalities being compared are substantially different in terms of true- or false-positive fractions, one should justify the actual approach taken (eg, parametric) and discuss the potential implications had the other approach (eg, nonparametric) been taken.
FOOTNOTES
Abbreviations: AUC = area under the ROC curve, ROC = receiver operating characteristic
Authors stated no financial relationship to disclose.
References
1. Pisano ED, Gatsonis C, Hendrick E, et al. Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med 2005;353(17):1773–1783.
2. Fenton JJ, Taplin SH, Carney PA, et al. Influence of computer-aided detection on performance of screening mammography. N Engl J Med 2007;356(14):1399–1409.
3. Gur D, Rockette HE, Armfield DR, et al. Prevalence effect in a laboratory environment. Radiology 2003;228(1):10–14.
4. Skaane P, Balleyguier C, Diekmann F, et al. Breast lesion detection and classification: comparison of screen-film mammography and full-field digital mammography with soft-copy reading - observer performance study. Radiology 2005;237(1):37–44.
5. Shiraishi J, Abe H, Li F, Engelmann R, MacMahon H, Doi K. Computer-aided diagnosis for the detection and classification of lung cancers on chest radiographs: ROC analysis of radiologists' performance. Acad Radiol 2006;13(8):995–1003.
6. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making 1989;9(3):190–195.
7. Jiang Y, Metz CE, Nishikawa RM. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology 1996;201(3):745–750.
8. Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989;24(3):234–245.
9. Obuchowski NA, Rockette HE. Hypothesis testing of the diagnostic accuracy for multiple diagnostic tests: ANOVA approach with dependent observations. Communications in Statistics - Simulation and Computation 1995;24:285–308.
10. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992;27(9):723–731.
11. Toledano AY. Three methods for analysing correlated ROC curves: a comparison in real data sets from multi-reader, multi-case studies with a factorial design. Stat Med 2003;22(18):2919–2933.
12. Obuchowski NA, Beiden SV, Berbaum KS, et al. Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Acad Radiol 2004;11(9):980–995.
13. Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis. Acad Radiol 2000;7(5):341–349.
14. Bandos AI, Rockette HE, Gur D. A permutation test sensitive to differences in areas for comparing ROC curves from a paired design. Stat Med 2005;24(18):2873–2893.
15. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the area under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44(3):837–845.
16. Campbell G, Douglas MA, Bailey JJ. Nonparametric comparison of two tests of cardiac function on the same patient population using the entire ROC curve. In: Computers in Cardiology 1988 Conference proceedings. New York, NY: IEEE, 1988; 267–270.
17. Wieand HS, Gail MM, Hanley JA. A nonparametric procedure for comparing diagnostic tests with paired or unpaired data. Institute of Mathematical Statistics Bulletin 1983;12:213–214.
18. Pesce LL, Metz CE. Reliable and computationally efficient maximum-likelihood estimation of "proper" binormal ROC curves. Acad Radiol 2007;14(7):814–829.
19. Dorfman DD, Berbaum KS, Brandser EA. A contaminated binormal model for ROC data. I. Some interesting examples of binormal degeneracy. Acad Radiol 2000;7(6):420–426.
20. Glueck DH, Lamb MM, Lewin JM, Pisano ED. Two-modality mammography may confer an advantage over either full-field digital mammography or screen-film mammography. Acad Radiol 2007;14(6):670–676.
21. Wagner RF, Metz CE, Campbell G. Assessment of medical imaging systems and computer aids: a tutorial review. Acad Radiol 2007;14(6):723–748.
22. Wei J, Hadjiiski LM, Sahiner B, et al. Computer-aided detection systems for breast masses: comparison of performances on full-field digital mammograms and digitized screen-film mammograms. Acad Radiol 2007;14(6):659–669.
23. Rana RS, Jiang Y, Schmidt RA, Nishikawa RM, Liu B. Independent evaluation of computer classification of malignant and benign calcifications in full-field digital mammograms. Acad Radiol 2007;14(3):363–370.
24. Walsh SJ. Limitations to the robustness of binormal ROC curves: effects of model misspecification and location of decision threshold on bias, precision, size and power. Stat Med 1997;16(6):669–679.
25. Gur D, Rockette HE, Bandos AI. "Binary" and "non-binary" detection tasks: are current performance measures optimal? Acad Radiol 2007;14(7):871–876.