Breast Imaging
1 From the Kurt Rossmann Laboratories for Radiologic Image Research, Dept of Radiology (Y.J., R.M.N., R.A.S., K.D.), and the Depts of Anesthesia and Critical Care (A.Y.T.) and Health Studies (A.Y.T.), Univ of Chicago, 5841 S Maryland Ave, MC2026, Chicago, IL 60637. Received Jul 13, 2000; revision requested Aug 21; final revision received Jan 15, 2001; accepted Feb 15. Supported in part by NIH grant CA 60187. Address correspondence to Y.J. (e-mail: y-jiang@uchicago.edu).
| ABSTRACT |
|---|
MATERIALS AND METHODS: Ten radiologists interpreted mammograms showing clustered microcalcifications in 104 patients. Decisions for biopsy or follow-up were made with and without a computer aid, and these decisions were compared. The computer was used to estimate the likelihood that a microcalcification cluster was due to a malignancy. Variability in the radiologists' recommendations for biopsy versus follow-up was then analyzed.
RESULTS: Variation in the radiologists' accuracy, as measured with the SD of the area under the receiver operating characteristic curve, was reduced by 46% with computer aid. Access to the computer aid increased the agreement among all observers from 13% to 32% of the total cases (P < .001), while the κ value increased from 0.19 to 0.41 (P < .05). Use of computer aid eliminated two-thirds of the substantial disagreements in which two radiologists recommended biopsy and routine screening in the same patient (P < .05).
CONCLUSION: In addition to its demonstrated potential to improve diagnostic accuracy, computer-aided diagnosis has the potential to reduce the variability among radiologists in the interpretation of mammograms.
Index terms: Breast neoplasms, calcification, 00.81; Breast neoplasms, diagnosis, 00.30; Computers, diagnostic aid; Diagnostic radiology, observer performance
| INTRODUCTION |
|---|
| MATERIALS AND METHODS |
|---|
Of the malignant cases, 37 were ductal carcinoma in situ, and nine were invasive ductal carcinoma. Of the benign cases, two were lobular carcinoma in situ, four were atypical ductal hyperplasia, 16 were hyperplasia without atypia, seven were adenosis, six were fibroadenoma, 18 were fibrocystic change or fibrosis, and five were breast tissue without specific abnormality.
Consecutive cases were collected by using the following criteria: (a) A cluster of microcalcifications was the only suspect lesion, which led to the biopsy and for which the pathologic results were definitive; (b) original mammograms, including at least two standard views and one magnification view, were available; and (c) the technical quality of the mammograms was adequate for interpretation (8). To balance the number of malignant and benign cases and thereby increase statistical power, the malignant cases were collected, necessarily, from a longer period (11,12). These cases were clinically evaluated before the Breast Imaging Reporting and Data System was implemented; therefore, they were not assigned a Breast Imaging Reporting and Data System assessment category (8). Additional specific details regarding case selection are reported elsewhere (8).
Radiologist Observers
Ten radiologists, who had experience in mammography but who had not previously seen the study cases, interpreted the mammograms. Five observers were practicing radiologists from the Chicago metropolitan area, and five were senior radiology residents from our institution. For the attending radiologists, mammography accounted for an average of 30% of their clinical practice, and they were certified readers according to the Mammography Quality Standards Act. They had been reading mammograms for an average of 9 years (median, 6 years; range, 1–30 years), and they had read at least 1,000 mammograms in the preceding year. The residents had limited experience from training rotations of 1–2 months' duration. Written informed consent, as approved by our institutional review board, was obtained from all observers after the nature of the experiment was fully explained. Data analysis was performed for three observer groups: all observers (n = 10), attending radiologists (n = 5), and residents (n = 5).
Computer Aid
The computer aid was an estimate of the likelihood (0%–100%) that a microcalcification cluster was due to a malignancy. An artificial neural network calculated the estimate on the basis of eight image features that were automatically extracted from standard-view screen-film mammograms (13). Mammograms were digitized with a 0.1-mm pixel size and a 12-bit gray scale by using a digitizer (Lumiscan 100; Lumisys, Sunnyvale, Calif). Locations of microcalcifications were manually identified on a computer monitor (8).
The observers were explicitly instructed to use the computer aid in their interpretation. They were told that the computer output had a sensitivity (defined as the fraction of cancers for which biopsy would have been recommended) of approximately 90% and positive predictive value (defined as the fraction of all cases for which biopsy would have been recommended that were cancers) of approximately 61% when a threshold of 30% was applied to the computer-estimated likelihood of malignancy. The performance estimates of the computer were obtained from the study cases. One interpretation of this instruction is that any observer could have achieved the same accuracy as the computer by recommending biopsy only when the computer reported a likelihood of malignancy of 30% or greater.
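The two definitions given to the observers can be made concrete with a short sketch. The function name and the toy likelihood values below are hypothetical and are not the study data; only the 30% threshold and the definitions of sensitivity and positive predictive value come from the text.

```python
def sensitivity_and_ppv(likelihoods, is_malignant, threshold=30.0):
    """Recommend biopsy when the computer-estimated likelihood (in %) is at
    or above the threshold, then return (sensitivity, PPV).

    sensitivity = cancers with a biopsy recommendation / all cancers
    PPV         = cancers with a biopsy recommendation / all biopsy recommendations
    """
    biopsy = [p >= threshold for p in likelihoods]
    true_pos = sum(1 for b, m in zip(biopsy, is_malignant) if b and m)
    n_cancer = sum(is_malignant)
    n_biopsy = sum(biopsy)
    if n_cancer == 0 or n_biopsy == 0:
        raise ValueError("need at least one cancer and one biopsy recommendation")
    return true_pos / n_cancer, true_pos / n_biopsy


# Hypothetical toy cases (likelihood of malignancy in %, pathologic truth).
likelihoods = [85, 40, 10, 55, 25, 70, 5, 35]
is_malignant = [True, True, False, False, True, True, False, False]

sens, ppv = sensitivity_and_ppv(likelihoods, is_malignant)
print(f"sensitivity = {sens:.2f}, PPV = {ppv:.2f}")
```

Raising the threshold in this sketch trades sensitivity for positive predictive value, which is why the reported figures are tied to the specific 30% cutoff.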
Data Acquisition
Each observer reviewed the cases twice: once with and once without the computer aid; each review was separated by an average of 30 days (range, 10–60 days). The following counterbalanced study design was used: Half of the mammograms were read without the computer aid in the first reading session and were read again with the computer aid in the second reading session; the other half of the mammograms were read first with the computer aid and then without the aid. The study design minimizes potential biases; it has been well documented (11,12) and has been described (8) in detail. The observers were asked to report (a) their level of confidence (on an analog scale of 0%–100%) that a lesion was malignant and (b) their clinical recommendation (Table 1).
κ values, which were determined by using other software (SPLUS; MathSoft, Seattle, Wash). The Student t and McNemar χ² tests were used to calculate P values, and the bootstrap method was used to estimate the 95% CIs in the statistical analyses. Sensitivity was defined as the fraction of cancers for which surgical biopsy or alternative tissue sampling was recommended. Specificity was defined as the fraction of benign lesions for which short-term or routine follow-up was recommended. Because sensitivity and specificity incompletely describe accuracy and because they depend on how a radiologist selects a decision threshold to define positive diagnoses, we also performed an ROC analysis, which is the standard method for evaluating observer accuracy (6,14,15). We obtained ROC curves by fitting the binormal model to the confidence data, and we obtained summary ROC curves for the 10 observers as a group by averaging the slope and intercept parameters of the individual curves (14). The area under the ROC curve (Az) was used as a summary index of accuracy. Az can have values between 0.5, which represents no apparent accuracy (diagnoses corresponding to random chance alone), and 1.0, which represents perfect accuracy.
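For readers unfamiliar with the binormal model, the relation between its two parameters and Az can be sketched as follows. Under the standard binormal formulation, the ROC curve is a straight line on normal-deviate axes with intercept a and slope b, and Az = Φ(a / √(1 + b²)). The function names and the per-reader parameter values are hypothetical; the curve fitting itself (done by the authors with dedicated software) is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist, mean


def binormal_az(a, b):
    """Area under a binormal ROC curve with intercept a and slope b."""
    return NormalDist().cdf(a / sqrt(1.0 + b * b))


def summary_az(params):
    """Az of the summary curve obtained by averaging the intercept and
    slope parameters of the individual readers' curves."""
    a_mean = mean(a for a, _ in params)
    b_mean = mean(b for _, b in params)
    return binormal_az(a_mean, b_mean)


# Hypothetical (intercept, slope) pairs for three readers.
readers = [(1.0, 1.0), (1.4, 0.9), (0.8, 1.1)]

for a, b in readers:
    print(f"a = {a:.1f}, b = {b:.1f}, Az = {binormal_az(a, b):.3f}")
print(f"summary Az = {summary_az(readers):.3f}")
```

An intercept of 0 with a slope of 1 gives Az = 0.5, the chance-level value mentioned above.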
A histogram of interobserver agreement regarding clinical recommendations was constructed, and the κ statistic was computed. This histogram displayed the number of cases as a function of the number of observers in agreement. For the 10 observers, 11 patterns of agreement in the recommendations were possible; these patterns included 10 biopsy recommendations, nine biopsy and one follow-up recommendations, eight biopsy and two follow-up recommendations, and so on. For this analysis, we compared the recommendations for biopsy (option a or b in Table 1) versus those for follow-up (option c or d in Table 1), because this is the most important clinical decision. Separate histograms were constructed for cancers and benign lesions. Separate histograms were also constructed for attending radiologists, residents, and all radiologists. The histograms were similar for the three observer groups; we report only the summary histogram of all radiologists combined.
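The agreement histogram described above can be sketched in a few lines. The data layout (one list of biopsy-versus-follow-up decisions per case) and the toy cases are assumptions made for illustration only.

```python
from collections import Counter


def agreement_histogram(recommendations):
    """Count cases by the number of readers who recommended biopsy.

    recommendations: one list per case, one boolean per reader
    (True = biopsy, False = follow-up).
    Returns a list of length n_readers + 1, so 10 readers give 11 bins.
    """
    n_readers = len(recommendations[0])
    counts = Counter(sum(case) for case in recommendations)
    return [counts.get(k, 0) for k in range(n_readers + 1)]


# Hypothetical toy data: four cases, ten readers each.
cases = [
    [True] * 10,               # all ten recommend biopsy
    [True] * 9 + [False],      # nine biopsy, one follow-up
    [False] * 10,              # all ten recommend follow-up
    [True] * 5 + [False] * 5,  # evenly split
]

hist = agreement_histogram(cases)
# Complete agreement corresponds to the two end bins of the histogram.
unanimous = (hist[0] + hist[-1]) / len(cases)
print(hist, unanimous)
```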
The κ statistic is widely used as a measure of agreement (16). It reflects the proportion of agreement after the proportion of agreement that can be attributed to chance alone is subtracted (17). κ equals 1 for perfect agreement, and κ equals 0 when the agreement can be attributed to chance alone. We computed the multireader κ value (18) and estimated the 95% CIs by using the bootstrap method.
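As an illustration only, a multireader κ with a bootstrap confidence interval might be computed as in the sketch below. It assumes the Fleiss formulation of multirater κ and a percentile bootstrap that resamples cases; the text does not state which formulation or bootstrap variant the authors used, and the function names and toy counts are hypothetical.

```python
import random


def fleiss_kappa(counts):
    """Fleiss' multirater kappa (assumed formulation).

    counts[i][j] = number of readers who placed case i in category j;
    every case must be rated by the same number of readers.
    """
    n_cases = len(counts)
    n_readers = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement: fraction of agreeing reader pairs, averaged over cases.
    p_obs = sum(
        (sum(c * c for c in row) - n_readers) / (n_readers * (n_readers - 1))
        for row in counts
    ) / n_cases
    # Chance agreement from the overall category proportions.
    p_cat = [sum(row[j] for row in counts) / (n_cases * n_readers)
             for j in range(n_cats)]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        # Every rating fell in one category; treated here as complete
        # agreement (a convention chosen for this sketch).
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)


def bootstrap_ci(counts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(counts)
    stats = sorted(
        fleiss_kappa([counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: [readers recommending biopsy, readers recommending follow-up].
toy = [[10, 0], [9, 1], [5, 5], [0, 10], [2, 8], [7, 3]]
print(fleiss_kappa(toy), bootstrap_ci(toy, n_boot=500))
```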
Using the definitions by Elmore et al (1), we defined substantial disagreement as a situation in which one radiologist recommended biopsy (option a or b in Table 1) and another recommended routine follow-up (option d in Table 1) in the same case (short-term follow-up was excluded from this particular analysis to emphasize extremes in decision making). Pairwise and per-patient frequencies of substantial disagreement were calculated. The pairwise frequency was the occurrence of substantial disagreement in all recommendation pairs (ie, recommendations made by two different observers in the same case). The total number of recommendation pairs was equal to the following: [number of cases × number of readers × (number of readers − 1)]/2. For 10 observers, there were a total of (104 × 10 × 9)/2, or 4,680, recommendation pairs. The per-patient frequency was the fraction of total cases (ie, 104 cases) in which different observers simultaneously recommended at least one biopsy procedure and at least one routine screening procedure. Because of the large differences in the denominators, the pairwise frequency tended to produce a low estimate, and the per-patient frequency tended to produce a high estimate of the substantial disagreement; neither was clinically accurate, because it was unlikely that 10 radiologists would have independently evaluated the case in clinical practice. Because the true frequency of substantial disagreement was expected to be between the pairwise and per-patient frequencies and because, to our knowledge, no single accurate measure is known, we report both pairwise and per-patient results, as Elmore et al (1) did.
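The two frequencies defined above can be computed as in the following sketch. The label strings, the function name, and the toy recommendations are hypothetical; the two biopsy options in Table 1 are collapsed into a single label here for simplicity.

```python
from itertools import combinations

BIOPSY = "biopsy"                    # option a or b in Table 1 (collapsed)
SHORT_TERM = "short-term follow-up"  # option c
ROUTINE = "routine follow-up"        # option d


def substantial_disagreement(recommendations):
    """Return (pairwise frequency, per-patient frequency).

    recommendations: one list per case, one label per reader.
    A reader pair disagrees substantially when one recommends biopsy and
    the other recommends routine follow-up for the same case.
    """
    n_cases = len(recommendations)
    n_readers = len(recommendations[0])
    total_pairs = n_cases * n_readers * (n_readers - 1) // 2
    pair_hits = 0
    case_hits = 0
    for case in recommendations:
        hits = sum(1 for x, y in combinations(case, 2)
                   if {x, y} == {BIOPSY, ROUTINE})
        pair_hits += hits
        case_hits += 1 if hits > 0 else 0
    return pair_hits / total_pairs, case_hits / n_cases


# Denominator check against the text: 104 cases and 10 readers.
assert 104 * 10 * 9 // 2 == 4680

# Hypothetical toy data: two cases, three readers.
toy = [
    [BIOPSY, ROUTINE, SHORT_TERM],  # one substantially disagreeing pair
    [BIOPSY, BIOPSY, SHORT_TERM],   # none
]
pairwise, per_patient = substantial_disagreement(toy)
print(pairwise, per_patient)
```

In the toy data, one of six pairs disagrees substantially while one of two cases does, which shows how the different denominators pull the two estimates apart.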
| RESULTS |
|---|
κ values are shown in Table 5. The results were consistent among the three observer groups (all radiologists, attending radiologists, and residents). Although the residents' κ values were smaller than those of the attending radiologists, the differences were not statistically significant (P > .05). In all three observer groups, use of the computer aid improved agreement from fair to moderate (on the ordinal scale where a κ value of 0.21–0.40 represents fair agreement beyond chance, and 0.41–0.60 represents moderate agreement beyond chance [19]). All improvements were statistically significant (P < .05).
χ² > 4.33, with one exception of P = .052 and χ² = 3.77 for residents and cancers alone; McNemar χ² test [The degree of freedom for the McNemar χ² test is always 1.]). The reduction was not statistically significant for benign cases alone (P > .08, χ² < 3.00; McNemar χ² test).
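The link between a McNemar χ² value and its P value with 1 degree of freedom can be checked with a short sketch. For 1 degree of freedom, the upper-tail probability of χ² equals erfc(√(χ²/2)). The uncorrected form of the McNemar statistic is assumed below; the text does not say whether a continuity correction was applied, and the function names and discordant counts are hypothetical.

```python
from math import erfc, sqrt


def mcnemar_chi2(b, c):
    """Uncorrected McNemar statistic from the two discordant counts
    (b: outcome present only without the aid, c: only with the aid)."""
    return (b - c) ** 2 / (b + c)


def chi2_p_value_df1(chi2):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2.0))


# The chi-square values quoted in the text and their P values.
for value in (4.33, 3.77, 3.00):
    print(f"chi2 = {value:.2f} -> P = {chi2_p_value_df1(value):.3f}")

# Hypothetical discordant counts, for illustration only.
print(mcnemar_chi2(15, 5))
```

A χ² of 3.77 corresponds to P of approximately .052, which agrees with the exception noted for residents and cancers alone.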
| DISCUSSION |
|---|
Although κ is widely used as a quantitative measure of agreement, it is not without limitations when the findings of different studies are compared (20). More important, there is no explicit relationship between the κ statistic and ROC analysis; the latter is often used to quantify diagnostic accuracy. Therefore, we extended our calculations beyond determining the κ value to three separate but related analyses. First, we calculated the variability that is evident in the ROC summary indices. This analysis could serve as a direct link between the calculation of variability and the calculation of diagnostic accuracy by means of ROC analysis. Second, we calculated the κ value and the pattern of agreement. Third, we assessed variability from the points of view of the referring physician and the patient by using a calculation in the literature (1). Each of the three analyses addressed a different aspect of variability, and together they helped to define its magnitude and the ability of CAD to help reduce the variability.
Our analysis revealed that there is considerable variability in the interpretation of mammograms by radiologists; this finding is consistent with that of other studies (1–4). We found similar or poorer agreement between radiologists, compared with the results of Elmore et al (1). Elmore et al reported a per-patient substantial-disagreement frequency of 25%, which is similar to our result of 23% for attending radiologists. However, our κ values were generally lower; these values suggested poorer agreement. This result may have been caused by differences in the calculation of κ values (ie, averaging two-reader κ values [1] vs calculating multireader κ values); also, our study (8) did not include cases that were not evaluated at biopsy. To increase the statistical power of the study by enhancing the proportion of cases that were difficult to diagnose, only abnormal cases that had biopsy confirmation were used in our study (11). These difficult cases can be presumed to generate more variability in interpretation. We found a range of 35% in sensitivity and a range of 44% in specificity. These results agree with the ranges of 53% in sensitivity and 45% in specificity reported by Beam et al (2), who studied the results of 108 radiologists.
Two sources can potentially generate variability in the interpretation of mammograms. First, variations in diagnostic accuracy (ie, variations in radiologists' abilities to correctly diagnose cancerous and cancer-free lesions) may be a primary source of variability. Az values vary as a result of this variation. Second, a radiologist's selection of a decision threshold that defines a positive diagnosis in his or her interpretation can also produce variability (6). A decision threshold is necessary in all binary diagnostic tasks, and its selection is influenced by a radiologist's perception of disease prevalence and the benefits and costs associated with correctly diagnosing the disease (14). Although selection of different thresholds causes sensitivity and specificity to vary simultaneously and in opposite directions, such variations are not caused by and do not represent variations in diagnostic accuracy (6). Selection of the different thresholds does not cause Az values to vary because an ROC curve, for which Az is a summary index, depicts all of the tradeoffs available as the threshold is varied. Therefore, selection of the decision threshold is an issue that is separate from the variation in diagnostic accuracy as quantified with Az values.
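The distinction between threshold choice and accuracy can be illustrated with a single binormal ROC curve. In the sketch below, the intercept a and slope b are fixed (hypothetical values), so Az is fixed, while three hypothetical readers choose different decision thresholds, expressed here as different false-positive fractions.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sensitivity_at(a, b, false_positive_fraction):
    """Sensitivity on a binormal ROC curve (intercept a, slope b) at a
    given false-positive fraction, ie, at a given decision threshold."""
    z_fpf = STD_NORMAL.inv_cdf(false_positive_fraction)
    return STD_NORMAL.cdf(a + b * z_fpf)


# Hypothetical curve shared by all three readers; Az is therefore identical.
a, b = 1.5, 1.0
az = STD_NORMAL.cdf(a / sqrt(1.0 + b * b))

# Strict, moderate, and lax thresholds.
operating_points = []
for fpf in (0.10, 0.30, 0.60):
    sens = sensitivity_at(a, b, fpf)
    spec = 1.0 - fpf
    operating_points.append((sens, spec))
    print(f"Az = {az:.3f}  sensitivity = {sens:.3f}  specificity = {spec:.3f}")
```

Sensitivity rises and specificity falls as the threshold is relaxed, yet every operating point lies on the same curve and shares the same Az.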
Our results (Fig 1) showed that the sensitivity and specificity data points were on or near the average ROC curves; these results indicated that much of the variation in sensitivity and specificity was caused by the use of different decision thresholds during interpretation and not by variations in diagnostic accuracy. This is consistent with the interpretation of D'Orsi and Swets (6) of the results of Elmore et al (1). As one might expect, the similarity in the ranges of sensitivity and specificity with and without use of a computer aid indicated that CAD had little influence on the radiologists' choices of decision thresholds, since CAD is not expected to influence the radiologists' perception of disease prevalence and the benefits and costs associated with correctly diagnosing the disease. The improvement in accuracy achieved with CAD is a result of the radiologists being able to improve their performance, as reflected with a different (higher) ROC curve. Moreover, compared with the without-aid data, the decreased dispersion of the with-aid sensitivity and specificity data points from the average ROC curve shows that CAD helped the radiologists to interpret mammograms with a more uniform, as well as higher, level of accuracy. Therefore, although CAD caused little change in the ranges of sensitivity and specificity (which in our study appear to have been determined largely by the radiologists' choice of decision thresholds), our results showed that CAD helped the radiologists to reduce variation in their diagnostic accuracy.
We analyzed data for attending radiologists and residents both in aggregate and in two separate groups, and we found that the results were similar except in the frequencies of substantial disagreement (Fig 3). We believe the residents' data are clinically relevant because the majority of recent residents and fellows who go into private practice are assigned to reading screening mammograms. Although one may interpret data from attending radiologists and residents differently, inclusion or exclusion of the residents' data did not alter the findings of this study.
Comparison with Other Observer Study Data
We compared our findings with those of eight other investigations (5,21–27) of the effects of CAD on observer performance. By re-analyzing the results of these studies, we deduced general conclusions that are not limited to a particular computer aid or imaging task, as these were different in each of the studies; rather, our conclusions pertain to CAD in general. We used the accuracy indices (Az in all studies except one) and corresponding SDs that were reported in the original investigations as measures of diagnostic accuracy and variability. The results (Table 6) showed that accuracy was always higher and that its SD was always smaller when a computer aid was used; these results indicated that accuracy was consistently improved and that variability was consistently reduced when a computer aid was used. Although these studies were not specifically designed to measure the effect of a computer aid on reader variability, the clear trend of a reduction in reader variability in all of these nonuniformly designed studies indicates that the reduction is likely a consequence of, rather than a coincidental finding with, use of a computer aid.
Impediments to the clinical use of CAD include the radiologic community's underestimation of the extent of individual variability in daily practice and the effects that missing important low-prevalence events or overreacting to common benign conditions has on screening. Recent studies have focused attention on these problems and have created an appreciation of the need for more standardization of the observer's role in the screening process.
In summary, a CAD joint reading could promote agreement and eliminate some of the extreme or erroneous diagnostic opinions. Both of these outcomes are highly desirable in the medical, social, and economic contexts of breast cancer screening in an asymptomatic population.
Two major conclusions can be drawn from our data and our findings from analysis of nine independent observer-performance studies (5,8,21–27): CAD can improve diagnostic performance, and CAD can simultaneously reduce interpretation variability. These beneficial effects are possible because CAD can help radiologists to avoid performing biopsy in benign lesions, while it increases, rather than decreases, the number of correct diagnoses of cancers. The second capability is a substantial enhancement to the known potential of CAD, which has been demonstrated in several studies. Our findings suggest that if CAD is incorporated into clinical radiology, improvements in both accuracy and consistency in image interpretation can be expected. Patients and referring physicians would agree that both of these goals are highly desirable. These goals support the intention of the Breast Imaging Reporting and Data System lexicon introduced by the American College of Radiology and the Mammography Quality Standards Act, that is, to improve the daily practice and results of breast cancer screening by fostering more uniform interpretations.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
3 Current address: Ctr for Statistical Sciences, Brown Univ, Providence, RI.
This work was performed as part of the International Digital Mammography Development Group. The contents of this article are solely the responsibility of the authors and do not necessarily represent the official views of any of the supporting organizations.
Abbreviations: Az = area under the ROC curve, CAD = computer-aided diagnosis, ROC = receiver operating characteristic
Author contributions: Guarantor of integrity of entire study, Y.J.; study concepts, Y.J., R.M.N., R.A.S., K.D.; study design, Y.J., R.M.N., R.A.S.; literature research, Y.J., A.Y.T.; experimental studies, Y.J.; data acquisition, Y.J.; data analysis/interpretation, Y.J., A.Y.T.; statistical analysis, A.Y.T.; manuscript preparation and editing, Y.J.; manuscript definition of intellectual content, revision/review, and final version approval, all authors.
| REFERENCES |
|---|