Breast Imaging
1 From the Department of Radiology, University of Michigan Medical Center, CGC B2102, 1500 E Medical Center Dr, Ann Arbor, MI 48109-0904 (L.H., B.S., M.A.H., H.P.C., M.A.R., C.P., C.B., J.B., K.K., M.F., S.K.P., D.A., A.V.N., J.S.); and Center for Devices and Radiological Health, U.S. Food and Drug Administration, Rockville, Md (N.P.). Received December 10, 2004; revision requested February 3, 2005; revision received May 5; accepted June 13; final version accepted September 12. Supported by USAMRMC grants DAMD17-98-1-8211, DAMD17-02-1-0489, and DAMD17-02-1-0214 and USPHS grant CA95153. Address correspondence to L.H. (e-mail: lhadjisk{at}umich.edu).
| ABSTRACT |
|---|
Materials and Methods: The data collection protocol had institutional review board approval. Patient informed consent was waived for this HIPAA-compliant retrospective study. Ninety temporal pairs of two-view serial mammograms (depicting 47 malignant and 43 benign biopsy-proved masses) were obtained from 68 patient files and were digitized. Biopsy was the reference standard. Eight Mammography Quality Standards Act of 1992–accredited radiologists and two breast imaging fellows assessed digitized two-view temporal pairs (in preselected regions of interest only) by estimating likelihood of malignancy and Breast Imaging Reporting and Data System (BI-RADS) category without and with CAD. Observers' rating data were analyzed with the Dorfman-Berbaum-Metz (DBM) multireader multicase method. Statistical significance of differences was estimated with the DBM method and the Student two-tailed paired t test.
Results: Average area under the receiver operating characteristic curve for likelihood of malignancy across the 10 observers was 0.83 (range, 0.74–0.88) without CAD and improved to 0.87 (range, 0.80–0.92) with CAD (P < .05). The average partial area index above a sensitivity of 0.90 for likelihood of malignancy was 0.35 (range, 0.13–0.54) without CAD and 0.49 (range, 0.18–0.73) with CAD, a nonsignificant improvement (P = .11). For BI-RADS assessment, it was estimated that with CAD, six radiologists would correctly recommend additional biopsies for malignant masses (range, 4.3%–10.6%) and five would correctly recommend reduction of biopsy (ie, fewer biopsies) for benign masses (range, 2.3%–9.3%). However, five radiologists would incorrectly recommend additional biopsy for benign masses (range, 2.3%–14.0%), and one would incorrectly recommend reduction of biopsy (4.3%).
Conclusion: CAD involving interval change analysis of preselected regions of interest can significantly improve radiologists' accuracy in classifying masses on digitized screen-film mammograms as malignant or benign.
© RSNA, 2006
| INTRODUCTION |
|---|
Multiple views and multiple prior studies are routinely reviewed by radiologists in forming a mammographic interpretation. The use of serial mammograms for evaluating interval changes has been found to increase the sensitivity of breast cancer detection (5,8). Recently, Burnside et al (9) and Sumkin et al (10) reported that in a diagnostic setting, comparison with results of a prior examination significantly (P < .001) increased the overall cancer detection rate and the correct recall of patients for additional procedures.
In recent years, a number of computer-aided diagnosis (CAD) techniques for characterization of mammographic lesions in a single mammographic examination were developed (5,11–18). Receiver operating characteristic (ROC) studies were performed to evaluate the effects of CAD on radiologists' accuracy in the characterization of malignant and benign masses (18,19) and microcalcification clusters (16) on single- and multiple-view mammograms obtained in a single examination. In all of these studies, the radiologists' performance in terms of the area under the ROC curve (Az) achieved statistically significant improvement (P < .05) when they read with computer aid versus when they read without aid.
We previously (20) developed a classification scheme based on mammograms from multiple examinations. The classifier combines prior and current information that is automatically extracted from masses on serial mammograms. It performed significantly better (P = .015) in terms of Az than did the classifier that used current information alone. We conducted an observer performance study in which radiologists estimated the likelihood of malignancy for masses on single-view serial mammograms (21). The accuracy of the radiologists in the characterization of malignant and benign temporal pairs was significantly improved (P = .005) with CAD versus without CAD.
An important question that we attempted to address in our current study is to what extent CAD influences radiologists' diagnostic recommendations when more mammographic information is available for a case. Thus, the purpose of our study was to retrospectively evaluate the effects of a CAD program that involves an interval change classifier (a classifier that uses interval change information extracted from prior and current mammograms and estimates a malignancy rating) on radiologists' accuracy in characterization of masses on two-view serial mammograms as malignant or benign.
| MATERIALS AND METHODS |
|---|
For 17 patients, the current mammograms were diagnostic images. There were two patients in whom both the current and the prior mammograms were acquired during diagnostic examinations. Six of the 17 patients with diagnostic images had a palpable mass. The cases were collected from our database of patients who had undergone breast biopsy in our department. We used all available cases that satisfied the above criteria. The current mammograms for the cases used in the study were collected (by L.H., H.P.C., B.S., and N.P.) from December 1995 to February 2001. The patients ranged in age from 37 to 86 years (mean age, 59.9 years).
The mammograms were digitized by using a LUMISCAN 85 laser scanner (Lumisys, Los Altos, Calif) at a pixel resolution of 50 x 50 µm and 4096 gray levels. Because masses are large and relatively noisy objects, they do not require high spatial resolution, and their characterization may be improved by reducing the noise. The images were smoothed by using a 2 x 2 box filter and were down sampled by a factor of two, resulting in images with a pixel size of 100 x 100 µm for further analysis. The pathologic nature of all masses was proved with biopsy. Biopsy results were considered the reference standard.
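The smoothing and down-sampling step described above can be sketched as follows. This is only an illustration, not the authors' implementation: it assumes the 2 x 2 box filter and the factor-of-two down-sampling are applied together as an average over non-overlapping 2 x 2 pixel blocks, and the function name `box_smooth_downsample` is hypothetical.

```python
def box_smooth_downsample(image):
    """Average non-overlapping 2 x 2 blocks of a 2-D list of gray levels.

    Assumption: the box filter and the down-sampling are aligned, so each
    output pixel (e.g. 100 x 100 um) is the mean of four input pixels
    (e.g. 50 x 50 um each).
    """
    rows = len(image) - len(image) % 2        # ignore a trailing odd row
    cols = len(image[0]) - len(image[0]) % 2  # ignore a trailing odd column
    out = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            block_sum = (image[r][c] + image[r][c + 1]
                         + image[r + 1][c] + image[r + 1][c + 1])
            out_row.append(block_sum / 4.0)
        out.append(out_row)
    return out
```

Averaging four pixels reduces uncorrelated pixel noise while halving the resolution, which matches the stated rationale that masses do not require high spatial resolution.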
The true mass locations on all mammograms were identified by a Mammography Quality Standards Act of 1992–accredited radiologist (M.A.H.) with 17 years of experience in reading mammograms. When a mass was not discretely visible on the previous mammogram, the radiologist estimated the area where the mass would develop by comparing the previous mammogram with the current mammogram. Prior mammograms for all study cases had been interpreted prospectively in terms of Breast Imaging Reporting and Data System (BI-RADS) lexicon categories 1, 2, and 3 by the radiologists who initially interpreted the studies at the time of the patients' clinical examination. A region of interest centered at the identified mass location and containing the mass or estimated mass area (for masses that were not discretely visible on the prior mammogram) was extracted from each mammogram. The sizes of the regions of interest were variable to enclose masses and estimated mass areas of different sizes and were large enough to include a strip of breast parenchyma of at least 5 mm in width surrounding the mass. The region-of-interest images were used in the computerized analysis and observer study. They are referred to as the mammograms in the following discussion.
A total of 300 mammograms containing CC and MLO views from serial examinations were obtained from the data set; from these, 90 two-view temporal pairs were formed, of which 47 were malignant and 43 were benign. A two-view temporal pair consisted of four mammograms: the CC and MLO views from a prior examination and the CC and MLO views from a current examination in the same patient. For the purpose of computerized analysis, two temporal pairs were obtained from each two-view temporal pair: the CC temporal pair (combining the current and prior CC views) and, similarly, the MLO temporal pair.
All masses had been sampled with core needle or excisional biopsy to establish their histologic features during the patients' clinical care. The average size (ie, longest diameter) of the malignant masses, at retrospective review, was 7.7 mm (range, 3–22 mm) on the prior mammograms and 12.5 mm (range, 4–42 mm) on the current mammograms. The corresponding sizes were 9.7 mm (range, 4–23 mm) and 11.6 mm (range, 5–30 mm), respectively, for the benign masses. The average size of the current and prior masses was estimated by using the two-view current and prior mammograms, respectively, on which the mass was visible in at least one view (CC or MLO). The average size of the current masses was estimated by using all temporal pairs. The average size of the prior masses was estimated by using 81 temporal pairs (for nine of the malignant temporal pairs, the mass was not visible on any of the CC or MLO views from prior mammograms).
Seven additional two-view temporal pairs depicting normal dense breast tissue that was deemed to mimic mammographic masses by the experienced radiologist (M.A.H.) were mixed into the data set read by the radiologists in the observer study. The seven two-view temporal pairs depicting normal breast tissue were randomly selected from images of the contralateral breast of seven of the 68 patients whose data were included in this study. The observers were informed of the presence of normal tissues, but the proportion was not disclosed. In this way, a slightly more realistic clinical situation in which the observers would have to distinguish normal breast tissue from masses, as well as malignant from benign masses, was simulated. The BI-RADS category 1 (negative) could be chosen. However, the ratings for the normal tissue pairs were not included in the data analyses because this study focused on the characterization of benign versus malignant masses rather than on differentiation of true masses from false masses.
The temporal pairs had a time interval of 6–48 months (Fig 1). When the radiologist identified the location of the mass, he or she also rated the visibility of the masses on the mammograms relative to the visibility of masses encountered in clinical practice on a 10-point scale, with 1 representing the most obvious and 10 the most subtle masses.
The two-view classifier was designed on the basis of the single-view CC and MLO temporal pairs classification. A "leave-one-case-out" resampling method was used to obtain test scores for the temporal pairs. Sixty-eight training-test partitions were obtained. Stepwise feature selection was applied to the training subsets to reduce the size of the input feature space. A test classifier score was obtained for each single-view CC or MLO temporal pair. For the application to two-view analysis, we obtained a single score for each two-view temporal pair by merging the test classifier scores of the corresponding CC and MLO single-view temporal pairs. We compared three different ways to merge the CC and the MLO single-view temporal pair scores: selecting the minimum between the two, selecting the maximum between the two, and calculating the average of the two.
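The resampling and score-merging steps above can be illustrated with a short sketch. It is not the authors' code: the function names are hypothetical, and it assumes that a "case" is identified by a per-patient label so that all temporal pairs from one case are held out together.

```python
def leave_one_case_out(case_ids):
    """Yield (train_indices, test_indices), holding out one case at a time.

    case_ids[i] is the (assumed) case label of temporal pair i, so all
    pairs that share a label land in the same test partition.
    """
    for held_out in sorted(set(case_ids)):
        train = [i for i, c in enumerate(case_ids) if c != held_out]
        test = [i for i, c in enumerate(case_ids) if c == held_out]
        yield train, test


def merge_scores(cc_score, mlo_score, rule="average"):
    """Merge the CC and MLO single-view test scores into one two-view score."""
    rules = {
        "minimum": min,
        "maximum": max,
        "average": lambda a, b: (a + b) / 2.0,
    }
    return rules[rule](cc_score, mlo_score)
```

For example, `merge_scores(-1.0, 2.0)` gives `0.5` under the averaging rule, while the `"minimum"` and `"maximum"` rules return `-1.0` and `2.0`, respectively.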
Relative Computer Malignancy Rating of Masses
In our observer study, the average test classifier scores were linearly transformed to a scale from 1 to 10. The scores were rounded to the closest integer before being presented to the radiologists. This scale was more practical and intuitive for the observers than the original classifier scores, which were real numbers ranging from -3.5 to +2.6. Gaussian functions were fitted to the distributions of the transformed scores of the malignant and benign masses to yield an estimation of the classifier performance. The accuracy of the fit was estimated by using the Kolmogorov-Smirnov test. The radiologists evaluated the temporal pairs by using a graphical user interface. When the radiologist evaluated the cases by using CAD, the fitted distribution was displayed on the interface as a reference.
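A minimal sketch of the rescaling described above follows. The exact mapping is not given in the text, so this assumes that the lowest original score (taken here to be -3.5) maps to 1 and the highest (+2.6) maps to 10; the function name and default bounds are illustrative only.

```python
import math


def to_ten_point_scale(score, low=-3.5, high=2.6):
    """Linearly map a classifier score from [low, high] onto [1, 10].

    Assumption: low -> 1 and high -> 10. Rounding is to the closest
    integer, with halves rounded up (Python's round() would instead
    round halves to the nearest even integer).
    """
    scaled = 1.0 + 9.0 * (score - low) / (high - low)
    return int(math.floor(scaled + 0.5))
```

Under these assumptions, a raw score of 0.0 would be shown to the radiologist as a relative malignancy rating of 6.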
Observer Performance Study
In the observer performance study, both the CC and MLO view temporal pairs for each mass (Fig 2) were presented to the radiologist at the same time on the workstation. The 100 x 100-µm pixel size images were displayed. The radiologist evaluated the displayed two-view temporal pairs and provided an assessment by using two methods: First, an estimate of the likelihood of malignancy on a 100-point scale (where a score of 1 indicates a benign mass; a score of 100, a mass with a high likelihood of malignancy) and second, an assessment of the mass based on the BI-RADS malignancy ratings (where a score of 1 indicates a negative mass; a score of 2, a benign mass; a score of 3, a probably benign mass; a score of 4, a suspicious mass; and a score of 5, a mass highly suggestive of malignancy). The use of BI-RADS category 0 was not allowed, forcing the radiologist to make a "final" assessment.
Eight Mammography Quality Standards Act of 1992–accredited radiologists (M.A.R., C.B., C.P., J.B., K.K., S.K.P., D.A., and A.V.N.) with experience in mammography that ranged from 3 to 24 years and two breast imaging fellows (M.F. and J.S.) participated as observers in this study. The radiologist (M.A.H.) who selected the cases and identified the masses did not participate in the observer experiment.
The 90 two-view temporal pairs of masses and the seven temporal pairs of normal tissue were divided into two case groups. Each observer read the 194 cases (97 cases times two reading conditions) in two reading sessions that were separated by at least 1 month. In each session, one case group was read with the independent mode and the other was read with the sequential mode. The order of the two reading conditions was switched between the reading sessions, and the order of the cases within each case group was randomized for each observer. The orders of the case groups and the reading conditions were arranged in a counterbalanced design such that no one case group or reading condition would be read or applied first more often than another when averaged over all observers.
Additional Comparisons
The observer performance results for the two-view temporal pairs of masses were compared with those for the single-view temporal pairs (L.H.). For this comparison, the radiologists' ratings for the 180 single-view temporal pairs (90 CC and 90 MLO views) obtained from a previous single-view experiment (21) were analyzed. These single-view temporal pairs corresponded to the same 90 two-view temporal pairs in the current experiment, and the 10 readers in the previous study were the same readers as in the current two-view experiment. The single-view observer study was performed separately 3 months before the two-view observer study so that the effects of memorization or learning would be minimal.
A further comparison was made (L.H.) by deriving a set of simulated two-view temporal pair observer ratings for the same 90 masses by artificially combining the two single-view observer ratings (by averaging and rounding) into a two-view rating for each mass. The classification performance achieved by using the 90 simulated two-view ratings was then compared with that achieved by using the 90 two-view ratings directly from the radiologists' reading in the current experiment. This comparison provided some insight as to whether the radiologists might use a "worst-case scenario" in estimating the likelihood of malignancy on the basis of the two views of the mass.
Statistical Analysis
The observer performances were analyzed (by L.H., B.S., H.P.C., and N.P.) in terms of the likelihood of malignancy, as well as in terms of the BI-RADS ratings, for the different modalities and the different radiologists. The Dorfman-Berbaum-Metz multireader multicase method (27) was applied to the radiologists' likelihood-of-malignancy ratings. The Dorfman-Berbaum-Metz method takes into account both the observer and the case sample variations by means of an analysis of variance approach so that the results of this analysis can be generalized to the population of observers, as well as to the population of case samples. The ROC curve was derived from a maximum likelihood estimation of the binormal distributions fitted to the observers' rating data, and the Az value and the partial area index (28) above a sensitivity threshold of 0.90 (Az(0.90)) were calculated. The statistical significance of the difference between the studied modalities was estimated by using the Dorfman-Berbaum-Metz method and the Student two-tailed paired t test for the observer-specific paired data. Additionally, we used the Obuchowski method (29) for analysis of clustered data. The Obuchowski method, which was also generalized by Lee and Rosner (30) for multireader, multimodality studies, accounts for the possible correlations between temporal pairs when multiple prior examinations are available and more than one temporal pair is formed from the multiple serial examinations of the same patient. The method is nonparametric and is robust to a variety of intracluster correlation patterns, as well as to nonnormally distributed test results.
The radiologists' recommended action for a given mass was determined by the BI-RADS rating provided in the observer experiment. We considered two different groupings, callback and biopsy, as follows. For the callback grouping, the cases with BI-RADS ratings of 1 or 2 were grouped as "normal," and cases with BI-RADS ratings of 3, 4, or 5 were grouped as callback. For the biopsy grouping, the cases with BI-RADS ratings of 1, 2, or 3 were grouped as "no biopsy," and cases with ratings of 4 and 5 were grouped as "biopsy recommended." After the radiologist evaluated the case with CAD, he or she could change the BI-RADS rating. For the callback grouping, if the BI-RADS rating for a case was changed from 1 or 2 to 3, 4, or 5, the change was considered to be from normal to callback. If the BI-RADS rating for a case was changed from 3, 4, or 5 to 1 or 2, the change was considered to be from callback to normal. Similarly, for the biopsy grouping, a change of the BI-RADS rating for a case from 1, 2, or 3 to 4 or 5 was considered to be a change from no biopsy to biopsy. If the rating for a case was changed from 4 or 5 to 1, 2, or 3, the change was considered to be from biopsy to no biopsy. The changes were counted over all of the cases and all observers and finally averaged by the number of observers. The McNemar test was used to evaluate the statistical significance of changes for the individual radiologists. For all analyses, a P value of less than .05 was considered to indicate a significant difference.
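The callback and biopsy groupings, and the counting of rating changes between a reading without CAD and a reading with CAD, can be sketched as below. The function names are hypothetical, and the sketch handles one observer's ratings for one set of cases (for example, the malignant masses only).

```python
def recommends(birads, grouping):
    """Return True if a BI-RADS rating falls in the positive group.

    callback: ratings 3, 4, or 5; biopsy: ratings 4 or 5.
    """
    threshold = {"callback": 3, "biopsy": 4}[grouping]
    return birads >= threshold


def count_changes(without_cad, with_cad, grouping):
    """Count cases that moved into and out of the positive group with CAD.

    Returns (added, removed): cases changed from negative to positive,
    and cases changed from positive to negative.
    """
    added = removed = 0
    for before, after in zip(without_cad, with_cad):
        was = recommends(before, grouping)
        now = recommends(after, grouping)
        if now and not was:
            added += 1
        elif was and not now:
            removed += 1
    return added, removed
```

Applied to malignant cases, `added` counts correct additional recommendations and `removed` counts incorrect reductions; applied to benign cases, the interpretation is reversed. Dividing a count by the number of cases in the set gives a percentage change of the kind reported for individual radiologists.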
| RESULTS |
|---|
On average, seven features were automatically selected by the stepwise feature selection for each training-test partition. The seven features included two difference run-length statistics features, three current run-length statistics features, one spiculation feature from the current image, and one spiculation feature from the prior image.
The classification accuracy of the two-view computer classifier in terms of Az was 0.90 for the test data set. We found that the classifier accuracy was the highest when the CC and MLO single-view scores were averaged. This was consistent with our previous experience with (19) and reports in the literature about (18) merging scores in the case of multiple views. The average score was thus used in this observer study. The classifier test scores were linearly transformed to scores between 1 and 10 from the original range (-3.5 to +2.6). The differences between the Gaussian fitted distributions and the transformed scores were not statistically significant for either malignant (P = .31, Kolmogorov-Smirnov test) or benign (P = .98, Kolmogorov-Smirnov test) masses.
Two-View Observer Performance
The two-view observer performance results are presented in Tables 1–4. The average ROC curves for the observers were obtained by averaging the fitted a and b parameters of the individual radiologist's ROC curve for each mode and then calculating an ROC curve from the average a and b parameters. The parameter a represents the vertical intercept, and b represents the slope of the fitted binormal ROC curve when it is plotted as a straight line on normal deviate axes. The average Az for radiologists was 0.83 for the independent mode, 0.82 for the sequential mode without CAD, and 0.87 for the sequential mode with CAD (Table 1, Fig 3). The observer performance for the reading with CAD was significantly improved compared with that for the independent reading mode (P = .03, Student paired t test; P < .05, Dorfman-Berbaum-Metz method; P = .01, Obuchowski method) and with that for the sequential mode without CAD (P < .01, Student paired t test; P < .01, Dorfman-Berbaum-Metz method; P = .01, Obuchowski method). There was a slight decrease in the performance for the sequential mode without CAD compared with the independent mode; however, the difference was not statistically significant (P = .10, Student paired t test; P = .89, Dorfman-Berbaum-Metz method; P = .52, Obuchowski method).
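The construction of an average ROC curve from the binormal parameters can be sketched as follows. This illustration uses the standard binormal model, in which the curve is TPF = Φ(a + b·Φ⁻¹(FPF)) and the area under it is Φ(a / √(1 + b²)), where Φ is the standard normal cumulative distribution function. The function names are hypothetical, and the sketch does not reproduce the maximum likelihood fitting that yields a and b.

```python
from math import sqrt
from statistics import NormalDist, mean

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def average_binormal(params):
    """Average the fitted (a, b) pairs, one pair per radiologist."""
    return mean(a for a, _ in params), mean(b for _, b in params)


def tpf(fpf, a, b):
    """True-positive fraction on the binormal ROC curve, for 0 < fpf < 1."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))


def az(a, b):
    """Area under the binormal ROC curve with intercept a and slope b."""
    return _STD_NORMAL.cdf(a / sqrt(1.0 + b * b))
```

Note that the area computed from the averaged parameters is generally not identical to the mean of the individual radiologists' Az values, so the two should not be assumed interchangeable.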
The average partial area index above a sensitivity of 0.90, Az(0.90), was 0.35 for the independent mode, 0.30 for the sequential mode without CAD, and 0.49 for the sequential mode with CAD (Table 2). There was an improvement in observer performance when reading with CAD compared with reading without CAD. However, only the improvement with the sequential mode with CAD versus the sequential mode without CAD was significant (P < .01, Student paired t test). When we compared results for the sequential mode without CAD with those for the sequential mode with CAD, we found that eight of the radiologists improved their performance with CAD in the high sensitivity range. For five of them, this improvement was significant (P < .05). For one radiologist, Az(0.90) did not change with CAD, and for another radiologist Az(0.90) declined but the difference was not significant (P = .7). When we compared results for the independent mode with those for the sequential mode with CAD, we found that seven radiologists achieved an improvement with CAD, and for three of them the improvement was significant (P < .001). For the remaining three radiologists, Az(0.90) declined with CAD, but not significantly (P > .58).
The radiologists' BI-RADS assessments for the three reading modes are shown in Tables 3 and 4. When we compared results for the sequential mode with CAD with those for the independent mode (Table 3), we found that four radiologists showed an increase in correct recommendations for callback with CAD (range, 2.1%–6.4%). For four radiologists there was also a correct recommendation for reduction of callback (ie, fewer callbacks) (range, 2.3%–4.6%). At the same time, however, for three radiologists there was an increase in incorrect callback for benign masses (range, 4.6%–9.3%), and for two radiologists there was an incorrect reduction of callback for malignant masses (2.1%). When we compared results for the sequential mode without CAD with those for the sequential mode with CAD, we found that five radiologists showed an increase in correct recommendations for callback with CAD (range, 2.1%–4.3%) and three radiologists made correct recommendations for reduction of callback (range, 2.3%–4.6%). Two radiologists recommended additional incorrect callbacks for benign masses (range, 4.6%–14.0%).
The correct biopsy recommendations were increased when radiologists read with CAD (Table 4). When the sequential mode with CAD was compared with the independent mode, six radiologists recommended additional correct biopsies of malignant masses (range, 4.3%–10.6%), and five radiologists correctly recommended reduction of biopsy (ie, fewer biopsies) for benign masses (range, 2.3%–9.3%). However, at the same time, five radiologists incorrectly recommended additional biopsy for benign masses (range, 2.3%–14.0%), and, for one radiologist, there was an incorrect reduction of biopsy (4.3%). When the sequential mode without CAD was compared with the sequential mode with CAD, for seven radiologists there was an increase in correct recommendations for biopsy (range, 2.1%–12.8%) and for three radiologists there was an increase in correct recommendations for reduction of biopsy (range, 2.3%–7.0%). For four radiologists, there were additional incorrect biopsy recommendations for benign masses (range, 2.3%–18.6%).
The above BI-RADS assessment results were not significantly different for most of the radiologists (McNemar test was performed for the results of each radiologist). Exceptions were for radiologists 9 and 4 for both callback and biopsy when the sequential mode without CAD was compared with the sequential mode with CAD (Tables 3, 4). Radiologist 9 had a significant increase in correct biopsy recommendations (P = .041), and radiologist 4 had a significant increase in incorrect reductions of both callback (P = .041) and biopsy (P = .013) recommendations.
Two-View Simulated Reading
We compared the use of minimum, maximum, and average to combine the observer ratings for the CC and MLO single-view temporal pairs into a two-view rating. It was found that the combined ratings obtained from averaging the CC and MLO single-view ratings achieved higher classification accuracy (in terms of Az values) than those of the combined ratings obtained from the minimum or maximum. This was also consistent with our previous experience (19) and with the literature (18) about merging scores in the case of multiple views. The observer results for the simulated reading of 90 two-view temporal pairs obtained by averaging the single-view ratings and the matched 180 single-view temporal pairs are presented in Tables 5–7.
Az(0.90) values for the simulated reading of two-view temporal pairs in independent mode, the sequential mode without CAD, and the sequential mode with CAD were 0.31, 0.35, and 0.44, respectively (Table 5). There was an improvement in Az(0.90) for the reading with CAD versus the readings without CAD. However, only the improvement compared with the sequential mode without CAD was significant (P < .02, Student paired t test).
The analysis of the BI-RADS assessments (Table 6) for the simulated two-view reading showed that, when reading with CAD was compared with the independent mode, three radiologists would correctly increase callback (2.1%) for malignant cases and four radiologists would correctly reduce callback (range, 2.3%–9.3%) for benign cases. Two radiologists would incorrectly increase the callback (4.7%). When we compared results for the sequential mode with CAD with those for the sequential mode without CAD, we found that three radiologists would call back additional malignant cases correctly (2.1%) and two radiologists would correctly reduce the callback of benign cases (2.3%). One radiologist would incorrectly increase the callback (2.3%).
| DISCUSSION |
|---|
When we analyzed the radiologists' performance individually, we observed that some of them improved their Az performance with CAD, some of them showed no change in performance, and one showed a decline in performance. The largest portion of the radiologists improved their Az when reading with CAD. For at least half of them, the improvement was statistically significant (P < .04). A few radiologists did not change their Az performance with and without CAD, and for one radiologist the Az declined with CAD; however, the difference was not significant (P > .20).
The analyses of the changes of the BI-RADS assessments between the different modes did not reveal significant differences for most of the radiologists (the McNemar test was performed for the results of each radiologist). When we compared evaluation with CAD with evaluation without CAD we found that there were a number of radiologists who would recommend correct increase of callback or biopsy for malignant masses and correct reduction of callback or biopsy for benign masses. At the same time, there were also a number of radiologists who would recommend incorrect increase of callback or biopsy for benign masses and incorrect reduction of callback or biopsy for malignant masses. In general, however, the number of radiologists with correct recommendations was greater than the number of radiologists with incorrect recommendations.
The comparison of the results of the actual two-view reading to those of the simulated two-view reading showed that for the sequential mode with CAD, the Az values were close (0.87 and 0.88, respectively). Performance with the independent mode of two-view and simulated two-view readings was the same, with Az values of 0.83 for both. For the sequential modes without CAD, the Az value from the simulated two-view reading (ie, 0.86) was higher than that from the actual two-view reading (ie, 0.82). However, the difference was not significant (P = .08).
In the report of our previous ROC study with single-view temporal pairs (21), we discussed the fact that the observed change in the performance of the radiologists' reading in independent mode compared with reading in the sequential mode without CAD may reflect the possibility of subtle change in the behavior when that behavior is being studied. This phenomenon has also been observed and discussed by other researchers (31–34). In the current observer study of reading the two-view temporal pairs, this phenomenon is not as obvious. When the likelihood-of-malignancy ratings for the reading modes without CAD were analyzed, there was a slight decrease for the sequential mode without CAD (Az = 0.82) compared with the independent mode (Az = 0.83). The difference was not significant (P = .10, Student paired t test; P = .89, Dorfman-Berbaum-Metz method).
We did not observe a relationship between the radiologists' performance and their years of experience in breast imaging.
There were some limitations in our study. Ideally a classifier should be developed on the basis of a training data set and then be applied to an independent data set that is used to evaluate the radiologists' performance in the observer study. However, we were limited by the size of the data set of temporal pairs that were collected. A hold-out method of splitting the data set into training and testing subsets would have reduced the statistical power of the study. We therefore employed a leave-one-case-out resampling method to develop and test our classifier, and the resampled test set was used for the observer performance study. The leave-one-out resampling method is well established in the pattern-recognition literature as a statistically valid technique for estimating classifier performance in an unknown population. The test scores of the classifier were presented to the radiologists in the observer study. Furthermore, the purpose of this study was not to measure the absolute performance of the radiologists in comparison with the absolute performance of the classifier. Rather, our goal was to demonstrate that there was a relative improvement in the radiologists' performance when they used a computer classifier that had a reasonable performance as a second opinion. We believe that the use of a different data set will not change the conclusions, as long as the computer classifier has a reasonable performance.
Another limitation of the study was the fact that the radiologists evaluated regions of interest containing the masses, but not the entire breast. Although this was not a lesion-detection study, there is a possibility that radiologists' characterization accuracy without and with CAD might be different if the whole mammogram was evaluated rather than only a region of interest. On the other hand, if the whole mammogram is displayed, it is possible to have mixed effects from other confounding factors such as breast density, additional lesions, or the fact that different radiologists may use the breast parenchymal information to different extents, which would be difficult to account for.
A third limitation was the fact that we did not allow readers to use BI-RADS category 0. In clinical practice, the radiologists would require additional information such as that yielded by spot-compression magnification mammograms and ultrasonography before using BI-RADS category 3 or greater. However, in our ROC experiment, our purpose was to evaluate the effects of CAD on radiologists' assessment of masses on two-view temporal pair mammograms. The radiologists were asked to make a decision within the BI-RADS categories 1–5 on the basis of the information available on the serial mammograms. Although our study design did not take into account many possible factors in clinical practice, this is the first step in evaluating the effect of our CAD system within a focused goal. An ROC study with a limited goal also provides the advantage of gaining insight into the effects of individual factors without the presence of other confounding factors that mask the individual effects. It is noted that the results of a limited laboratory ROC study may not be directly applicable to clinical practice. Large prospective clinical trials will be needed to evaluate the effect of CAD on radiologists' diagnostic decisions in clinical settings.
In conclusion, we performed an observer study to evaluate the effects of CAD on radiologists' characterization of masses on two-view serial mammograms. We compared the performances of the radiologists with and without CAD when the available diagnostic information was increased, that is, for two-view temporal pairs versus single-view temporal pairs. Our results demonstrate that at both the two-view and the single-view readings there was an improvement in the radiologists' performance when they were assisted by a computer classifier that had a performance in the range that we had studied.
| FOOTNOTES |
|---|
Abbreviations: Az = area under the ROC curve, Az(0.90) = partial area index above a sensitivity threshold of 0.90, BI-RADS = Breast Imaging Reporting and Data System, CAD = computer-aided diagnosis, CC = craniocaudal, MLO = mediolateral oblique, ROC = receiver operating characteristic
Authors stated no financial relationship to disclose.
Author contributions: Guarantor of integrity of entire study, L.H.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; literature research, L.H., B.S., M.A.H., H.P.C.; experimental studies, M.A.H., M.A.R., C.P., C.B., J.B., K.K., M.F., S.K.P., D.A., A.V.N., J.S.; statistical analysis, L.H., B.S., H.P.C., N.P.; and manuscript editing, all authors