Breast Imaging
1 From the Departments of Radiology (D.G., J.S.S., L.A.H., B.Z., J.H.S., D.M.C., B.E.S.) and Biostatistics (H.E.R.) and Magee-Womens Hospital (D.G., L.A.H., J.H.S., D.M.C., B.E.S.), University of Pittsburgh, 300 Halket St, Suite 4200, Pittsburgh, PA 15213-3180. Received February 16, 2004; revision requested April 20; revision received May 5; accepted May 24. Supported in part by grants CA77850 and CA84241 from the National Cancer Institute, National Institutes of Health, and also by the U.S. Army Medical Research Acquisition Center under contract DAMD17-00-1-0410. Address correspondence to D.G. (e-mail: gurd@upmc.edu).
| ABSTRACT |
|---|
MATERIALS AND METHODS: Two hundred nineteen film-based mammographic examinations, classified into five groups, were included in this study. Group 1 included 58 examinations in which verified malignant masses were detected during screening; group 2, 39 in which all available latest examinations were performed prior to diagnosis of these malignant masses (subset of 39 women from group 1); group 3, 22 in which findings were interpreted as negative but were verified as cancer within 1 year from the negative interpretation (missed cancers); group 4, 50 in which findings were negative and patients were not recalled for additional procedures; and group 5, 50 in which patients were recalled for additional procedures and findings were negative for cancer. In all examinations, images were processed with two Food and Drug Administration-approved, commercially available CAD systems and an in-house scheme. Performance levels in terms of true-positive detection rates and number of false-positive identifications per image and per examination were compared.
RESULTS: Mass detection rates in positive examinations (group 1) were 67%–72%. Detection rates among three systems were not significantly different (P > .05). In 50 negative screening examinations (group 4), false-positive rates ranged from 1.08 to 1.68 per four-view examination. Performance level differences among systems were significant for false-positive rates (P = .008). Performance of all systems was at levels lower than publicly suggested in some retrospective studies. False-positive CAD cueing rates were significantly higher for negative examinations in which patients were recalled (group 5) than they were for those in which patients were not recalled (group 4) (P
.002).
CONCLUSION: Performance of CAD systems for mass detection at mammography varies significantly, depending on the examination and the system used. Actual performance of all systems in the clinical environment can be improved.
© RSNA, 2004
Index terms: Breast neoplasms, diagnosis, 00.32; Cancer screening; Computers, diagnostic aid
| INTRODUCTION |
|---|
In recent years, major efforts have been expended to develop computer-aided detection (CAD) systems that will help radiologists with breast cancer detection. The hope is that these systems will serve as a second reader, help improve sensitivity without a substantial increase in recall rates, and possibly decrease reader variability as well. These systems are currently aimed at the early detection of cancer and are accordingly designed to assist the radiologist in detection of suspicious regions depicted as clustered microcalcifications and masses (9–11). Computer-aided diagnosis systems are also being developed to assist radiologists in the classification task, namely, the determination of whether or not an identified finding is likely to represent a malignancy (11–13). The Food and Drug Administration has approved several detection systems for routine clinical use, and Medicare and other insurance companies have approved reimbursement for their use in clinical practice.
Results of studies (14,15) suggest that the use of CAD systems could potentially increase cancer detection rates by as much as 20% without a significant increase in recall rates. To date, there are limited data on the actual effect of the prospective use of such systems in the clinical environment (16,17). There is some evidence that the performance of radiologists, at least in the laboratory setting, is affected by the performance of the CAD scheme itself (18). Hence, a high level of performance is an important factor in the ultimate clinical success of CAD.
Data for comparison of the performance of CAD systems applied to the same set of cases are limited (19–22). The purpose of our study, therefore, was to compare the performance of two FDA-approved, commercially available CAD systems and an in-house-developed scheme in five groups of sequentially acquired screening mammograms.
| MATERIALS AND METHODS |
|---|
The data sources for the selection of examinations were databases of procedure scheduling, procedure completion, radiology reporting, and procedure-related outcomes as determined from relevant pathology reports.
Group 1 included 58 examinations performed in women with biopsy-proved cancer that initially had been identified as a mass by a radiologist in our group during a screening examination in 2002. Images were selected sequentially from our procedure-related outcome database by a staff member (J.S.S.) who did not have any prior knowledge of the specific details about the patient or of the visual characteristics of the depicted mass.
In addition, we were interested in the performance of CAD systems applied to examinations performed 1 year prior to observation of a positive finding. Group 2 hence included the 39 available latest negative prior examinations, performed during or prior to 2001, that preceded the screening examination that led to a finding positive for cancer. These prior examinations came from a subset of 39 women in group 1 and were different examinations from those included in group 1.
Group 3 included 22 consecutive false-negative examinations in which images depicted masses in retrospect. In 21 examinations, one mass in each was depicted on images, and in one examination two masses were depicted, which produced a total of 23 masses. Findings in these examinations were defined in our practice as false-negative interpretations. Findings in these examinations had been interpreted as negative or benign (Breast Imaging Reporting and Data System category 1 or 2) during the screening examination and were biopsy proved as positive for cancer, with a mass depicted on subsequent mammograms obtained within 1 year of the negative examination. These examinations constitute a different set of cases and are not a subset of the 39 prior examinations described previously as group 2.
Group 4 included 50 verified negative examinations (Breast Imaging Reporting and Data System category 1 or 2) that were selected randomly, by the same staff member who selected those in group 1, from the examinations performed on two preselected dates (March 1 and 2, 2002). Findings in all of these examinations were verified with findings at a 1-year follow-up screening examination that were interpreted as negative. A 1-year follow-up examination was the latest available examination in these women.
Group 5 included 50 consecutive examinations in which patients had been recalled during April 2002 (Breast Imaging Reporting and Data System category 0). Results of the diagnostic work-up that followed were negative or benign (Breast Imaging Reporting and Data System category 1 or 2), and results of the work-up for the annual examination in 2003 were negative, as well.
As a result, a total of 219 examinations in 180 women were included in the study. The median age of the women whose examinations were used in this study was 54.5 years, with a range of 38–87 years.
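The totals above can be checked from the group sizes given in the text. The short Python sketch below is only an illustration of that arithmetic; it assumes that each examination in groups 1, 3, 4, and 5 comes from a different woman (which is consistent with the stated total of 180 but is not spelled out in the text) and that group 2 adds examinations but no new women.

```python
# Examination counts per group, as stated in the text.
group_exams = {1: 58, 2: 39, 3: 22, 4: 50, 5: 50}

# Assumption: group 2 is a subset of the women in group 1, so it
# contributes examinations but no additional women.
groups_adding_women = [1, 3, 4, 5]

total_exams = sum(group_exams.values())
total_women = sum(group_exams[g] for g in groups_adding_women)

print(total_exams, total_women)  # 219 180
```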
Evaluation of Masses
All examinations were reviewed by several investigators (D.G., J.H.S., L.A.H., J.S.S.) together with source documents to generate a truth file that included depicted findings for the examinations in question. The boundaries of the masses were drawn subjectively and conservatively (approximately 5 mm larger than the depicted masses in all directions) on the image obtained at the examination performed in 2002 that resulted in the finding and on the corresponding areas on the images obtained at the prior examinations, when applicable. If masses were depicted with spiculations, these were included in the mass region. Hence, the allowed target was larger in all directions than was the depicted mass. The choice of an enlarged target was arbitrary and increased the marked regions, in some cases substantially. It was made with the expectation that any identification (detection) by the CAD system close to the actual mass would not be disregarded by the interpreting radiologists. The larger allowed target also meant that position changes at the prior examination were accounted for more conservatively.
For each examination, processing was performed with three CAD systems. One system (ImageChecker M1000, version 3.1; R2 Technologies) was used routinely in our clinical practice and was the system with which processing had been performed in all of the examinations during the original clinical interpretation. Another system (Second Look, version 6.0 Beta; CADx Systems, Beavercreek, Ohio) was used to process all images as well. A third system was an in-house-developed scheme, and its use has been reported in the past (23–25).
To ensure that there was no bias in the results, with the exception of the fact that the initial selection may have been affected somewhat by the use of the system that we used during the initial clinical interpretation, we fixed the detection threshold for determination of suspicious regions on the in-house system. This was done to provide a binary output in our own scheme (identified regions were either marked or not marked), which was similar to that of the commercial system, rather than a continuous output (0–1). Hence, we provided an automated operation (no operator decisions or options) to an experienced staff member (J.S.S.) who had processed images in several thousands of examinations with both commercial systems during the past 3 years and who processed all the images used in this study with all three systems.
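As a rough illustration of how a fixed threshold turns a continuous score in the range 0–1 into a marked or not-marked output, consider the Python sketch below. It is not the in-house scheme: the function name, the region identifiers, the scores, and the threshold value are all hypothetical, and treating a score equal to the threshold as marked is an assumption.

```python
def binarize_marks(scored_regions, threshold):
    """Return the ids of regions whose score meets or exceeds the fixed threshold.

    scored_regions: list of (region_id, score) pairs, score in [0, 1].
    """
    return [region_id for region_id, score in scored_regions if score >= threshold]


# Hypothetical candidate regions and threshold (illustration only).
regions = [("a", 0.91), ("b", 0.42), ("c", 0.67)]
HYPOTHETICAL_THRESHOLD = 0.60

marked = binarize_marks(regions, HYPOTHETICAL_THRESHOLD)
print(marked)  # ['a', 'c']
```

Because the threshold is fixed before processing, the operator makes no decision per image, which is the property the text relies on.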
The digitized images (model 861; Howtek, Hudson, NH) obtained with the Second Look system were then transferred to the in-house scheme and processed in exactly the same manner. Each mark (cued region) produced by a CAD system was scored as a true-positive finding if the center of the marked region fell within the boundary of the conservatively drawn contour of the recorded mass area in the manually drawn truth file. Otherwise, the CAD system marking was considered a false-positive finding. This task was performed by one staff person (the same experienced staff person mentioned previously) to avoid interoperator biases. Biases, if any, were assumed to be consistent for all three systems, and this assumption enabled a relative comparison among them, even if there were some biases in absolute terms.
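The scoring rule described above (a mark counts as a true-positive finding when its center lies inside the enlarged truth contour and as a false-positive finding otherwise) can be sketched in Python as follows. This is a minimal illustration, not the procedure used in the study, where scoring was done by a staff member: it assumes contours are stored as polygons of (x, y) vertices that already include the margin, and the function names and coordinates are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside polygon (list of (x, y) vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def score_marks(mark_centers, truth_contours):
    """Count (true-positive, false-positive) marks on one image."""
    tp = fp = 0
    for center in mark_centers:
        if any(point_in_polygon(center, contour) for contour in truth_contours):
            tp += 1
        else:
            fp += 1
    return tp, fp


# Hypothetical image: one square truth region and two CAD mark centers.
truth = [[(10, 10), (30, 10), (30, 30), (10, 30)]]
marks = [(20, 20), (50, 50)]
print(score_marks(marks, truth))  # (1, 1)
```

On an image with no truth contour (a negative examination), every mark is counted as a false-positive finding, which matches how false-positive rates are tabulated for groups 4 and 5.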
Statistical Analysis
True and false findings were tabulated for all examinations. Both breast-based (on either of the mammographic views) and image-based (each image considered as an independent examination) detections were recorded, and detection rates per breast and per image, as well as false-positive rates per examination (all four mammographic views), were computed. The three systems were compared for detection levels (sensitivity) by using a repeated-measures binary-response model in which there were three replicates, one for each patient according to each of the three modalities. The average of false-positive cues provided among the three systems was compared by using Friedman two-way analysis of variance. The number of false-positive findings that were detected in negative screening examinations and those in examinations for which patients were recalled were compared by using the Mann-Whitney U test. All analyses were performed with software (SAS, version 8.2; SAS Institute, Cary, NC). For each modality, the difference in false-positive rates between negative screening examinations and those in which patients were recalled was compared, assuming independent Poisson distributions. All statistical tests were two sided, and a difference with P < .05 was considered significant.
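For readers unfamiliar with the Friedman test named above, the following Python sketch computes the tie-corrected Friedman statistic for false-positive counts from three systems on the same examinations. The analyses in this study were performed with SAS; this is only an illustration, the counts are made-up toy values rather than study data, and the P value uses the large-sample chi-square approximation, which for three systems (2 degrees of freedom) reduces to exp(-Q/2).

```python
import math


def average_ranks(values):
    """Rank values 1..k with average ranks for ties; also return tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def friedman_statistic(rows):
    """Tie-corrected Friedman statistic; rows = one list of k counts per examination."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in rows:
        ranks, ties = average_ranks(row)
        for j, r in enumerate(ranks):
            rank_sums[j] += r
        tie_term += sum(t ** 3 - t for t in ties)
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    if correction == 0:  # every row fully tied: no evidence of any difference
        return 0.0
    return q / correction


# Toy false-positive counts (hypothetical): 4 examinations x 3 systems.
toy_counts = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
q_stat = friedman_statistic(toy_counts)
p_value = math.exp(-q_stat / 2)  # valid only for k = 3 systems (2 df)
print(q_stat)  # 8.0
```

With real data, many examinations have equal counts across systems, so the tie correction matters; the uncorrected statistic would understate the evidence for a difference.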
| RESULTS |
|---|
| DISCUSSION |
|---|
It is not reasonable to expect, especially on a worldwide level, that film images will quickly be totally replaced by digital images. For that reason, most CAD systems currently in use must provide a method to digitize images, and this process in combination with differences in CAD algorithms may lead to problems in regard to standardization and reproducibility of results even when applied to a single system (28–30). There is little doubt that differences in performance among CAD systems will remain. If we want to collect data that allow radiologists to improve the practice of screening mammography, it is important that we understand the possible effects that may result from using different CAD systems. There are few data about a comparison of performance of different CAD systems when applied to the same sets of examinations. As systems continue to evolve and improve, the results of such comparisons are valid only for the experimental conditions being implemented with the specific systems (eg, digitizer and software versions) that were studied. For that reason, the results of such studies, while interesting and possibly suggestive of the effects of system differences at a given time and for a specific distribution of examinations, may be obsolete within a short period. It is important, however, to recognize that there are differences (frequently substantial) in the performance levels of different CAD systems. If such differences affect radiologists during clinical interpretations of findings of screening mammographic examinations, one should be aware of them (18,31).
Lechner et al (19) compared two Food and Drug Administration-approved CAD devices, ImageChecker M1000 (R2 Technologies) and Second Look. They found that 90% and 89% of abnormalities associated with cancers in 120 examinations were detected by the ImageChecker and Second Look systems, respectively. While 100% and 90% of the ten examinations with both masses and microcalcification clusters were detected with the two systems, respectively, only 84% and 82% of the 67 masses without clusters were identified with the two systems. Similar performance levels were reported in other studies (26), albeit no comparisons with other systems were made. A review of the findings from these studies, as well as of the Food and Drug Administration approval process, suggests the following: The performance of the two commercial systems is reasonably comparable for all practical purposes. If differences exist, they are small and would require large sample sizes to quantify them (32).
Our study is somewhat different in the examination selection process. We attempted to select a sequentially acquired, and potentially representative, sample of each type of examination to allow generalizability, at least to our own screening population. Recently, investigators in a study (20) reported that the patient-based sensitivity for detection of "actionable architectural distortion" with these two systems when applied to 45 examinations (in 43 patients) was less than 50% for either system. In another study of retrospectively reviewed prior examinations with findings that suggested "evidence of cancer on prior mammograms," approximately 50% sensitivity for mass detection (eight of 19 with Second Look and 12 of 19 with ImageChecker) on prior images was indicated (22).
Although our study is similar to that of Shile and Guingrich (22) in that we attempted to select a representative population of examinations, it differs in several respects. First, we included a series of all available sequentially acquired sets of examinations.
Second, our false-positive rate was computed from a set of negative examinations rather than from the same examinations in which a mass was found.
Third, our assessment of CAD performance in the five sets of examinations allows one to have a better perspective of the possible effect of CAD on clinical practices with each type of mammogram. In our study, performance of all systems was at somewhat lower levels than expected. This could be the result of several factors, including, but not limited to, the difficulty of detection of the "average" cancer with our screening program. The conservatively defined mass regions (targets) reduced the possibility of biases that would result from exact marking. The use of only one experienced person, who was not involved in our CAD development team, to rate the correct markings ensured consistency in the scoring. This should have decreased, if not completely eliminated, any biases in the relative comparison among the systems. At this level of performance, we showed that experienced radiologists do not substantially improve their mass detection performance levels in the laboratory (18), and we suspect that this might be the case in the clinical environment, as well (17). Interestingly, the false-positive rates for examinations in which patients were recalled but that later proved to be negative examinations (group 5) were higher than were the rates for negative screening examinations. This finding suggests that these mammograms are more difficult for the CAD, as well as for the human observer, to analyze correctly. The performance of all three CAD systems was not very high in either the set of examinations with false-negative interpretations or the set of prior examinations with actually positive interpretations. This finding suggests that, at least in our environment, the potential improvement in earlier detection of masses with the use of current CAD systems is perhaps somewhat limited. Although seemingly unimportant as long as detection rates are comparable, the false-positive rates may affect general radiologists' reliance on the CAD results.
High false-positive rates may result in low reader confidence in the CAD marking, since many cues have to be reviewed and discarded as negative findings (18).
In addition, there are some indications that performance in the noncued areas may be affected by the false-positive rate, as well (18). Because of the substantial difference in medicolegal liability between false-negative and false-positive interpretations, the effect of the CAD-generated false-positive cueing rate on noncued cancers may be an important issue to consider.
As to the lower performance of our own in-house scheme for CAD, we note that the scheme was originally designed and optimized for images digitized with a different digitizer (18,25), which has substantially different signal and noise characteristics. Also, our current scheme does not limit the total number of regions identified as suspicious per examination, as do other systems (33). Despite these limitations, it performed reasonably well in a direct comparison with two commercial systems.
Our study had several limitations. First, as previously indicated, our selection protocol may have been somewhat biased in favor of the ImageChecker system in that the images obtained in these examinations (with the exception of group 2) had been processed with this system during the initial clinical interpretation, and this bias possibly influenced the results of these examinations. However, our experience to date indicates that in our practice the changes were minor at best, particularly with respect to the detection of masses (17).
Second, the verification of negative examinations was based on findings at the subsequent screening examination. Although not optimal, this was the most recent available examination at the time, and we assumed that errors in this regard, if any, were not likely to affect the relative performance comparisons we described.
Third, the study was limited to the mammograms acquired at one institution and the masses detected by one group of radiologists. However, we do not believe that this limitation affected the results in a manner that would substantially affect similar comparisons at other institutions.
Fourth, our conservative approach to generation of the targets (ie, drawing the mass regions) may have affected the results. However, we verified that this effect was not substantial (<5% in this set of cases) and did not affect the comparison of relative performance levels of the three CAD systems.
Fifth, it could be argued that one of the limitations of the study was that we tested complete systems and not the software scheme alone. Hence, the comparison could have been affected by the digitizers in the two commercial systems we used. The fact is that a commercial CAD system is integrated, and these systems were tested largely as they would be used in a clinical environment. In this study, we cannot comment on a comparison that would be based on testing of the software alone.
Last, our study focused on the detection of masses. The significantly higher performance of CAD systems in the detection of microcalcifications may be sufficient to warrant the routine use of these systems alone. Other nondetection issues, such as the assessment of possible efficiency improvements in the reading of mammograms because of the high performance in the detection of microcalcifications, were clearly beyond the scope of this study.
In summary, we observed somewhat lower than expected case-based and image-based detection rates with CAD for all three systems. This is not to indicate that CAD cannot help the radiologist, even at these levels of performance, in different clinical environments, particularly radiologists with less experience in the interpretation of screening mammograms. However, the level of improvement is not likely to be what had been estimated from retrospective studies in a laboratory environment. Results of this study clearly indicate that marked improvements in CAD performance levels for mass detection are both desired and possible, and continuing efforts should be expended in this area.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Authors stated no financial relationship to disclose.
The content of the information contained herein does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred.
Author contributions: Guarantor of integrity of entire study, D.G.; study concepts and design, D.G., B.Z.; literature research, B.Z., J.S.S., H.E.R.; experimental studies, J.S.S., D.G.; data acquisition, J.S.S.; data analysis/interpretation, H.E.R., B.Z., D.G.; statistical analysis, H.E.R.; manuscript preparation, D.G., L.A.H., J.H.S., D.M.C., B.E.S.; manuscript definition of intellectual content, D.G., L.A.H., J.H.S., H.E.R., D.M.C., B.E.S.; manuscript editing, revision/review, and final version approval, all authors.
| REFERENCES |
|---|