Breast Imaging
1 From the Department of Radiology, University of Aberdeen, Lilian Sutton Bldg, Foresterhill, Aberdeen, Scotland, AB25 2ZD (F.J.G., M.G.C.G.); Department of Imaging Science and Biomedical Engineering, University of Manchester, Manchester, England (S.M.A., P.M.G.); Department of Public Health and General Practice, Christchurch School of Medicine, Christchurch, New Zealand (M.A.M.); Nightingale Center, Withington Hospital, Manchester, England (C.R.M.B.); and Department of Epidemiology, Mathematics and Statistics, Wolfson Institute of Preventive Medicine, London, England (S.W.D.). Received June 29, 2005; revision requested August 25; revision received October 5; accepted October 19; final version accepted February 10, 2006. Supported by Cancer Research UK and the UK NHS Breast Screening Program. Address correspondence to F.J.G. (e-mail: f.j.gilbert{at}abdn.ac.uk).
| ABSTRACT |
|---|
Materials and Methods: Local research ethics committee approval was obtained; informed consent was not required. This study included a sample of 10 267 mammograms obtained in women aged 50 years or older who underwent routine screening at one of two breast screening centers in 1996. Mammograms that were double read in 1996 were randomly allocated to be re-read by eight different radiologists using CAD. The cancer detection and recall rates from double reading and single reading with CAD were compared. Statistical significance and confidence intervals were calculated with the McNemar test to account for the matched nature of the data.
Results: Single reading with CAD led to a cancer detection rate that was significantly (P = .02) higher than that achieved with double reading: the detection rate was 6.5 percentage points higher with single reading with CAD than with double reading (49.1% vs 42.6%, respectively). However, the recall rate was also higher for single reading with CAD than for double reading (8.6% vs 6.5%, respectively; P < .001). These differences were equivalent to relative increases of 15% and 32% in the cancer detection and recall rates, respectively.
Conclusion: Single reading with CAD leads to an improved cancer detection rate and an increased recall rate.
© RSNA, 2006
| INTRODUCTION |
|---|
The NHSBSP has recently been extended: The upper inclusion age has been increased from 64 years to 69 years, and two images (oblique and craniocaudal views) of each breast are now obtained at every examination. When the program was started in 1988, two views of each breast were obtained at the first (prevalent) screening session, and single views were obtained thereafter. These changes have increased the workload of readers (4), and additional radiologists and radiographers are needed to support and sustain them (5). Originally, mammograms were read by one radiologist (hereafter, single reading). However, in most centers, mammograms are now read independently by two radiologists (hereafter, double reading), as cancer detection rates are 5%–15% higher with double reading than with single reading (6–11).
A computer-aided detection (CAD) system that uses prompts to attract reader attention to suspicious features on mammograms (12–14) could conceivably improve the performance achieved with single reading so that it matches the performance achieved with double reading. The potential of CAD to enable detection of additional cancers and detection of cancers at an earlier stage than they would be detected with single reading has been demonstrated by the findings of several retrospective studies (15–18). However, evidence of a benefit from the use of CAD in a prospective screening setting is both limited and conflicting. Freer and Ulissey (12) reported a 19.5% increase in the cancer detection rate and a parallel 18.5% increase (from 6.5% to 7.7%) in the recall rate when they used CAD to assess 12 860 mammograms. In contrast, in a time series study of 115 571 mammograms, no difference in cancer detection or recall rates was reported after CAD was introduced (19). Recall rates for 14 817 mammograms after the introduction of CAD were compared with historic data for 23 682 mammograms; the introduction of CAD did not significantly affect the recall rate (16).
It is not known if the sensitivity achieved with single reading with CAD is comparable to that achieved with double reading. The findings of retrospective reviews in which prior mammograms from cancer cases were used suggest that single reading with CAD could yield the same performance as double reading, provided all correct prompts were recalled (18). On the other hand, the performance of double reading of mammograms was better in a simulated setting in which the performance data of individual readers were compared (20). Two other studies (21,22) revealed similar sensitivity and specificity between a single reading and a simulated double reading; however, these studies were not conducted in a screening environment with a large number of normal cases. This limits the possibilities for extrapolating the results to those in a screening setting in which only one in 100–200 mammograms shows cancer.
CAD systems generally have high sensitivity but only moderate specificity. Large numbers of prompts are generated; thus, the reader must decide which prompts require action and which should be dismissed. This could reduce the effectiveness of prompting when CAD is used as part of a routine screening program in which most of the mammograms are normal. The reader must learn to correctly dismiss the prompts that mark benign lesions or normal tissue without dismissing the prompts that mark cancers.
A prospective randomized trial would yield the most information pertaining to the role of CAD in breast screening. However, in the United Kingdom, it is first necessary to demonstrate that the performance of single reading with CAD is no worse than the performance of double reading, which is the current standard practice. This ensures that women will not be disadvantaged by the use of CAD. The Computer Aided Detection Evaluation Trial, or CADET, was therefore established to compare double reading with single reading with CAD. Thus, the purpose of our study was to retrospectively determine whether CAD can improve the performance of single reading to the level achieved with double reading in the United Kingdom.
| MATERIALS AND METHODS |
|---|
Case Selection
This trial was designed to compare the performances (cancer detection rate and recall rate) of single reading with CAD and double reading in a cohort of cases. Sufficient time needed to elapse between double reading and single reading with CAD so that interval cancers and subsequent screening-detected cancers could be identified. The Computer Aided Detection Evaluation Trial was an equivalence study in terms of sensitivity (Appendix); it was designed so that a 95% confidence interval (CI) on the difference between cancer detection rates would rule out the possibility that the performance of single reading with CAD was more than 10% worse than the performance of double reading. This required re-reading 14 000–15 000 mammograms that depicted approximately 210 cancers. All the cancers were retained, and almost 5000 mammograms from women with no subsequent diagnosis of breast cancer were discarded randomly. This yielded a data set of 10 267 mammograms, of which 236 (2.3%) depicted cancer and 10 031 were normal; this data set was more consistent with the case mix encountered in a routine screening setting than were the data sets of most studies.
Mammograms were sampled from all routine screening mammograms that were obtained in 1996 in women aged 50 years or older and were double read. Two NHSBSP screening centers (Northeast Scotland Breast Screening Center, Aberdeen, Scotland, and Nightingale Center) participated in this study. Cancers had histologic evidence of malignancy present at biopsy, or in rare cases, at cytology. Study cancers were classified into three groups: (a) those detected at screening in 1996 (screening-detected cancers), (b) those detected in the 3 years between scheduled screenings (interval cancers), and (c) those detected at screening in 1999 (subsequent screening-detected cancers). Cancers detected after screening in 1999 were identified and termed poststudy cancers. Screening-detected cancers were identified from records held by the central screening system and confirmed by the two centers. Interval cancers were identified by means of record linkage undertaken by local cancer registries; the two centers confirmed these cancers. Local research ethics committee approval was obtained, informed consent was not required, and all mammograms were anonymized.
Digitization and CAD Analysis
The anonymized mammograms were digitized, and the results of CAD image analysis were displayed on a flat-panel display screen as low-spatial-resolution images overlaid with markers that indicated areas of potential abnormalities. The image analysis algorithms generated prompts for masses (asterisks) and microcalcifications (triangles). Regions in which both a mass and a microcalcification were depicted were marked with a composite "malc" (ie, mass and calcification) marker. In all cases, the size of the prompt was related to the likelihood of cancer, as determined by the algorithms. The PeerView (R2 Technology) facility enabled readers to view annotated enlargements of regions that contained prompts.
Readers and CAD Training
Four readers from each center participated in the study. All eight readers (including F.J.G. and C.R.M.B.) met the quality assurance standards of the NHSBSP (23) and read an average of more than 5000 mammograms per year. Readers at the Northeast Scotland Breast Screening Center had 2–13 years of screening experience; at the Nightingale Center, readers had 2–15 years of screening experience. The readers who interpreted mammograms in 1996 met the same NHSBSP standards and had 1–6 years of experience at the Northeast Scotland Breast Screening Center and 7–8 years of experience at the Nightingale Center. Four of the eight readers in our study (including F.J.G. and C.R.M.B.) participated in double reading of mammograms in 1996. The study coordinator (M.A.M.) ensured that these readers did not read mammograms that they had previously read in 1996.
The four readers at each center underwent a 2-month training period that consisted of an initial training session taught by representatives of R2 Technology; this was followed by consolidation and practice sessions in which six training sets of 75–100 cases were used. Training sets were completely separate from cases used in the study. After each session, readers were given access to truth data that allowed them to assess and improve their performance with CAD. In the initial sets, 25% of mammograms showed cancer. This percentage was progressively reduced to 5% to train the readers to dismiss prompts on normal mammograms in an environment with a low cancer rate, such as the screening setting (24).
Reading Procedure
Screening mammograms were randomly allocated to be read by a radiologist who had not been recorded as the first or second reader in 1996. Each mammogram was first viewed by the reader, and abnormalities (if any) were recorded on a pro forma data sheet, along with a recommendation to recall the patient for further assessment or for the patient to return in 3 years for routine screening. Prior mammograms were hung if they had been used at the time of the original double reading. The position and type of abnormality were marked, and the degree of suspicion was scored on a five-point scale (1, normal or benign; 2, probably benign; 3, indeterminate; 4, suspicious; and 5, malignant). The reader then accessed the prompt image and reviewed the mammograms to further examine any areas with CAD prompts. Any additional findings, along with another score and a recommendation for future imaging, were recorded. Readers were aware that this reading procedure differed from a routine screening procedure in that the case mix contained a higher proportion of cancer cases and the recommendation for recall would be recorded but no action would be taken.
The reading procedures that were used in 1996 at each screening center were replicated for individual readers using CAD. In one center, a patient was recalled if either reader recommended that she be recalled. In our study, each reader decided whether to recall a patient. In the other center, mammograms were scored with a five-point scale. Patients were recalled if either reader assigned a score of 3 or higher to a mammogram. If both readers assigned a score of 1 to a mammogram, the patient was not recalled. If a score of 2 was assigned by either reader, the case was discussed by the readers involved or with another reader, and they decided whether to recall the patient. In our study, cases with a score of 2 were discussed with another reader to determine whether to recall the patient.
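The recall rules of the center that used the five-point scale can be written out as a small decision function. The following is a minimal sketch rather than the trial's actual software: the names `Decision`, `double_reading_decision`, and `single_reading_with_cad_decision` are hypothetical, and the single-reader handling of scores of 1 and of 3 or higher is an assumption inferred from the statement that the 1996 procedures were replicated for individual readers (the text states explicitly only that a score of 2 was discussed with another reader).

```python
from enum import Enum


class Decision(Enum):
    RECALL = "recall for further assessment"
    ROUTINE = "return in 3 years for routine screening"
    DISCUSS = "discuss with another reader before deciding"


def _check(score: int) -> None:
    # Scores follow the five-point scale: 1 normal/benign ... 5 malignant.
    if score not in range(1, 6):
        raise ValueError(f"score must be between 1 and 5, got {score}")


def double_reading_decision(score_a: int, score_b: int) -> Decision:
    """1996 double-reading rule at the center that scored mammograms."""
    _check(score_a)
    _check(score_b)
    if max(score_a, score_b) >= 3:
        return Decision.RECALL      # either reader assigned 3 or higher
    if score_a == 1 and score_b == 1:
        return Decision.ROUTINE     # both readers assigned 1
    return Decision.DISCUSS         # at least one score of 2, none above 2


def single_reading_with_cad_decision(score: int) -> Decision:
    """Study rule for one reader using CAD (partly assumed, see lead-in)."""
    _check(score)
    if score >= 3:
        return Decision.RECALL      # assumption: mirrors the 1996 threshold
    if score == 2:
        return Decision.DISCUSS     # stated in the text
    return Decision.ROUTINE         # assumption: a score of 1 is not recalled


if __name__ == "__main__":
    print(double_reading_decision(1, 2).value)
    print(single_reading_with_cad_decision(4).value)
```

At the other center no scoring threshold applied: each reader simply recommended recall or routine screening, and in 1996 a recommendation from either reader was sufficient for recall.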
Statistical Analysis
The primary outcome measures were cancer detection rate and recall rate. We also compared the overall recall rate and the recall rate of normal cases (false-positive findings) for a single reader using CAD with the recall rate of the original two readers. Thus, the detection rate of a single reader using CAD was calculated by using the screening-detected cancers, the interval cancers, and the subsequent screening-detected cancers. Poststudy cancers were not included in the first analysis. Sensitivity analysis was performed thereafter and included these cancers. Statistical significance and CIs were calculated with the McNemar test to take into account the matched nature of the data (25). Stata statistical software (version 8.0; Stata, College Station, Tex) was used (26).
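Because each mammogram was read under both regimens, only the discordant pairs (cases detected or recalled by one regimen but not the other) carry information about the difference. The sketch below illustrates an exact McNemar test and a simple interval for the paired difference. It is not the Stata routine used in the study: the function names are hypothetical, the Wald interval is an assumed choice that may differ from the interval Stata reports, and the counts in the example are invented for illustration and are not the trial data.

```python
from math import comb, sqrt


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value.

    b: pairs positive with regimen A only (e.g., single reading with CAD)
    c: pairs positive with regimen B only (e.g., double reading)
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_difference_ci(b: int, c: int, n_pairs: int, z: float = 1.96):
    """Wald interval for the difference in paired proportions, (b - c) / n."""
    diff = (b - c) / n_pairs
    se = sqrt(b + c - (b - c) ** 2 / n_pairs) / n_pairs
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical counts, NOT the trial data: among 200 cancers,
    # 20 found only by regimen A and 8 found only by regimen B.
    b, c, n_pairs = 20, 8, 200
    p_value = mcnemar_exact(b, c)
    diff, lower, upper = paired_difference_ci(b, c, n_pairs)
    print(f"P = {p_value:.3f}")
    print(f"difference = {diff:.1%} (95% CI: {lower:.1%}, {upper:.1%})")
```

Cases that both regimens detected, or that both missed, do not enter the test statistic; they affect only the denominator of the estimated difference.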
| RESULTS |
|---|
| DISCUSSION |
|---|
The recall rate was 8.6% for single reading with CAD and 6.5% for double reading. In effect, this difference was due to reduced specificity for single reading with CAD. Recall rates in the United Kingdom in 1996 and 1997 were 6.7% in Scotland and 4.9% in England. In 2002 and 2003, the recall rate was 6.1% in Scotland and 5.2% in England. The observed increase in the recall rate for single reading with CAD is not excessive when one considers that cancer detection rates in the United Kingdom and the United States are similar but that the recall rate in the United Kingdom is approximately half that in the United States (1). Published reports of single reading with CAD in a prospective setting indicate that there is an increase in recall rate (12) or no significant change (16,19).
We believe the strength of our study was its large sample size that included almost all subsequent screening-detected cancers and interval cancers. This allowed the pathologic finding–based reference standard to be used to determine which cases were cancers. In our study, the ratio of cancer cases to normal cases was closer to the ratio in a screening situation than the ratio in many previous retrospective studies (9,16,20,21,27); this resulted in a better indication of how the reader might behave in a prospective setting when he or she would need to ignore a large number of false prompts on normal screening mammograms.
Most evaluations of CAD have been conducted in the United States, where the screening population and program differ considerably from those in the United Kingdom: US programs have higher recall rates, different age ranges and screening intervals, and variable ascertainment of interval cancers (1,3). The success of CAD in a screening program is highly dependent on the specificity of the prompts and the effect that these prompts have on reader behavior and performance. A relatively large number of false prompts may lead to reader fatigue and reduced performance. Readers may begin to ignore both true and false prompts in a situation in which the cancer rate is low, such as routine screening. Readers also need to avoid becoming too reliant on CAD prompting or being falsely reassured of the absence of cancer if there are no CAD prompts (28). If we accept that the cancer detection rate is highest for double reading with arbitration by a third reader (6), it may be that use of arbitration for equivocal cases read with CAD would permit a high sensitivity to be achieved without compromising specificity, so that recall rates with CAD could be kept at a level acceptable to the NHSBSP.
A limitation of our study was that approximately 70% of the cases were single-view mammograms. This does not reflect current practice, in which two-view mammography is the standard in the majority of screening programs. It has been shown that CAD accuracy is less for single-view mammography than for two-view mammography; however, this should not have affected the overall results of our study.
Further limitations of our study were its retrospective design and the difference in experience levels between the readers in this study and those in the original reading exercise. Single reading with CAD was performed in 2003 and 2004, whereas the original double reading was performed in 1996. Although the range of experience of the eight readers in our study was comparable to that of the readers who read mammograms in 1996, reader performance may have improved during this period, and this could be partially responsible for the better performance observed here for single reading with CAD. Standardized detection rates (ie, the ratio of the number of invasive cancers detected to the number of invasive cancers expected in the age distribution of the population) in the United Kingdom in 2003 were 1.35 at first (prevalent) screening and 1.18 at subsequent screening (29); corresponding figures in 1996 and 1997 were 1.17 and 0.94, respectively. This suggests a 15% relative improvement in cancer detection for prevalent screening and a 26% improvement for subsequent screening. Improvements in image quality, the increasing use of two views (30–32), and higher background cancer incidence (33) probably account for most of the observed increase in the cancer detection rate. There may have been a modest improvement in reader performance, but this was likely to have been less than the 15% improvement (from 42.6% to 49.1%) observed for single reading with CAD. Thus, the observed improvement may correspond to a true equivalence of the two detection regimens rather than to an advantage for single reading with CAD. It remains unlikely, however, that single reading with CAD is actually inferior to double reading.
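The relative improvements quoted in this discussion follow directly from the paired figures in the text. As a quick arithmetic check, the sketch below recomputes them; the helper name `relative_change` and the dictionary labels are illustrative only, and all input values are those reported in the article.

```python
def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a fraction of old."""
    if old == 0:
        raise ValueError("old value must be nonzero")
    return (new - old) / old


# (earlier or reference value, later or comparison value), as reported in the text
figures = {
    "standardized detection rate, prevalent screening (1996-1997 vs 2003)": (1.17, 1.35),
    "standardized detection rate, subsequent screening (1996-1997 vs 2003)": (0.94, 1.18),
    "cancer detection rate, double reading vs single reading with CAD (%)": (42.6, 49.1),
    "recall rate, double reading vs single reading with CAD (%)": (6.5, 8.6),
}

if __name__ == "__main__":
    for label, (old, new) in figures.items():
        print(f"{label}: {relative_change(old, new):+.1%}")
```

Rounded to whole percentages, these ratios reproduce the 15% and 26% improvements in standardized detection rates and the 15% and 32% relative increases in cancer detection and recall rates reported for single reading with CAD.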
In addition, re-reading differed from routine screening in that the readers were aware of the cancer-enriched case mix and that their decisions would not have any clinical implications. Both factors could have led to an increase in the recall rate that could have contributed to an increase in the cancer detection rate. Readers are aware of the adverse psychological consequences of recall; therefore, they try to keep recall to a minimum, without missing cancers. However, in our study, readers knew that the recall would not actually happen, and knowing this may have caused them to increase their recall rate subconsciously.
In conclusion, our results show that the performance of single reading with CAD is equivalent to the performance of double reading. There was a slight increase in the recall rate that may have been caused by the additional cancers in the study population. Double reading is the normal practice in the NHSBSP; however, it would now be acceptable and ethical to undertake a randomized controlled trial to determine whether diagnostic performance is maintained with CAD in a prospective setting.
| APPENDIX: SAMPLE SIZE CALCULATION |
|---|
| ADVANCE IN KNOWLEDGE |
|---|
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Abbreviations: CAD = computer-aided detection, CI = confidence interval, NHSBSP = National Health Service Breast Screening Program
Authors stated no financial relationship to disclose.
Author contributions: Guarantor of integrity of entire study, F.J.G.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, F.J.G., S.M.A., M.G.C.G., C.R.M.B.; clinical studies, F.J.G., M.A.M., C.R.M.B.; experimental studies, M.A.M.; statistical analysis, M.A.M., P.M.G., S.W.D.; and manuscript editing, all authors.
| References |
|---|