Breast Imaging
1 From the Biostatistics Division, Department of Interdisciplinary Oncology, Moffitt Research Center, 12902 Magnolia Dr, Tampa, FL 33612-9497 (M.J.S.); Departments of Radiology (B.C.Y.) and Biostatistics (B.F.Q.), University of North Carolina, Chapel Hill, NC; Applied Research Program, Division of Cancer Control and Population Studies, National Cancer Institute, Bethesda, Md (R.B.); Department of Biostatistics, University of Washington, Seattle, Wash (W.E.B.); Department of Radiology, University of New Mexico, Health Sciences Center, Albuquerque, NM (R.D.R.); and Department of Radiology, Epidemiology and Biostatistics, University of California at San Francisco, San Francisco, Calif (R.S.). Received February 27, 2006; revision requested April 27; revision received July 28; accepted August 29; final version accepted October 4. Supported by a National Cancer Institute-funded Breast Cancer Surveillance Consortium cooperative agreement (U01CA63740, U01CA86076, U01CA86082, U01CA63736, U01CA70013, U01CA69976, U01CA63731, U01CA70040). Address correspondence to M.J.S. (e-mail: michael.schell{at}moffitt.org).
| ABSTRACT |
|---|
Materials and Methods: The study group included 1 872 687 subsequent and 171 104 first screening mammograms from 1996 to 2001 from 172 and 139 facilities, respectively, in six sites of the Breast Cancer Surveillance Consortium. Institutional review board (IRB) approval was obtained from each site. Informed consent requirements of the IRBs were followed. The study was HIPAA compliant. Recall rate was defined as the percentage of screening studies for which further work-up was recommended by the radiologist. Sensitivity was defined as the proportion of cancers that were detected at screening mammography. Piecewise linear regression was used to model sensitivity as a function of recall rate. This model allows detection of critical recall rates at which significant changes (shifts) occurred in the rate at which sensitivity increased with increasing recall rate. These rates were interpreted as the number of additional work-ups per additional cancer detected (AW/ACD), that is, the estimated number of additional women who would need to be recalled at a given recall rate to detect one additional cancer.
Results: For first mammograms, a single shift in the estimated AW/ACD rate occurred at a recall rate of 10.0%, with the rate jumping dramatically from 35 to 172. For subsequent mammograms, four shifts were identified. At a recall rate of 6.7%, the estimated AW/ACD increased from 80 to 132, which rendered it the highest desirable target recall rate. At a recall rate of 12.3%, the estimated AW/ACD was 304, which suggests little benefit for any higher recall rate.
Conclusion: Recall rates of 10.0% for first and 6.7% for subsequent mammograms are recommended targets on the basis of their AW/ACD rates (less than 100).
© RSNA, 2007
| INTRODUCTION |
|---|
Different groups have recommended different target recall rates. European guidelines recommend a target recall rate of 5%, with an acceptable rate of less than 7% for first screenings and a target recall rate of 3% (acceptable rate <5%) for subsequent screenings (6,7). The American College of Radiology and the U.S. Agency for Health Care Policy and Research both recommend an overall recall rate of less than 10% (8,9). However, to our knowledge, these targets have not been evaluated relative to their effect on sensitivity and PPV on the basis of data that reflect current mammographic screening examinations within clinical practice in the United States.
Thus, the goal of our study was to retrospectively identify target recall rates for screening mammography on the basis of how sensitivity shifts with recall rate.
| MATERIALS AND METHODS |
|---|
For our study, data from one site were not included because that site did not collect data at the facility where mammography was performed.
All data related to the screening mammographic examination were collected at the facility at the time of mammography. At mammography, patients completed a breast health survey, which included date of birth, history and date of previous mammography, and reported presence of breast signs and symptoms (lump, nipple discharge, or others, not including breast pain).
The interpreting radiologist recorded the indication for the examination, additional imaging studies performed, and date of previous mammography. In addition, breast density and mammographic assessment were recorded by using the recommended categories of the American College of Radiology Breast Imaging Reporting and Data System (BI-RADS) (13). Breast density was categorized as extremely dense, heterogeneously dense, scattered fibroglandular densities, or almost entirely fat. Mammographic assessment was performed with the following categories: 0, needs additional imaging evaluation; 1, negative finding; 2, benign finding; 3, probably benign finding; 4, suspicious abnormality; and 5, highly suggestive of malignancy.
The registries collected breast cancer information from regional Surveillance, Epidemiology, and End Results programs, state cancer registries, and pathology databases. Cancers were categorized as either invasive disease or ductal carcinoma in situ. (Lobular carcinoma in situ was considered benign for this analysis.)
Mammograms were separated into first or subsequent examinations; 4.4% of the mammograms were dropped at this point, because it could not be determined to which group they belonged. For the subsequent mammograms, 23 (11.8%) of 195 facilities were excluded from the analysis. Reasons for exclusion were that 22 facilities had no cancer results (1993 mammograms) and one facility had a recall rate greater than 40% (6111 mammograms). For first mammograms, 50 (26.5%) of 189 facilities were excluded from the analysis. Reasons for exclusion were that 44 (88%) of the 50 facilities had no cancer results (7999 mammograms), five (10%) of 50 facilities had fewer than 100 mammograms (313 mammograms), and one facility had a recall rate greater than 40% (429 mammograms). After the exclusions, 1 872 687 subsequent mammograms from 172 facilities and 171 104 first mammograms from 139 facilities remained in the study for analysis. Overall, findings from 2 043 791 screening mammograms obtained by 912 radiologists in 172 facilities were included in this study. This represents an average of 2241 mammograms per radiologist.
Mammographic Assessment
All mammographic studies were assessed by radiologists with BI-RADS assessments. The follow-up period for all screening mammography was 12 months or the time to the next screening examination, if that occurred between 9 and 12 months later. For our sensitivity calculations, a screening mammographic examination was considered to yield a positive finding if the assessment was needs further evaluation, suspicious abnormality, suspicious for malignancy (BI-RADS categories 0, 4, and 5, respectively), or probably benign (category 3) when accompanied by a recommendation for immediate imaging follow-up. A screening mammographic examination was considered to yield a negative finding if the assessment was normal, benign (BI-RADS categories 1 and 2, respectively), or probably benign (BI-RADS category 3) and did not have a recommendation for immediate follow-up.
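The positive/negative rule described above can be written as a small function. This is only an illustrative sketch: the function name and the `immediate_followup` flag are hypothetical, and the mapping follows the category definitions given in this section rather than any published software.

```python
def is_positive_screen(birads_category: int, immediate_followup: bool = False) -> bool:
    """Return True if a screening assessment counts as a positive finding.

    Categories 0, 4, and 5 are positive; category 3 is positive only when
    accompanied by a recommendation for immediate imaging follow-up;
    categories 1 and 2 are negative.
    """
    if birads_category in (0, 4, 5):
        return True
    if birads_category == 3:
        return immediate_followup
    if birads_category in (1, 2):
        return False
    raise ValueError(f"unexpected BI-RADS category: {birads_category}")


# Category 3 is the only assessment whose classification depends on the
# follow-up recommendation.
print(is_positive_screen(3), is_positive_screen(3, immediate_followup=True))
```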
Reference Standard
A mammogram with a positive finding yielded a true-positive finding (TP) if cancer was diagnosed and a false-positive finding (FP) if cancer was not diagnosed in the follow-up period. A mammogram with a negative finding yielded a true-negative finding (TN) if no breast cancer was diagnosed and a false-negative finding (FN) if cancer was diagnosed in the follow-up period.
Sensitivity was defined as the proportion of cancers that were detected, calculated as TP/(TP + FN). Specificity was defined as the proportion of individuals without cancer correctly classified as having a negative finding at mammography, calculated as TN/(TN + FP). Recall rate was defined as the proportion of individuals recalled for additional work-up, calculated as (TP + FP)/(TP + FP + TN + FN). Cancer incidence per 1000 mammograms was calculated as 1000 · (TP + FN)/(TP + FP + TN + FN).
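As a minimal sketch of these definitions, the four measures can be computed directly from the TP, FP, TN, and FN counts. The class name and the example counts below are hypothetical and are not taken from the study data.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Counts of screening outcomes for one facility (hypothetical container)."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.tn + self.fn

    @property
    def sensitivity(self) -> float:
        # Proportion of cancers that were detected: TP / (TP + FN)
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Proportion of cancer-free examinations read as negative: TN / (TN + FP)
        return self.tn / (self.tn + self.fp)

    @property
    def recall_rate(self) -> float:
        # Proportion of examinations recalled: (TP + FP) / all examinations
        return (self.tp + self.fp) / self.total

    @property
    def cancers_per_1000(self) -> float:
        # Cancer incidence per 1000 mammograms: 1000 * (TP + FN) / all examinations
        return 1000 * (self.tp + self.fn) / self.total


# Made-up counts for illustration only.
facility = ScreeningCounts(tp=40, fp=950, tn=9000, fn=10)
print(facility.sensitivity, facility.recall_rate, facility.cancers_per_1000)
```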
Statistical Analysis
Sensitivity increases with recall rate, but not necessarily linearly. By using a four-step procedure, a nondecreasing, monotonic, piecewise linear fit to the data was constructed for sensitivity as a function of recall rate on the basis of facility-level data that were weighted by the number of cancers. First, isotonic regression analysis (14) was used to model sensitivity as a constant for various ranges of the recall rate. Isotonic regression provides the least-squares fit to the raw data among the class of all monotonic functions. Second, reduced monotonic regression (15) (α = .50) was used to identify the recall rate groups by combining isotonic regression level sets with similar sensitivity measures. Third, the reduced monotonic regression model was adjusted for site, mean age of women, and percentage of long-interval mammographic examinations (defined for subsequent mammograms as the percentage of mammograms at the facility whose previous mammograms were more than 27 months earlier). Breast density and percentage of women with a personal history of breast biopsy were not included in these adjustments because of incomplete and/or inconsistent reporting across the facilities. Fourth, a concave, monotone, piecewise linear fit (called "concave fit" henceforth) was obtained by joining the mean recall rates for the adjusted sensitivities of the recall rate groups. A piecewise linear segment was used to span multiple groups, if a nonincreasing slope with increasing recall (the concavity requirement) was needed. Separate fits were obtained for first and subsequent mammograms.
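The first step, a weighted isotonic fit, can be illustrated with the pool-adjacent-violators algorithm, which is a standard way to compute such a fit. The text does not say which algorithm was used, so this is a sketch under that assumption. The function name and the facility values are hypothetical; the inputs are assumed to be sorted by increasing recall rate, with the number of cancers as (positive) weights.

```python
def weighted_isotonic_fit(values, weights):
    """Weighted least-squares nondecreasing fit via pool-adjacent-violators.

    values  -- facility sensitivities, ordered by increasing recall rate
    weights -- positive weights (here, number of cancers per facility)
    Returns one fitted value per input; equal runs form the level sets.
    """
    blocks = []  # each block is [weighted mean, total weight, number of points]
    for value, weight in zip(values, weights):
        blocks.append([float(value), float(weight), 1])
        # Pool adjacent blocks while they violate the nondecreasing order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, wt2, n2 = blocks.pop()
            mean1, wt1, n1 = blocks.pop()
            pooled_wt = wt1 + wt2
            pooled_mean = (mean1 * wt1 + mean2 * wt2) / pooled_wt
            blocks.append([pooled_mean, pooled_wt, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Made-up sensitivities and cancer counts for four facilities.
fit = weighted_isotonic_fit([0.60, 0.70, 0.65, 0.80], [10, 20, 20, 30])
print([round(s, 3) for s in fit])
```

In this toy input the second and third facilities are out of order, so they are pooled into a single level set with their weighted mean sensitivity. The later steps (reduced monotonic regression and covariate adjustment) are not shown.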
We provide a brief explanation of the four-step modeling procedure as follows. Because of random variation, virtually no regression relationships are perfectly ordered. In our case, sensitivity does not perfectly increase with increasing recall rate. The recall rate groups provide regions of the domain (recall rate in our case) where the response (sensitivity in our case) is found to be fairly constant. Because we do not believe that sensitivity is intrinsically flat, with jump points at certain recall rates, we use the recall rate groups to construct the concave fit by using linear interpolation of points.
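One way to read the fourth step is as an upper concave hull through the recall rate group points: a group is skipped, and a single segment spans it, whenever keeping it would make the slope increase. The sketch below implements that reading. It is an assumption about the construction rather than the authors' stated algorithm, and the function name and example points are hypothetical.

```python
def concave_fit_knots(points):
    """Return the knots of a concave piecewise linear fit.

    points -- (mean recall rate, adjusted sensitivity) pairs, one per recall
              rate group, sorted by strictly increasing recall rate
    A point is dropped when the slope into it is smaller than the slope out
    of it, so the remaining segments have nonincreasing slopes.
    """
    knots = []
    for x, y in points:
        while len(knots) >= 2:
            (x1, y1), (x2, y2) = knots[-2], knots[-1]
            # Concavity is violated if slope(knot1 -> knot2) < slope(knot2 -> new).
            if (y2 - y1) * (x - x2) < (y - y2) * (x2 - x1):
                knots.pop()  # span the middle group with a single segment
            else:
                break
        knots.append((x, y))
    return knots


# Made-up group means (recall rate in percent, sensitivity as a proportion).
groups = [(2, 0.60), (4, 0.62), (6, 0.75), (10, 0.80)]
print(concave_fit_knots(groups))
```

Here the second group is spanned by one segment from the first to the third group, because the slope would otherwise rise from 0.01 to 0.065 per percentage point of recall. If the input sensitivities are nondecreasing, the resulting fit is also nondecreasing.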
The slopes for the piecewise linear segments were interpreted as number of additional work-ups per additional cancer detected (AW/ACD). We defined AW/ACD as (DNR)/(DCD), where DNR is difference in number of patients recalled and DCD is difference in number of cancers detected. DCD = OCR · DS, where OCR is the overall cancer rate and DS is difference in sensitivities. OCR is the number of cancers per mammogram for the entire study. This statistic is the reciprocal of what could be called the "incremental PPV" (ie, ACD/AW), where the incremental PPV is obtained by including only women who would not have been recalled at the lower recall rate.
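The AW/ACD definition can be sketched as a short function. For convenience, DNR and DCD are expressed here per mammogram (as differences in proportions), which gives the same ratio as differences in counts. The function name and all input values in the example are hypothetical and are not the study estimates.

```python
def aw_per_acd(recall_low, recall_high, sens_low, sens_high, overall_cancer_rate):
    """Additional work-ups per additional cancer detected between two recall rates.

    recall_low, recall_high -- recall rates as proportions of mammograms
    sens_low, sens_high     -- sensitivities at those recall rates
    overall_cancer_rate     -- cancers per mammogram (OCR in the text)
    """
    dnr = recall_high - recall_low                        # extra recalls per mammogram
    dcd = overall_cancer_rate * (sens_high - sens_low)    # DCD = OCR * DS
    if dcd <= 0:
        raise ValueError("sensitivity must increase for AW/ACD to be defined")
    return dnr / dcd


# Illustrative inputs only: raising recall from 5% to 7% lifts sensitivity
# from 70% to 75% with 5 cancers per 1000 mammograms.
print(round(aw_per_acd(0.05, 0.07, 0.70, 0.75, 0.005), 1))
```

The reciprocal of the returned value is the incremental PPV described above.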
The 95% confidence interval for AW/ACD was obtained by substituting in the lower and upper 95% confidence limits for DS in the AW/ACD formula. The limits were obtained by adding and subtracting 1.96 times the standard deviation for DS, where standard deviation was obtained by using standard binomial theory. For example, the variance for the difference in sensitivity between recall rate groups 3 and 4 for subsequent mammograms (Table 1) is 0.691 · (1 − 0.691)/1019 + 0.749 · (1 − 0.749)/4486 = 0.000251, so the 95% confidence interval for DS is 0.749 − 0.691 ± 1.96 · √0.000251 = (0.027, 0.089).
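The arithmetic in this example can be checked with a few lines of code. The sketch below uses the binomial variance of each sensitivity and the numbers quoted in the text (sensitivities of 0.691 and 0.749 based on 1019 and 4486 cancers); the function name is hypothetical.

```python
import math


def sensitivity_difference_ci(p1, n1, p2, n2, z=1.96):
    """Variance and confidence interval for the difference p2 - p1.

    p1, p2 -- sensitivities of the two recall rate groups
    n1, n2 -- number of cancers on which each sensitivity is based
    """
    variance = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2  # sum of binomial variances
    half_width = z * math.sqrt(variance)
    difference = p2 - p1
    return variance, (difference - half_width, difference + half_width)


variance, (lower, upper) = sensitivity_difference_ci(0.691, 1019, 0.749, 4486)
print(round(variance, 6), round(lower, 3), round(upper, 3))
# prints: 0.000251 0.027 0.089
```

Substituting the two limits for DS into the AW/ACD formula then gives the limits of the AW/ACD interval, with the larger DS limit producing the smaller AW/ACD limit.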
| RESULTS |
|---|
| DISCUSSION |
|---|
We believe that the best choices for recall rate targets are those that represent the mean recall rates for one of the recall rate groups established in the Results section. By using the metric of AW/ACD, a detriment (additional work-up) to benefit measure of increasing or decreasing recall rate, individual mammographers may be able to gauge the effect that changing their recall rate by 1% would have on performance.
While individual mammographers and informed patients may have different ideas regarding the optimal trade-off, we suggest using an AW/ACD of, at most, 100. This benchmark level seems to be a good choice. In our study, a level larger than 132 would lead to a target recall rate of 12.3%, which is higher than that recommended by American College of Radiology guidelines (8,9). A level below 80 would result in a target recall rate of 4.3%. The latter rate is close to European guidelines (6,7), which call for rates below 5%. However, less than 15% of U.S. patients underwent mammography at facilities with rates at or below that rate. The resulting target recall rates are 10.0% for first mammograms and 6.7% for subsequent mammograms. Notably, the recall rate groups with the largest percentages of mammograms evaluated include these target rates, with a small percentage below these rates and a sizable percentage above: 42.9% of first and 36.3% of subsequent mammograms. European guidelines of less than 5% for subsequent screenings correspond to a lower tolerance for AW/ACD (6,7). Interpreted according to our concave fit, their recommendations would suggest use of a maximum AW/ACD between 51 and 80, which would lead to a recall rate group with a 4.3% mean recall rate and 69% sensitivity. Their guideline for less than 7% recalls after first screenings would place them in the lower end of the 6.1%-13.2% recall rate group given in Table 4, which corresponds to an 83.3% sensitivity.
Yankaskas et al (19) showed that these differences do exist when comparing international screening recall rates. In a meta-analysis of 24 mammographic screening programs, Elmore et al (1) concluded that "the percentage of mammograms judged to be abnormal in North American programs was 2-4 percentage points higher than it was in programs from other countries without evident benefit in the yield of cancers detected per 1000 women screened, although an increase was noted in DCIS [ductal carcinoma in situ] detection." This may be exaggerated because they included BI-RADS category 3 as a recall, which is not done internationally or in this study.
Yankaskas et al (5) concluded that facilities with recall rates between 4.9% and 5.5% achieved the best trade-off of sensitivity and PPV. Their range of recall rates was obtained by performing reduced monotonic regressions on the relationships of sensitivity and PPV with recall rate and by identifying the range in which both sensitivity and PPV were maintained at high levels. Our study is nearly 10 times larger than theirs and, to our knowledge, is the largest study to date to examine this issue in the United States. We believe that it improves on that study by splitting mammograms into first and subsequent screenings and by using the concave fit approach, with its accompanying AW/ACD concept. Their target rate is between the reasonable targets of 4.3% and 6.7% presented by us.
There is some concern about the establishment of recall rate guidelines in mammography. Gur et al (20) examined the correlation between recall and cancer detection rates in a group of 10 radiologists from a single academic institution. Noting an increase in the presumed linear relationship, where the recall rates ranged from 7.7% to 17.2%, they concluded that "the performance level of a radiologist in the screening environment is a complex, multifactorial issue that cannot and should not be simplified. Reducing recall rates by 'decree' (through the enforcement of recommended practice guidelines) may result in a corresponding reduction in the detection rates." Their study results, however, focused on detection rates, which lack the additional information on missed cancers that is available with sensitivity. They also modeled the relationship between the recall and detection rates as linear. Consequently, their evidence cannot suggest that any level of recall rate is too high. Inspection of their data suggests that beyond about a 12% recall rate, little or no gain in the detection rate occurred. In a recent commentary, Hardesty et al (21) suggested usage of three nonlinear models to determine the sensitivity-recall rate relationship, including a concave fit, which they erroneously called convex. We used a concave fit in our analysis, which we believe makes sense clinically because discrimination between cancer and noncancer as the recall rate increases becomes increasingly difficult.
In a re-review of missed cancers from a Dutch breast screening program that has the lowest recall rate worldwide, Otten et al (22) examined the effect of increasing the recall rate. In their article, 15 mammographers re-reviewed 495 screening mammograms with negative findings for subsequent screenings, which included 245 missed cancers, from which they extrapolated their outcome measures to 500 000 subsequent screenings by using an equivalent concept to AW/ACD. For recall rates between 4% and 7%, their AW/ACD values were 30%-70% higher than ours. For recall rates between 8% and 10%, their AW/ACD values were within 14% of ours. It is unclear whether the higher values they obtained are because of differences in mammographic evaluation between the United States and the Netherlands, different screening intervals, or the relatively few cancers on which they based their estimates.
Our study had limitations. The relationship between recall rate and sensitivity was assessed by using seven community-based registries in the United States. Extension of these results to other U.S. states and to other regions of the world cannot be assumed from statistical principles but must rather be based on subjective judgment. Because the women included in the BCSC are largely representative of women undergoing screening mammography within the United States, however, we believe it is likely that most radiologists' patients will be similar to those within the BCSC (11). Sensitivity differences were found between the states. We believe that this is most likely because of differences in completeness in identifying cancers between the various state registries. This site difference was adjusted for in obtaining our models, with North Carolina as the referent group because it provided the largest number of mammograms to this study. Thus, the actual sensitivities will differ by state. However, the locations where the AW/ACD rates shifted were modeled as being the same for all sites. As further data are accrued, additional covariates affecting the sensitivities or, perhaps, the location of AW/ACD rate shifts may be found. However, to our knowledge, this study represents the largest collective evidence on the recall rate-sensitivity relationship published to date and incorporates the data from the second-largest study (5).
The data and analysis presented demonstrate a wide variation in recall rate-sensitivity pairs in this convenience sample of U.S. radiologists. This variation likely represents both different preferences and different abilities among radiologists. Clustering of the facilities' results, together with analysis of the gains from additional work-up for these radiologists, strongly suggests ranges of targeted performance. We recommend operating at a target recall rate of approximately 6.7% for subsequent mammograms and 10.0% for first mammograms, because these rates keep the estimated number of AW/ACD lower than 100.
| ADVANCES IN KNOWLEDGE |
|---|
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Abbreviations: AW/ACD = additional work-ups per additional cancer detected; BCSC = Breast Cancer Surveillance Consortium; BI-RADS = Breast Imaging Reporting and Data System; PPV = positive predictive value
Authors stated no financial relationship to disclose.
Author contributions: Guarantor of integrity of entire study, M.J.S.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, M.J.S., B.C.Y., B.F.Q., R.S.; clinical studies, B.C.Y., R.D.R., R.S.; statistical analysis, M.J.S., B.F.Q., W.E.B., R.S.; and manuscript editing, all authors.