Breast Imaging
1 From the Department of Radiology, University of Chicago, 5841 S Maryland Ave, MC2026, Chicago, IL 60637 (Y.J., C.E.M., R.A.S.); and Group Health Cooperative, Seattle, Wash (D.L.M.). Received February 9, 2006; revision requested April 7; revision received June 7; accepted July 7; final version accepted September 20. Supported in part by the National Cancer Institute through grants R01 CA92361 and U01CA86076. Address correspondence to Y.J. (e-mail: y-jiang{at}uchicago.edu).
| ABSTRACT |
|---|
|
|
|---|
Materials and Methods: Each registry and the statistical coordinating center received institutional review board approval along with approval for consenting processes or a waiver of consent to enroll participants, link data, and perform analytic studies. This study was HIPAA compliant. The authors estimated the distribution of individual radiologists' breast cancer detection rates for 2 289 132 screening mammograms (9030 cancers) read by 510 radiologists in the United States who participated in the Breast Cancer Surveillance Consortium from 1996 through 2002. They then computed the distributions of breast cancer detection rates expected from a trial of screening mammography and multiple radiologists, as well as similar distributions for a hypothetical new modality that depicts one additional cancer per reader per 1000 screening examinations. Statistical power was calculated.
Results: The mean screening mammography cancer detection rate for individual radiologists was 3.91 cancers (standard deviation, 1.93; range, 0.25–13.75) per 1000 examinations. To achieve 80% power to detect a hypothetical increase of one additional cancer detected per reader per 1000 screening examinations, a trial in which a new modality was compared with standard mammography would require at least 25 radiologists each reading the images of at least 8000 screening examinations or 91 radiologists each reading the images of 1000–2000 examinations.
Conclusion: The low breast cancer prevalence in an average-risk screening population and the large interradiologist variability in the observed cancer detection rate suggest that for new technologies to demonstrate significant improvement in cancer detection rate in a clinical trial, very large samples of both radiologists and patients will be required.
© RSNA, 2007
| INTRODUCTION |
|---|
|
|
|---|
| MATERIALS AND METHODS |
|---|
|
|
|---|
We included only screening mammograms obtained between January 1, 1996, and December 31, 2002, in women without a personal history of breast cancer. A mammogram was considered a screening examination if the radiologist indicated that it was obtained for routine screening. Mammograms that included only unilateral views and those that followed a mammogram or other related radiologic examination within the preceding 9 months were excluded because they were likely diagnostic examinations. Mammograms were also excluded if computer-aided detection was used or if the radiologist read fewer than 500 mammograms during the study period.
A mammogram was considered positive if it was given a BI-RADS (Breast Imaging Reporting and Data System) assessment score of 0 (need additional imaging evaluation, 176 922 mammograms [7.7% of total cases]), 4 (suspicious abnormality, 9540 mammograms [0.4%]), or 5 (highly suggestive of malignancy, 1244 mammograms [0.1%]) (24). A mammogram was considered negative if it was given an assessment score of 1 (negative, 1 552 941 mammograms [67.8% of total cases]), 2 (benign finding, 488 543 mammograms [21.3%]), or 3 (probably benign finding) with a recommendation for short-interval or routine follow-up (42 346 mammograms [1.8%]). Mammograms given an assessment score of 3 with a recommendation for immediate follow-up (17 596 mammograms [0.8% of total cases]) were recorded as having a score of 0 (positive), because the 0 assessment more appropriately matches the recommendation. The cancer detection rate was calculated (by D.L.M.) separately for each radiologist as the number of breast cancers detected per 1000 screening mammograms (invasive carcinoma or ductal carcinoma in situ diagnosed within 1 year after an examination with a positive result).
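As a quick consistency check on the definitions above, the following minimal sketch tallies the reported BI-RADS counts and computes a pooled detection rate per 1000 examinations. The counts and the 9030 cancers are taken from the text; the function and variable names are illustrative assumptions, and the pooled rate is not the same quantity as the mean of the per-radiologist rates.

```python
# BI-RADS assessment counts as reported in the text (names are illustrative).
POSITIVE = {
    "0 (need additional imaging evaluation)": 176_922,
    "3 with immediate follow-up (recorded as 0)": 17_596,
    "4 (suspicious abnormality)": 9_540,
    "5 (highly suggestive of malignancy)": 1_244,
}
NEGATIVE = {
    "1 (negative)": 1_552_941,
    "2 (benign finding)": 488_543,
    "3 with short-interval or routine follow-up": 42_346,
}


def detection_rate_per_1000(cancers_detected, screening_exams):
    """Cancers detected per 1000 screening examinations."""
    if screening_exams <= 0:
        raise ValueError("screening_exams must be positive")
    return 1000.0 * cancers_detected / screening_exams


total_exams = sum(POSITIVE.values()) + sum(NEGATIVE.values())
positive_fraction = sum(POSITIVE.values()) / total_exams
# 9030 is the number of cancers quoted in the abstract for the whole data set.
pooled_rate = detection_rate_per_1000(9030, total_exams)

print(f"total examinations: {total_exams}")
print(f"fraction positive:  {positive_fraction:.3f}")
print(f"pooled rate / 1000: {pooled_rate:.2f}")
```

The category counts add up to the 2 289 132 examinations quoted in the abstract, and the pooled rate agrees with the 3.94 per 1000 cited in the Discussion for screening mammography.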
Statistical Analyses
We estimated the population probability distribution of observed single-reader cancer detection rates by scaling the histogram of the calculated single-reader cancer detection rates. We assumed that this probability distribution accurately reflects single-reader cancer detection rates observed in trials. At least three sources of variability affect the observed single-reader cancer detection rate: (a) the single-reader screening mammogram volume, which is associated with intrareader variability in the observed single-reader cancer detection rate (smaller volume produces greater variability); (b) interreader variability, that is, the between-reader variability in cancer detection rates that would persist even when measured from infinitely large mammogram volumes; and (c) possible variation in patient population demographics and breast cancer incidence. We did not separate these sources of variability because the observed single-reader cancer detection rates in future trials would be influenced by all these sources of variability combined. Probability distributions of single-reader cancer detection rate were estimated separately for 10 single-reader screening mammogram volumes.
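One way to read "scaling the histogram" is to bin the per-reader rates and divide each bin count by the number of readers, so that the bin heights sum to 1. The sketch below illustrates that reading only; the bin width, the function name, and the example rates are hypothetical and are not taken from the study.

```python
from collections import Counter


def empirical_distribution(rates, bin_width=0.5):
    """Scale a histogram of per-reader rates into a probability distribution.

    Returns a dict mapping each bin's lower edge to the fraction of readers
    whose observed rate falls in that bin.
    """
    if not rates:
        raise ValueError("rates must not be empty")
    counts = Counter(int(r // bin_width) for r in rates)
    n = len(rates)
    return {k * bin_width: c / n for k, c in sorted(counts.items())}


# Hypothetical single-reader rates (cancers per 1000 examinations).
example_rates = [2.1, 3.4, 3.9, 4.2, 4.4, 5.0, 6.3, 3.7]
dist = empirical_distribution(example_rates)
for edge, prob in dist.items():
    print(f"[{edge:.1f}, {edge + 0.5:.1f}): {prob:.3f}")
```

Because the distribution is built directly from observed rates, it carries all three sources of variability at once, which matches the decision not to separate them.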
To compare breast cancer detection rates of mammography and a hypothetical new modality observed in trials, one of us (Y.J.) calculated the probability distribution (in replicated trials) of observed multireader mean cancer detection rates from the probability distribution of observed single-reader cancer detection rates (Appendix). For such calculation, single-reader cancer detection rates will be uncorrelated across modalities if cancer detection rates for two modalities are measured from two different groups of readers and two different cohorts of patients. However, if trial investigators compare cancer detection rates from the same group of readers or from the same patient cohort, then the single-reader cancer detection rates will be correlated across modalities. This correlation will give rise to higher statistical power in trials. However, the data needed for estimating this correlation do not exist until after a new modality has already been used extensively in clinical practice.
Therefore, we substituted the correlation between mammography and itself for the correlation between mammography and a new modality, and we assumed that the former correlation would be at least as strong as the latter correlation. One of us (D.L.M.) measured the correlation between mammography and itself (which also depends on the single-reader mammogram volume) from two single-reader cancer detection rates (rate 1, mammograms obtained in 1996, 1998, 2000, or 2002; rate 2, mammograms obtained in 1997, 1999, or 2001) for 392 readers who each contributed at least 500 screening mammograms to the calculation of each rate. These two rates were chosen to minimize the effect of gradual longitudinal changes. Analysis of larger case volumes was unreliable owing to small numbers of qualified readers.
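The split-period comparison can be illustrated with a small sketch that correlates each reader's rate from the even-numbered years with the same reader's rate from the odd-numbered years. The text does not state which correlation measure was used, so the Pearson coefficient here is an assumption, and the example rates are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical per-reader rates (per 1000 examinations), one entry per reader.
rate_even_years = [3.1, 4.8, 2.5, 5.9, 4.0]  # 1996, 1998, 2000, 2002
rate_odd_years = [3.6, 4.1, 3.0, 5.2, 4.4]   # 1997, 1999, 2001
r = pearson_r(rate_even_years, rate_odd_years)
print(f"split-period correlation: {r:.2f}")
```

Interleaving the years, rather than comparing an early block with a late block, keeps a slow drift in a reader's performance from showing up as disagreement between the two rates.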
To calculate the probability of trial outcomes, we postulated that, compared with standard mammography, a new modality would depict one additional cancer for every reader per 1000 screening examinations. Given this postulate, one of us (Y.J.) calculated, for a multireader trial comparing a new modality against mammography, (a) the probability of observing no increase (or observing a decrease) in the cancer detection rate; (b) the probability of observing at least half of the postulated improvement (ie, one-half or more additional cancers detected per 1000 screening examinations); and (c) the power for detecting a statistically significant increase in the cancer detection rate.
Statistical power was defined as the probability of correctly rejecting the null hypothesis of no difference at a significance level of .05, given the alternative hypothesis of an increase of one additional cancer detected per reader per 1000 screening examinations. The calculation (Appendix) was performed numerically with a custom-written computer program that was validated by means of analytical calculation of hypothetical normally distributed observed single-reader cancer detection rates. (The numerical calculations were exact in the sense that, although discrete, they reproduced the results of the analytic calculations apart from negligible numerical errors.) Normality was not assumed in the calculation except for validation of the methods.
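For the normally distributed validation case, power has a closed form. The sketch below is an illustration under stated assumptions, not the authors' program: it assumes a one-sided test, equal numbers of readers in each arm, and a common standard deviation of observed single-reader rates; the function name, its parameters, and the example standard deviation of 2.5 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_normal_approx(n_readers, sd_single_reader, delta=1.0,
                        alpha=0.05, correlation=0.0):
    """Power to detect an increase `delta` in the multireader mean rate.

    Assumes normally distributed single-reader rates and a one-sided test.
    The variance of the observed difference in means is
    2 * sd^2 * (1 - correlation) / n_readers.
    """
    if n_readers <= 0 or sd_single_reader <= 0:
        raise ValueError("n_readers and sd_single_reader must be positive")
    if not -1.0 <= correlation < 1.0:
        raise ValueError("correlation must be in [-1, 1)")
    se = sqrt(2.0 * sd_single_reader ** 2 * (1.0 - correlation) / n_readers)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    return 1.0 - NormalDist().cdf(z_crit - delta / se)


# Hypothetical standard deviation of 2.5 cancers per 1000 examinations.
for n in (10, 25, 50, 100):
    print(n, round(power_normal_approx(n, 2.5), 3))
```

The `correlation` argument shows why reusing the same readers or patients across modalities helps: a positive correlation shrinks the variance of the observed difference and therefore raises power.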
| RESULTS |
|---|
|
|
|---|
500 cases per reader per time period). Inclusion of this correlation reduced by 1%–6% the probability of observing no increase (or observing a decrease) in the cancer detection rate, increased by 4%–5% the probability of observing at least half the postulated increase for the new modality, and increased by 1%–13% the power for detecting a significant increase (Fig 3). Using sample sizes of published trials and a hypothetical 70 000-patient-per-arm trial, we calculated the statistical power for trials of full-field digital mammography and computer-aided detection based on the postulated one additional cancer detected per reader per 1000 screening examinations for these new modalities. These power projections do not correspond to the actual power of the trials because, owing to a lack of necessary data, we postulated the benefit of the new modalities and calculated statistical power without including the effect of the correlation (increase in power) achieved by using the same readers or the same patient cohort across modalities. With these postulates, most published studies (Table 2) did not have sufficient numbers of patients and radiologists to achieve 80% power to demonstrate a statistically significant increase if the true increase in cancer detection rate were one cancer per 1000 screening examinations.
|
For a hypothetical study involving 70 000 patients per arm (Table 3), power varied from about 35% for trials that consisted of eight readers each with a case volume of 9000 examinations to about 84% for trials that consisted of 93 readers each with a case volume of 750 examinations, indicating greater power for trials that consist of more readers.
|
| DISCUSSION |
|---|
|
|
|---|
We found that it can be difficult to demonstrate significant improvement in the breast cancer detection rate for a new modality in clinical trials, because large numbers of patients and radiologists are required to achieve adequate statistical power. To detect a uniform increase in the cancer detection rate of one cancer per 1000 screening examinations in a two-arm trial with equal sample sizes in each arm would require approximately 93 radiologists and 70 000 patients per arm if each radiologist read a mean of 750 examinations, or 25 radiologists and 225 000 patients per arm if each radiologist read a mean of 9000 examinations. Detection of smaller, more realistic increases in the cancer detection rate would require even larger trials. Large patient samples are necessary for accrual of sufficiently large numbers of cancers. Large radiologist samples are an additional independent requirement for trial sample sizes.
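The patient totals quoted above follow from multiplying the number of radiologists by the mean number of examinations each one reads. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def patients_per_arm(n_readers, exams_per_reader):
    """Patients needed in one arm if each reader reads a distinct set of cases."""
    return n_readers * exams_per_reader


# (radiologists, mean examinations per radiologist), as quoted in the text.
designs = {
    "many low-volume readers": (93, 750),
    "fewer high-volume readers": (25, 9000),
}
for label, (readers, volume) in designs.items():
    print(f"{label}: {patients_per_arm(readers, volume)} patients per arm")
```

The first design gives 69 750 patients, which the text rounds to approximately 70 000; the second gives 225 000. Reaching the same power with fewer readers therefore costs roughly three times as many patients.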
The requirement of large numbers of patients and radiologists to obtain adequate statistical power suggests that it could be unduly burdensome to demonstrate improved cancer detection rates in clinical trials for a new breast imaging modality compared with the detection rates achieved by using the standard of care. It raises the question of whether it is effective, and indeed possible, to evaluate every new modality in randomized controlled trials with cancer detection rate as the end point. Unfortunately, our analysis results do not suggest an alternative solution, except that of using the same patient and radiologist cohorts across modalities, which increases statistical power by increasing correlation. Some increase in power can occur from matching low-case-volume radiologists across modalities; more increase can be expected from matching high-case-volume radiologists and matching patients across modalities. Statistical methods can also be used to adjust for differences in patient cohorts and increase the power (16).
The requirement of large numbers of patients and radiologists to achieve adequate statistical power also suggests that introducing new, better modalities into clinical practice can be difficult. If routine audits of individual-practice performance parameters become commonplace, then one will be able to monitor whether practices that use new, better modalities demonstrate consistently higher cancer detection rates than do comparable practices (with similar patient demographics and cancer incidence) that use conventional mammography. However, contrary to naïve expectations, the large variability associated with ascertainment of the breast cancer detection rate can make it difficult to observe an increase in the cancer detection rate in individual practices over short periods of time, giving rise to the possibility of inconsistent data. Inconsistent data on new modalities could cause confusion and controversy for consumers, payers, and public health policy makers and could even lead to erroneous calls for abandonment of new and better modalities.
Our study had a number of limitations. First, our analysis involved the use of observed single-reader cancer detection rates, which are more difficult to measure precisely than are multireader mean cancer detection rates. Our approach was necessary for estimating the outcome and statistical power of multireader trials involving an arbitrary number of readers without carrying out such trials.
Second, the distributions of single-reader cancer detection rates were estimated from 510 radiologists in the United States; thus, the results may be specific to the patient and radiologist populations of this study and not necessarily applicable to other populations. However, meaningful inferences can be drawn from this analysis because (a) the data used in this analysis were from seven regional breast cancer mammography registries in the United States, which represent a large sample of screening mammography in the United States, and were obtained from women with demographic characteristics similar to those of women in the general U.S. population (20); (b) the statistical analysis was tolerant of uncertainties in estimating the distribution of single-reader cancer detection rates, because it involved multiple summations of distributions for which the central-limit theorem applies; and (c) the 95% confidence intervals for power estimates were reasonably narrow.
Third, the postulated increase of one additional cancer detected per 1000 screening examinations is large, given the low breast cancer prevalence in average-risk screening populations. In our study population, the breast cancer detection rates for all methods combined and for screening mammography were 5.09 and 3.94 cancers per 1000 examinations, respectively, suggesting that one additional cancer detected per 1000 screening examinations is near the upper limit. Smaller increases in cancer detection rate would require even larger studies than those reported here.
Fourth, the postulated uniform increase in cancer detection rate is a simplistic possibility of how the use of new modalities may improve the cancer detection rate. Fifth, the results of this study are based largely on mathematical modeling and therefore need to be validated independently in observational studies. Finally, we did not analyze recall rate or specificity in this study. Imprecision in ascertainment and interreader variability for these measures will affect trial outcomes in a manner similar to imprecision in ascertainment and interreader variability for cancer detection rate reported herein.
In summary, low breast cancer prevalence in an average-risk screening population and interradiologist variability cause large variation in observed cancer detection rates, which can mask improvements in the cancer detection rates achieved with new breast imaging modalities in clinical trials and clinical practices. This masking effect may partially explain some of the lack of positive feedback on the initial clinical use of new technologies such as computer-aided detection, even when the technologies have been shown in preclinical retrospective trials to be beneficial. Investigators must consider interradiologist variability when calculating statistical power for trials of new screening modalities.
| APPENDIX |
|---|
|
|
|---|
Second, we calculated the distribution of multireader mean observed cancer detection rates for a postulated increase of one additional cancer detected per 1000 screening mammograms for every radiologist. This distribution was identical to the first distribution of mean cancer detection rates, with the exception that the mean of the second distribution was higher by an amount equal to the postulated increase. Last, we calculated the distribution of observed increases in the cancer detection rate from convolution of the two mean observed cancer detection rate distributions. If the multireader mean cancer detection rate were normally distributed, then the observed increase would also be normally distributed, with a mean equal to the postulated increase in the cancer detection rate (one cancer per 1000 examinations) and a variance twice as great as that of the multireader mean cancer detection rate. Therefore (and this is not necessarily intuitive), the observed increases in cancer detection rate were more variable than were the observed multireader mean cancer detection rates. The analytical expectations of normally distributed cancer detection rates were used to validate the method of numerical calculation that was used to calculate the results reported here.
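The convolution steps for uncorrelated arms can be sketched on a small discrete grid. This is an illustration under stated assumptions: the single-reader distribution, the grid, and the reader count are hypothetical, and plain direct convolution is used here instead of a Fourier-based method.

```python
def convolve(p, q):
    """Direct discrete convolution of two probability mass functions."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def moments(pmf, origin, step):
    """Mean and variance of a pmf whose k-th entry sits at origin + k * step."""
    mean = sum((origin + k * step) * w for k, w in enumerate(pmf))
    var = sum((origin + k * step - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var


# Hypothetical single-reader distribution on rates 1.0, 1.5, ..., 4.5 per 1000.
ORIGIN, STEP = 1.0, 0.5
single_reader = [0.05, 0.10, 0.20, 0.25, 0.20, 0.10, 0.06, 0.04]
N_READERS, SHIFT = 10, 1.0  # SHIFT is the postulated increase

# Distribution of the sum over readers, read on a finer grid as the mean.
sum_pmf = [1.0]
for _ in range(N_READERS):
    sum_pmf = convolve(sum_pmf, single_reader)
mean_step = STEP / N_READERS
_, var_single = moments(single_reader, ORIGIN, STEP)
_, var_mean = moments(sum_pmf, ORIGIN, mean_step)

# Observed increase = (standard mean + SHIFT) - (independent standard mean).
diff_pmf = convolve(sum_pmf, sum_pmf[::-1])
diff_origin = SHIFT - (len(sum_pmf) - 1) * mean_step
mean_diff, var_diff = moments(diff_pmf, diff_origin, mean_step)

print(f"variance of multireader mean: {var_mean:.4f}")
print(f"variance of observed increase: {var_diff:.4f}")
print(f"mean of observed increase: {mean_diff:.4f}")
```

The sketch reproduces the two properties stated above without assuming normality: the observed increase is centered on the postulated increase, and its variance is twice that of the multireader mean.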
For studies to measure the cancer detection rate of two modalities for the same readers (ie, considering correlation), we first calculated the probability of the observed single-reader increase in cancer detection rate before calculating the probability of the mean observed multireader increase. The distribution of observed single-reader increases in cancer detection rate was calculated from a bivariate joint-probability distribution of observed single-reader cancer detection rates measured from the two time periods. A postulated increase of one additional cancer detected per 1000 screening mammograms was added marginally to one of the two cancer detection rates in this distribution. Then, the probability of the observed single-reader increase in cancer detection rate was calculated by integrating this bivariate joint distribution along a diagonal direction. Finally, the distribution of observed multireader mean increases in cancer detection rate was calculated from the distribution of observed single-reader increases by means of the Fourier technique. The 95% confidence intervals (for calculations including and excluding the effect of correlation) were based on 1000 bootstrapping samples of the observed single-reader cancer detection rates.
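The bootstrap step can be sketched as follows. The text gives only the number of bootstrap samples (1000), so the percentile method, the choice of the mean as the statistic, the function name, and the example rates are all assumptions made for illustration; in the study the resampled quantity fed a power calculation rather than a simple mean.

```python
import random
from statistics import mean


def bootstrap_ci(values, statistic=mean, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic(values)`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample readers with replacement and recompute the statistic each time.
    stats = sorted(statistic(rng.choices(values, k=n)) for _ in range(n_boot))
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * n_boot)]
    upper = stats[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lower, upper


# Hypothetical single-reader rates (cancers per 1000 examinations).
rates = [2.1, 3.4, 3.9, 4.2, 4.4, 5.0, 6.3, 3.7, 2.8, 4.9]
low, high = bootstrap_ci(rates)
print(f"95% interval for the mean rate: {low:.2f} to {high:.2f}")
```

Resampling whole readers, rather than individual examinations, keeps each reader's volume and detection rate together, which is in keeping with treating the observed single-reader rate as the unit of analysis.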
| ADVANCES IN KNOWLEDGE |
|---|
|
|
|---|
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
| References |
|---|
|
|
|---|