Published online before print January 19, 2006, 10.1148/radiol.2382041684
(Radiology 2006;238:809-815.)
© RSNA, 2006
Organized Breast Screening Programs in Canada: Effect of Radiologist Reading Volumes on Outcomes1
Andrew J. Coldman, PhD,
Diane Major, PhD,
Gregory P. Doyle, MSc,
Yulia D'yachkova, MSc,
Norm Phillips, MA, MSc,
Jay Onysko, BA, MA,
Rene Shumak, MD, DMR, FRCP(C),
Norah E. Smith, RTR and
Nancy Wadden, BSc, MD, FRCPC
1 From Population and Preventive Oncology, British Columbia Cancer Agency, 686 W Broadway, Suite 800, Vancouver, BC, Canada V5Z 1G1 (A.J.C., Y.D., N.P.); Institut National de Santé Publique du Québec, Québec City, Québec, Canada (D.M.); Newfoundland Screening Program, St John's, Newfoundland, Canada (G.P.D.); Screening and Early Detection Section, Health Canada, Ottawa, Ontario, Canada (J.O.); Cancer Care Ontario, Toronto, Ontario, Canada (R.S.); Queen Elizabeth Prince Edward Island Mammography Department, Charlottetown, Prince Edward Island, Canada (N.E.S.); and Diagnostic Imaging, St Clare's Mercy Hospital, St John's, Newfoundland, Canada (N.W.). Received September 30, 2004; revision requested December 14; revision received January 20, 2005; accepted February 21; final version accepted April 20.
Address correspondence to A.J.C. (e-mail: acoldman{at}bccancer.bc.ca).
ABSTRACT
Purpose: To examine retrospectively the relationship between radiologist screening program reading volumes and interpretation results.
Materials and Methods: This research project was reviewed by the University of British Columbia Research Ethics Board. Informed patient consent was not required. Data were requested from Canadian provincial screening programs for the period 1988–2000. Cancer detection rates, abnormal interpretation rates, and positive predictive values (PPVs) were calculated for individual radiologists in those programs. Multivariate Poisson mixed regression models were used to examine the effect of patient age, screening examination sequence (first or subsequent screening examination), province, radiologist reading volume, and interradiologist differences on cancer detection rate, abnormal interpretation rate, and PPV.
Results: The results of the interpretation of 1 406 678 screening mammograms by 304 radiologists from seven provincial programs were analyzed. Cancer detection rate, abnormal interpretation rate, and PPV all varied according to age of woman screened and screening sequence and across the sample of radiologists. None of the rates varied by province. Neither the cancer detection rate nor the abnormal interpretation rate varied by reading volume, but the average PPV was increased by 34% for volumes over 2000 mammograms versus volumes of 480–699 mammograms per year. There was no evidence that the magnitude of variability around the average, for radiologists reading the same volume of mammograms, varied across different volume groups for any of the outcome measures.
Conclusion: Cancer detection did not vary with reading volume. The average PPV for individual radiologists increased as reading volume rose up to 2000 mammograms per year; it stabilized at higher volumes.
INTRODUCTION
Mammographic screening for breast cancer is recognized as an effective public health measure, although some controversy persists about the age group in which it should be used (1). The interpretation of mammograms by radiologists is a complex process, and substantial variation exists (2). It is recognized that one must continually practice mammographic interpretation to maintain skills, and standards exist for minimum annual interpretation volumes. In North America (the United States and Canada) such volumes are comparatively low, at 480 mammograms per year, compared with the 2000 mammograms required in Australian screening programs (3) and the 5000 mammograms required in United Kingdom screening programs (4). The relationship between reading volume and accuracy of interpretation has received considerable attention, with some investigators finding a relationship (5,6), others finding none (7), and still others suggesting that other radiologist characteristics are more important (8).
Although outcome in clinical practice is frequently viewed as the reference standard measurement for accuracy (9), it is often difficult to measure in screening mammography because of the rarity of cancer findings. There may also be confounding between the organization of the clinical environment, the relative amounts of screening and diagnostic breast radiology, the patient population, and radiologist characteristics when outcomes are examined in clinical situations. Many of these difficulties can be mitigated by the examination of outcomes in large population-based screening programs.
All of the provinces and territories of Canada offer screening mammography services to female residents through organized screening programs that have been in existence for up to 15 years (10). Although not identical, these programs are quite similar and offer screening only to self-referred asymptomatic women (11). The provincial programs contribute to a national database on screening mammography performance that uses standardized variable definitions (12); this database includes the results of more than 4 million screening examinations. The purpose of this study was to examine retrospectively the relationship between radiologist screening program reading volumes and interpretation results.
MATERIALS AND METHODS
Program Participation
Ethical approval for the conduct of this research was obtained from the University of British Columbia Research Ethics Board. Informed patient consent was not required. Provincial breast cancer screening programs participating in the Canadian Breast Cancer Screening Initiative (11) were invited to participate in the study. This initiative is supported by Health Canada, a department of the Canadian Government; this department also provided support for an investigators' meeting and analysis and data collection costs. Each program maintains longitudinal data on screening examinations conducted within the program and captures data on subsequent assessment of abnormal screening examinations, which may be performed in designated centers or within the community. Data on screening-detected cancers are obtained through this follow-up mechanism and supplemented through record linkage with cancer registries in each province.
Data Collection
Programs that agreed to participate in the study submitted data to the British Columbia Screening Mammography Program, where analysis was conducted. The data set consisted of counts of the number of screening examinations, abnormal mammographic interpretations, and cancers detected with mammography; these counts were tabulated according to encrypted radiologist identifying number, province, year, age of woman at screening (40–49, 50–59, 60–69, or 70–79 years), and screening examination sequence (first or subsequent screening examination) for the 3 years 1998–2000. Abnormal interpretations were those for which the radiologist indicated the need for further investigation, and screening-detected cancers were any cancers discovered as a result of such an investigation. Cancers detected subsequent to an abnormal screening examination but not as a result of the diagnostic investigation were not included as screening-detected cancers. Postscreen or interval cancers were not considered because of concerns from some provinces about the completeness of record linkage with cancer registries. All mammographic screening data collected related only to activity performed within the screening programs.
The total number of screening examinations performed by each radiologist during the 3-year study period was divided by three to estimate the average annual reading volume. Year-to-year variability in radiologist reading volumes was examined by comparing the individual average annual volume with the corresponding annual counts for each year of data. For radiologists who participated in a screening program for only a portion of the study period, averages were based on their participation periods.
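The averaging rule described above can be sketched as follows. This is an illustration only, not the authors' code: the function names are hypothetical, and the 480-per-year eligibility threshold is the one stated in the Results.

```python
def average_annual_volume(total_reads: int, years_participating: float = 3.0) -> float:
    """Average annual reading volume over the radiologist's participation period.

    For radiologists present for the whole study, years_participating is 3;
    for partial participation, pass the actual participation period.
    """
    if years_participating <= 0:
        raise ValueError("years_participating must be positive")
    return total_reads / years_participating


def meets_minimum(total_reads: int, years_participating: float = 3.0,
                  minimum: float = 480.0) -> bool:
    """True if the average annual volume reaches the minimum (480 per year in this study)."""
    return average_annual_volume(total_reads, years_participating) >= minimum


# A radiologist with 1500 program screens over 3 years averages 500 per year.
print(average_annual_volume(1500))   # 500.0
# 900 screens over a 2-year participation period averages 450 per year.
print(meets_minimum(900, 2.0))       # False
```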
Statistical Analysis
Age- and screening sequence-specific values were calculated for abnormal interpretation rates, cancer detection rates, and positive predictive values (PPVs). These were computed, respectively, as the proportion of screening mammograms in which further investigation was recommended, the proportion of screening mammograms in which further investigation was recommended and cancer was detected, and the proportion of recommendations for further investigation after which cancer was detected. Multivariate hierarchical Bayesian analyses of these outcomes were performed; the outcome was assumed to be distributed as a Poisson variable (13). The mean value of the Poisson variable was assumed to be log linear in the regression coefficients, including the interradiologist effect, with the number of screening examinations or number of abnormal interpretations used as an offset as appropriate.
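In symbols, the three outcome measures and the log-linear Poisson model described above might be written as follows. The notation is an assumption introduced here for illustration and does not appear in the original: N is the number of screening examinations, A the number of abnormal interpretations, C the number of screening-detected cancers, i indexes a covariate stratum, j indexes a radiologist, and n_ij is the offset (N for the two rates, A for the PPV).

```latex
\[
\text{abnormal interpretation rate} = \frac{A}{N}, \qquad
\text{cancer detection rate} = \frac{C}{N}, \qquad
\text{PPV} = \frac{C}{A} = \frac{C/N}{A/N}
\]
\[
Y_{ij} \sim \mathrm{Poisson}(\mu_{ij}), \qquad
\log \mu_{ij} = \log n_{ij} + \mathbf{x}_{i}^{\top}\boldsymbol{\beta} + u_{j}
\]
```

Here Y_ij is the event count (abnormal interpretations or cancers), x_i holds the indicators for age, screening sequence, province, and volume category, β is the vector of log relative risks, and u_j is the interradiologist effect.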
The model was fit by using WinBUGS (14). Regression coefficients, which are equal to the logarithm of the relative risk (RR), were assumed to be normally distributed with a mean of zero and a prior variance of 10^6. The interradiologist effect was assumed to be normally distributed with a mean of zero and a standard deviation of σ, which was assigned a uniform prior on (0, 100). Models were fit by running 110 000 iterations, discarding the first 10 000 iterations, and using every 100th iteration. Trace plots were examined to determine convergence, and we report the posterior mean and 95% posterior interval.
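The burn-in and thinning schedule leaves 1000 retained draws per parameter. The sketch below shows that bookkeeping and one way to summarize retained draws as a posterior mean and 95% interval. It is a generic illustration in Python rather than WinBUGS; the function names are hypothetical, and the exact starting index of the thinned sequence and the percentile-based interval are assumptions.

```python
import statistics


def thin_chain(chain, burn_in, thin):
    """Drop the first `burn_in` iterations, then keep every `thin`-th one."""
    return chain[burn_in::thin]


def posterior_summary(draws):
    """Posterior mean and equal-tailed 95% interval (2.5th and 97.5th percentiles)."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(draws, n=40, method="inclusive")
    return statistics.fmean(draws), cuts[0], cuts[-1]


# Stand-in chain: iteration indices only, to show how many draws survive.
chain = list(range(110_000))
kept = thin_chain(chain, burn_in=10_000, thin=100)
print(len(kept))  # 1000
```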
The analysis assumes that risks associated with each factor in the model act multiplicatively, as measured by the RR, to influence the overall rate at which a specified event occurs (eg, identification of a case of cancer for the cancer detection rate). Each radiologist is hypothesized to have his or her own RR, which also acts multiplicatively, and statistically each of these is viewed as a single sample from the population of radiologist RRs. In this analysis, we estimated the standard deviation of the distribution of the population of radiologist RRs. Interradiologist variation was summarized by the median RR between two radiologists chosen at random (the higher rate was divided by the lower). This general modeling approach has been previously used in a similar application (15).
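Under the stated normal assumption, the median RR between two randomly chosen radiologists has a closed form. If the two log RRs are independent draws from a normal distribution with standard deviation σ, their difference has standard deviation √2·σ, so the median of the larger-over-smaller ratio is exp(√2·σ·z), where z is the 75th percentile of the standard normal distribution. The sketch below uses that derivation; whether the authors computed the summary this way or directly from the sampled values is not stated, and the function names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# 75th percentile of the standard normal: the median of |Z|.
Z75 = NormalDist().inv_cdf(0.75)


def median_rr(sigma: float) -> float:
    """Median ratio (higher rate / lower rate) for two random radiologists."""
    return exp(sqrt(2.0) * sigma * Z75)


def sigma_from_median_rr(mrr: float) -> float:
    """Inverse of median_rr: the log-scale SD implied by a given median RR."""
    return log(mrr) / (sqrt(2.0) * Z75)


# With no interradiologist variation, the median RR is 1.
print(median_rr(0.0))  # 1.0
# The log-scale SD implied by the median RR of 1.19 reported for cancer detection.
print(round(sigma_from_median_rr(1.19), 3))
```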
RESULTS
Included Programs
The screening programs of the following provinces agreed to participate: Alberta, British Columbia, Manitoba, Newfoundland, Nova Scotia, Ontario, and Quebec. They provided results of 1 543 331 screening examinations to the study. Table 1 lists the distribution of screening examinations provided by each province and the number of radiologists providing interpretations. Individual radiologist reading volumes were based on these data. Radiologists who did not interpret an average of 480 or more program screens per year over the study period were not included in subsequent analysis. This caused the exclusion of 280 radiologists (Table 1) who interpreted 134 412 screening examinations. A further 2241 screening examinations from Manitoba and Newfoundland were excluded because they were provided to women younger than 50 years or older than 70 years, age ranges that are outside the age range eligible for the programs in these provinces. Subsequent analysis was based on 1 406 678 screening examinations whose results were interpreted by 304 radiologists.
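The exclusion counts in this paragraph reconcile with the analyzed total, as this short check shows. The function and parameter names are illustrative only.

```python
def analyzed_examinations(submitted: int, low_volume: int, age_ineligible: int) -> int:
    """Examinations remaining after both exclusion steps."""
    return submitted - low_volume - age_ineligible


remaining = analyzed_examinations(
    submitted=1_543_331,      # examinations provided by the seven programs
    low_volume=134_412,       # read by radiologists averaging under 480 per year
    age_ineligible=2_241,     # Manitoba and Newfoundland, outside eligible ages
)
print(remaining)  # 1406678
```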
Reading Volumes
Screening recommendations vary by province, with all programs providing screening to women aged 50–69 years and some providing screening to women aged 40–49 or 70–79 years. The Ontario data included no screening examinations for women under the age of 50 years, and the Quebec data included only first screening examinations to women between 50 and 69 years of age because the program was created in 1998. Table 2 provides a distribution of the screening data by province, age of woman, and screening sequence. The distribution of radiologist reading volume varied among provinces, reflecting the different ages and organizational structures of the provincial programs. For subsequent analysis, radiologist reading volume was categorized according to convenient cut points. For the volume categories in Table 3, 72% of the annual volumes for radiologists were in the same category as their 3-year averages, with another 26% in an adjacent category. Two percent of annual volumes differed by more than one category from their 3-year average.
Cancers
There were a total of 7031 screening-detected cancers, for an average cancer detection rate of 5.0 per 1000 screening examinations. Cancer detection rates for first and subsequent screening examinations were 6.0 and 4.2 per 1000, respectively. Cancer detection rates increased with the age of the women screened and were higher at all ages for first versus subsequent screening examinations. The overall abnormal interpretation rate was 7.7 per 100 screening examinations. The abnormal interpretation rate differed between first and subsequent screening examinations, with rates of 10.7 and 5.3 per 100, respectively. Abnormal interpretation rates declined slightly with age for both first and subsequent screening examinations. The average PPV was 6.5 per 100 abnormal interpretations. PPVs were greater for subsequent than for first screening examinations: 7.9 versus 5.6 per 100 abnormal examinations. PPVs increased with age for both first and subsequent screening examinations.
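The overall figures in this paragraph are mutually consistent, because the PPV equals the cancer detection rate divided by the abnormal interpretation rate. The sketch below reproduces the check. The count of abnormal interpretations is not reported in the text, so the value computed here is implied from the rounded 7.7% rate and is an approximation; the function names are hypothetical.

```python
def cancer_detection_rate_per_1000(cancers: int, screens: int) -> float:
    """Screening-detected cancers per 1000 screening examinations."""
    return 1000.0 * cancers / screens


def implied_ppv_per_100(cancers: int, screens: int, abnormal_rate: float) -> float:
    """PPV per 100 abnormal interpretations, with the abnormal count implied by the rate."""
    implied_abnormals = abnormal_rate * screens  # approximate: rate is rounded
    return 100.0 * cancers / implied_abnormals


cdr = cancer_detection_rate_per_1000(7_031, 1_406_678)
ppv = implied_ppv_per_100(7_031, 1_406_678, abnormal_rate=0.077)
print(round(cdr, 1))  # 5.0
print(round(ppv, 1))  # 6.5
```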
Predictors and Outcomes
Separate hierarchical analyses were used to determine the relationship between potential predictors and outcomes (cancer detection rate, abnormal interpretation rate, and PPV). The following variables were included in each analysis: age, screening sequence, province, average radiologist volume, and interradiologist effect. Tables 4–6 contain the results of these analyses for the cancer detection rate, abnormal interpretation rate, and PPV.
Table 4. Results of Hierarchical Poisson Modeling for Analysis of Cancer Detection Rate, Including Patient Age, Screening Sequence, Province, Radiologist Volume, and Interradiologist Variation
Table 5. Results of Hierarchical Poisson Modeling for Analysis of Abnormal Interpretation Rate, Including Patient Age, Screening Sequence, Province, Radiologist Volume, and Interradiologist Variation
Table 6. Results of Hierarchical Poisson Modeling for Analysis of PPV, Including Patient Age, Screening Sequence, Province, Radiologist Volume, and Interradiologist Variation
Age and screening sequence were related to all three outcomes (cancer detection rate, abnormal interpretation rate, and PPV); however, the strength of the relationships varied (Tables 4–6). Posterior density intervals (similar to confidence intervals) are provided in Tables 4–6. None of the three outcomes (cancer detection rate, abnormal interpretation rate, and PPV) appeared to vary by province, although one interval (that for cancer detection rate for Quebec) did not include unity.
Interradiologist variation was measured according to the median RR between two radiologists chosen at random. For example, it was 1.19 for the cancer detection rate (Table 4), indicating that, for two randomly chosen radiologists, one radiologist would have a cancer detection rate that was at least 19% greater than the other's rate in 50% of such random pairs. Interradiologist variation was present for all three outcomes, and none of the posterior intervals included unity. The interradiologist variation was smallest for the cancer detection rate, indicating that relative differences among radiologists for this outcome were smaller than those for the other outcomes considered. The abnormal interpretation rate had the highest interradiologist variability, with a median effect that was comparable in magnitude to the overall difference between first and subsequent screening examinations. Radiologist reading volume had no consistent effect on either the cancer detection rate or the abnormal interpretation rate, and the posterior intervals for all but one category included unity. However, there was a consistent pattern of increasing PPV with higher volumes, and posterior intervals for the RR did not include unity for volumes of more than 2000 screening examinations per year. Relative risks were similar for reading volume categories above 2000.
The interradiologist coefficients estimated from the model were plotted according to reading volume category for each of the outcomes considered. Variability in these coefficients is a measure of the difference between individual radiologist RRs in the same volume category. No patterns were observed, indicating that radiologist reading volumes had no effect on interradiologist variation.
DISCUSSION
There have been several analyses of radiologist characteristics and measures of accuracy in screening mammography (2,5–8,16–18). These studies have not all collected the same information and have identified various factors that influence accuracy, including radiologist reading volume (5,6), specialization (2), years since residency (7), and residency rotation in mammography (7). The way in which studies have evaluated accuracy has also varied; many have used results from individual radiologists in screening settings (2,6,8), while others have used test sets that were reviewed by all radiologists in controlled conditions (5,7). Few studies have attempted to compare the relationship between clinical and test performance in screening mammography, but one that did found little relationship (9).
Studies have used different measures of accuracy, with some using cancer detection rates and abnormal interpretation rates (2,4,6,8) and others using sensitivity and specificity (5,7). Methods for analyzing data have also differed; some authors just report rates (6,8), some use multivariate Bayesian models (15), and some use multivariate receiver operating characteristic methods (5,7). The aforementioned differences among studies make it difficult to compare their results and draw conclusions; however, it is clear that some attributes of radiologic practice are associated with accuracy.
There can be little doubt that reading volume has some influence on accuracy, because, like any skill, radiologic interpretation must be practiced. Screening performance is multidimensional, and different attributes may be differentially related to reading volumes, if they are related at all. Studies have evaluated international variation in screening performance and how this correlates with the reading volumes required of screening radiologists (4,19). However, there are also other practice differences among various countries, and different aspects of interpretation are emphasized (19).
In examining practice data on screening accuracy, one is faced with certain statistical issues that make analysis more difficult. Individual women vary in characteristics that influence the ease of radiologic interpretation, and studies such as ours may not capture this variation. Cancer is quite rare; therefore, for radiologists with low reading volumes, observed cancer detection rates will be subject to substantial random variation. For example, radiologists interpreting 500 screening mammograms per year would detect only an expected 7.5 cancers during 3 years in this study. Considerable fluctuations in individual cancer detection rates can be expected to occur by chance alone. This difficulty is alleviated in studies that use preselected enriched series, although care must be taken regarding context bias (20) and whether results are obtained in test conditions and not in the course of usual clinical care.
The problem of random variation is not as pronounced for the study of abnormal interpretation rates, or specificity, because 5%–10% of screening mammograms will be interpreted as abnormal in North American screening (19). Radiologists interpreting 500 screening mammograms per year can be expected to have about 100 abnormal interpretations over a 3-year period. Investigators who examine radiologist performance note considerable interradiologist variation, and some explicitly attempt to estimate it (15). The aforementioned sources of variation make it more difficult to identify factors that affect the accuracy of radiologic interpretation in clinical series. Thus, to identify such factors, one must have a large number of observations on many radiologists.
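The two worked examples above (an expected 7.5 cancers and about 100 abnormal interpretations for a radiologist reading 500 mammograms per year for 3 years) can be reproduced, together with a rough measure of how noisy each count is. For a Poisson count the standard deviation is the square root of the mean, so the relative variation is 1/√mean. This is a back-of-envelope sketch with hypothetical function names; it uses the overall rates reported in the Results and is not the hierarchical model fitted in the study.

```python
from math import sqrt


def expected_events(annual_volume: float, years: float, rate: float) -> float:
    """Expected event count for a given reading volume, duration, and per-screen rate."""
    return annual_volume * years * rate


def poisson_relative_sd(mean: float) -> float:
    """Standard deviation divided by mean for a Poisson count: 1 / sqrt(mean)."""
    return 1.0 / sqrt(mean)


cancers = expected_events(500, 3, 5.0 / 1000)   # overall detection rate: 5.0 per 1000
abnormals = expected_events(500, 3, 7.7 / 100)  # overall abnormal rate: 7.7 per 100

print(round(cancers, 1))                         # 7.5
print(round(poisson_relative_sd(cancers), 2))    # 0.37, i.e. about 37% chance variation
print(round(poisson_relative_sd(abnormals), 2))  # 0.09, i.e. about 9% chance variation
```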
Our study results indicate that interradiologist variation was one of the strongest influences on the abnormal interpretation rate and that it had an effect, as measured by the RR (Table 5), that approached the effect of differences between first and subsequent screening examinations. In contrast, interradiologist variation had a smaller RR for the cancer detection rate, an effect that was exceeded by the effect of age and screening sequence. This seems to mimic the effect seen in international comparisons, wherein cancer detection rates vary little (4) but abnormal interpretation rates vary considerably (4,19). Abnormal interpretation rates reflect the "set point" of the radiologist and will strongly influence his or her PPV. In this study, the PPV increased with the volume of screening examinations interpreted up to about 2000 annual screening examinations but then stabilized. Compared with the PPV observed for radiologists who interpreted 480–699 screening examinations, the relative improvement in PPV for those who interpreted 2000 or more screening examinations was approximately 30%. This improvement is a meaningful difference, although it is an association and one cannot exclude confounding or infer causality.
Like any research based on observational data, this study had some limitations. Data were not available on other variables that may affect radiologist performance, such as recent education and practice characteristics. Radiology practice in Canada includes a variable mix of diagnostic work that may also influence mammographic screening performance. Screening volumes were estimated by using only data from screening programs, but in some provinces, radiologists may be involved in screening outside a program. Any bias in volumes would most likely cause effects to be underestimated, although true confounding cannot be ruled out. There are also differences among the screening programs in the Canadian provinces that are related to radiologist volumes. Sixty-three percent of the study radiologists who read more than 3000 mammograms annually came from British Columbia, while 73% of radiologists who read fewer than 1000 mammograms came from Quebec. The program in British Columbia is the oldest in Canada, while the one in Quebec is the newest. Despite some differences among the individual provincial programs, there was no evidence from the analysis that these differences influenced the results, as evidenced by the paucity of significant province coefficients in the multivariate analyses (Tables 4–6). The similarities between the provincial programs are undoubtedly greater than their differences; hence, the results of this analysis of more than 1.4 million screening examinations whose results were interpreted by approximately 300 radiologists in real-world situations provide useful information on the determinants of performance.
This study revealed little evidence of a substantial relationship between screening volumes and the cancer detection rate in Canadian mass screening programs. There was a trend, as identified by the PPV, for radiologists with higher reading volumes to be better able to select for further investigation women who were likely to have breast cancer. This is important because in any clinical situation one wishes to minimize harms, both to the patient in terms of anxiety and to the health system in terms of cost, while providing the maximum benefit. The requirement by some Canadian screening programs of minimum annual volumes that are higher than the 480 mammograms specified by the Canadian Association of Radiologists is supported by the results of this analysis.
FOOTNOTES
Abbreviations: PPV = positive predictive value, RR = relative risk
Authors stated no financial relationship to disclose.
Author contributions: Guarantors of integrity of entire study, A.J.C., D.M., R.S., N.W.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; literature research, A.J.C., R.S., N.W.; clinical studies, R.S., N.W.; experimental studies, D.M.; statistical analysis, A.J.C., Y.D., N.P., N.E.S.; and manuscript editing, A.J.C., D.M., G.P.D., J.O., R.S., N.E.S., N.W.
References
1. Vainio H, Bianchini F, eds. Breast cancer screening. In: Vainio H, ed. IARC handbooks of cancer prevention. Vol 7. Lyon, France: IARCPress, 2002.
2. Sickles EA, Wolverton DE, Dee KE. Performance parameters for screening and diagnostic mammography: specialist and general radiologists. Radiology 2002;224(3):861–869.
3. Commonwealth Department of Human Services and Health. National program for the early detection of breast cancer: national accreditation requirements. Canberra, Australia: Commonwealth Department of Human Services and Health, 1994.
4. Smith-Bindman R, Chu PW, Miglioretti DL, et al. Comparison of screening mammography in the United States and the United Kingdom. JAMA 2003;290(16):2129–2137. [Published correction appears in JAMA 2004;291(7):824.]
5. Esserman L, Cowley H, Eberle C, et al. Improving the accuracy of mammography: volume and outcome relationships. J Natl Cancer Inst 2002;94(5):369–375.
6. Kan L, Olivotto IA, Warren Burhenne LJ, Sickles EA, Coldman AJ. Standardized abnormal interpretation and cancer detection ratios to assess reading volume and reader performance in a breast screening program. Radiology 2000;215(2):563–567.
7. Beam CA, Conant EF, Sickles EA. Association of volume and volume-independent factors with accuracy in screening mammogram interpretation. J Natl Cancer Inst 2003;95(4):282–290.
8. McKee MD, Cropp MD, Hyland A, Watroba N, McKinley B, Edge SB. Provider case volume and outcome in the evaluation and treatment of patients with mammogram-detected breast carcinoma. Cancer 2002;95(4):704–712.
9. Rutter CM, Taplin S. Assessing mammographers' accuracy. A comparison of clinical and test performance. J Clin Epidemiol 2000;53(5):443–450.
10. Shapiro S, Coleman EA, Broeders M, et al. Breast cancer screening programmes in 22 countries: current policies, administration and guidelines. International Breast Cancer Screening Network (IBSN) and the European Network of Pilot Projects for Breast Cancer Screening. Int J Epidemiol 1998;27(5):735–742.
11. Paquette D, Snider J, Bouchard F, et al. Performance of screening mammography in organized programs in Canada in 1996. The Database Management Subcommittee to the National Committee for the Canadian Breast Cancer Screening Initiative. CMAJ 2000;163(9):1133–1138.
12. Health Canada. Organized breast screening programs in Canada: 1999 and 2000 report. Ottawa, Canada: Minister of Public Works and Government Services Canada, 2003.
13. Congdon P. Bayesian statistical modelling. New York, NY: J Wiley, 2001.
14. Spiegelhalter D, Thomas A, Best N. WinBUGS version 1.2 user manual. Cambridge, England: Medical Research Council Biostatistics Unit, 1999.
15. Christiansen CL, Wang F, Barton MB, et al. Predicting the cumulative risk of false-positive mammograms. J Natl Cancer Inst 2000;92(20):1657–1666.
16. Nodine CF, Kundel HL, Lauver SC, Toto LC. Nature of expertise in searching mammograms for breast masses. Acad Radiol 1996;3(12):1000–1006.
17. Nodine CF, Kundel HL, Mello-Thoms C, et al. How experience and training influence mammography expertise. Acad Radiol 1999;6(10):575–585.
18. Elmore JG, Wells CK, Howard DH. Does diagnostic accuracy in mammography depend on radiologists' experience? J Womens Health 1998;7(4):443–449.
19. Elmore JG, Nakano CY, Koepsell TD, Desnick LM, D'Orsi CJ, Ransohoff DF. International variation in screening mammography interpretations in community-based programs. J Natl Cancer Inst 2003;95(18):1384–1393.
20. Egglin TK, Feinstein AR. Context bias: a problem in diagnostic radiology. JAMA 1996;276(21):1752–1755.