Radiology
DOI: 10.1148/radiol.220001257
(Radiology. 2001;220:787-794.)
© RSNA, 2001


Breast Imaging

Potential of Computer-aided Diagnosis to Reduce Variability in Radiologists’ Interpretations of Mammograms Depicting Microcalcifications 1

Yulei Jiang, PhD, Robert M. Nishikawa, PhD, Robert A. Schmidt, MD 2, Alicia Y. Toledano, ScD 3 and Kunio Doi, PhD

1 From the Kurt Rossmann Laboratories for Radiologic Image Research, Dept of Radiology (Y.J., R.M.N., R.A.S., K.D.), and the Depts of Anesthesia and Critical Care (A.Y.T.) and Health Studies (A.Y.T.), Univ of Chicago, 5841 S Maryland Ave, MC2026, Chicago, IL 60637. Received Jul 13, 2000; revision requested Aug 21; final revision received Jan 15, 2001; accepted Feb 15. Supported in part by NIH grant CA 60187. Address correspondence to Y.J. (e-mail: y-jiang@uchicago.edu).


    ABSTRACT
PURPOSE: To evaluate whether computer-aided diagnosis can reduce interobserver variability in the interpretation of mammograms.

MATERIALS AND METHODS: Ten radiologists interpreted mammograms showing clustered microcalcifications in 104 patients. Decisions for biopsy or follow-up were made with and without a computer aid, and these decisions were compared. The computer was used to estimate the likelihood that a microcalcification cluster was due to a malignancy. Variability in the radiologists’ recommendations for biopsy versus follow-up was then analyzed.

RESULTS: Variation in the radiologists’ accuracy, as measured with the SD of the area under the receiver operating characteristic curve, was reduced by 46% with computer aid. Access to the computer aid increased the agreement among all observers from 13% to 32% of the total cases (P < .001), while the κ value increased from 0.19 to 0.41 (P < .05). Use of computer aid eliminated two-thirds of the substantial disagreements in which two radiologists recommended biopsy and routine screening in the same patient (P < .05).

CONCLUSION: In addition to its demonstrated potential to improve diagnostic accuracy, computer-aided diagnosis has the potential to reduce the variability among radiologists in the interpretation of mammograms.

Index terms: Breast neoplasms, calcification, 00.81 • Breast neoplasms, diagnosis, 00.30 • Computers, diagnostic aid • Diagnostic radiology, observer performance


    INTRODUCTION
Multiple investigators (1–4) have shown that considerable variability exists among radiologists in the interpretation of mammograms. This variability affects the diagnostic accuracy of radiologists, as measured with receiver operating characteristic (ROC) analysis. Moreover, it directly affects their clinical decisions to recommend either biopsy or follow-up. Because such variability decreases the clinical effectiveness of breast cancer screening, it should be eliminated whenever possible. Some (5–7) have suggested that computer-aided diagnosis (CAD), in which a radiologist combines an independent analysis of mammograms performed by using a computer technique with his or her own reading, can potentially reduce interpretation variability. However, to our knowledge, this potential of CAD has not yet been demonstrated. We analyzed data obtained in an observer study (8) to compare variabilities in the interpretation of mammograms with and without use of a computer aid. Previously, we analyzed the data of that observer study and found that radiologists can improve their diagnostic performance by using a computer aid (8). The purpose of this study was to evaluate whether CAD can reduce interobserver variability among radiologists in the interpretation of mammograms.


    MATERIALS AND METHODS
Case Materials
We obtained (from the University of Chicago Hospitals, Illinois) 104 mammograms of 46 consecutive malignant and 58 consecutive benign clustered microcalcifications that were examined at biopsy. Our institutional review board approved a waiver for patient consent for this study because our study involved only retrospective review of existing mammograms. We included only cases of microcalcification because our computer aid was specifically designed to analyze this common type of mammographic lesion (work on computer analysis of breast masses is ongoing [9]) and because microcalcifications are often the only mammographic indication of breast cancer (10).

Of the malignant cases, 37 were ductal carcinoma in situ, and nine were invasive ductal carcinoma. Of the benign cases, two were lobular carcinoma in situ, four were atypical ductal hyperplasia, 16 were hyperplasia without atypia, seven were adenosis, six were fibroadenoma, 18 were fibrocystic change or fibrosis, and five were breast tissue without specific abnormality.

Consecutive cases were collected by using the following criteria: (a) A cluster of microcalcifications was the only suspect lesion, which led to the biopsy and for which the pathologic results were definitive; (b) original mammograms, including at least two standard views and one magnification view, were available; and (c) the technical quality of the mammograms was adequate for interpretation (8). To balance the number of malignant and benign cases and thereby increase statistical power, the malignant cases were collected, necessarily, from a longer period (11,12). These cases were clinically evaluated before the Breast Imaging Reporting and Data System was implemented; therefore, they were not assigned a Breast Imaging Reporting and Data System assessment category (8). Additional specific details regarding case selection are reported elsewhere (8).

Radiologist Observers
Ten radiologists, who had experience in mammography but who had not previously seen the study cases, interpreted the mammograms. Five observers were practicing radiologists from the Chicago metropolitan area, and five were senior radiology residents from our institution. For the attending radiologists, mammography accounted for an average of 30% of their clinical practice, and they were certified readers according to the Mammography Quality Standards Act. They had been reading mammograms for an average of 9 years (median, 6 years; range, 1–30 years), and they had read at least 1,000 mammograms in the preceding year. The residents had limited experience from training rotations of 1–2 months duration. Written informed consent, as approved by our institutional review board, was obtained from all observers after the nature of the experiment was fully explained. Data analysis was performed for three observer groups: all observers (n = 10), attending radiologists (n = 5), and residents (n = 5).

Computer Aid
The computer aid was an estimate of the likelihood (0%–100%) that a microcalcification cluster was due to a malignancy. An artificial neural network calculated the estimate on the basis of eight image features that were automatically extracted from standard-view screen-film mammograms (13). Mammograms were digitized with a 0.1-mm pixel size and a 12-bit gray scale by using a digitizer (Lumiscan 100; Lumisys, Sunnyvale, Calif). Locations of microcalcifications were manually identified on a computer monitor (8).

The observers were explicitly instructed to use the computer aid in their interpretation. They were told that the computer output had a sensitivity (defined as the fraction of cancers for which biopsy would have been recommended) of approximately 90% and positive predictive value (defined as the fraction of all cases for which biopsy would have been recommended that were cancers) of approximately 61% when a threshold of 30% was applied to the computer-estimated likelihood of malignancy. The performance estimates of the computer were obtained from the study cases. One interpretation of this instruction is that any observer could have achieved the same accuracy as the computer by recommending biopsy only when the computer reported a likelihood of malignancy of 30% or greater.
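The two performance figures quoted to the observers follow directly from thresholding the computer output. The sketch below shows that calculation under the definitions given above; the function name and the five example cases are hypothetical and are not data from the study.

```python
def sensitivity_and_ppv(likelihoods, is_malignant, threshold=30.0):
    """Sensitivity and positive predictive value when biopsy is
    recommended for every case whose computer-estimated likelihood
    of malignancy (in %) is at or above the threshold."""
    tp = fp = fn = 0
    for likelihood, malignant in zip(likelihoods, is_malignant):
        biopsy = likelihood >= threshold
        if biopsy and malignant:
            tp += 1          # cancer sent to biopsy
        elif biopsy:
            fp += 1          # benign lesion sent to biopsy
        elif malignant:
            fn += 1          # cancer sent to follow-up
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Hypothetical computer outputs (%) and biopsy results, for illustration only.
example_likelihoods = [80, 45, 10, 35, 5]
example_malignant = [True, True, True, False, False]
sens, ppv = sensitivity_and_ppv(example_likelihoods, example_malignant)
```

Applied to the 104 study cases with a 30% threshold, the same calculation would correspond to the approximately 90% sensitivity and 61% positive predictive value reported to the observers.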

Data Acquisition
Each observer reviewed the cases twice: once with and once without the computer aid; each review was separated by an average of 30 days (range, 10–60 days). The following counterbalanced study design was used: Half of the mammograms were read without the computer aid in the first reading session and were read again with the computer aid in the second reading session; the other half of the mammograms were read first with the computer aid and then without the aid. The study design minimizes potential biases; it has been well documented (11,12) and has been described (8) in detail. The observers were asked to report (a) their level of confidence (on an analog scale of 0%–100%) that a lesion was malignant and (b) their clinical recommendation (Table 1).


TABLE 1. Clinical Recommendations Available to the Observers

 
Data Analyses
We assessed interpretation variability by using three methods: (a) sensitivity, specificity, and ROC analysis; (b) analysis of interobserver agreement; and (c) analysis of substantial disagreements in clinical recommendations. Interobserver variability was assessed in these analyses. Intraobserver variability was not measured because no observer repeated mammographic interpretation either with or without the computer aid. Custom software was used to perform all calculations except calculations of κ values, which were determined by using other software (SPLUS; MathSoft, Seattle, Wash). The Student t and McNemar χ² tests were used to calculate P values, and the bootstrap method was used to estimate the 95% CIs in the statistical analyses.

Sensitivity was defined as the fraction of cancers for which surgical biopsy or alternative tissue sampling was recommended. Specificity was defined as the fraction of benign lesions for which short-term or routine follow-up was recommended. Because sensitivity and specificity incompletely describe accuracy and because they depend on how a radiologist selects a decision threshold to define positive diagnoses, we also performed an ROC analysis, which is the standard method for evaluating observer accuracy (6,14,15). We obtained ROC curves by fitting the binormal model to the confidence data, and we obtained summary ROC curves for the 10 observers as a group by averaging the slope and intercept parameters of the individual curves (14). The area under the ROC curve (Az) was used as a summary index of accuracy. Az can have values between 0.5, which represents no apparent accuracy (diagnoses corresponding to random chance alone), and 1.0, which represents perfect accuracy.
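For readers unfamiliar with the binormal model, the following sketch shows how Az follows from the two fitted parameters and how a summary curve can be formed by averaging them. It uses the standard binormal relation Az = Φ(a / √(1 + b²)), where a is the intercept and b the slope; the fitting step itself (performed in the study with the LABROC4 program) is not reproduced, and the function names and parameter values are illustrative assumptions.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binormal_az(a, b):
    """Area under a binormal ROC curve with intercept a and slope b."""
    return normal_cdf(a / math.sqrt(1.0 + b * b))


def summary_parameters(observer_params):
    """Average the (a, b) pairs of individual observers to obtain
    the parameters of a summary ROC curve."""
    n = len(observer_params)
    a_mean = sum(a for a, _ in observer_params) / n
    b_mean = sum(b for _, b in observer_params) / n
    return a_mean, b_mean


# Hypothetical fitted parameters for three observers (not study data).
params = [(0.8, 1.0), (1.0, 0.9), (1.2, 1.1)]
a_bar, b_bar = summary_parameters(params)
az_summary = binormal_az(a_bar, b_bar)
```

With a = 0 the curve lies on the chance diagonal and the function returns 0.5, which matches the lower bound of Az described above.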

A histogram of interobserver agreement regarding clinical recommendations was constructed, and the κ statistic was computed. This histogram displayed the number of cases as a function of the number of observers in agreement. For the 10 observers, 11 patterns of agreement in the recommendations were possible; these patterns included 10 biopsy recommendations, nine biopsy and one follow-up recommendations, eight biopsy and two follow-up recommendations, and so on. For this analysis, we compared the recommendations for biopsy (option a or b in Table 1) versus those for follow-up (option c or d in Table 1), because this is the most important clinical decision. Separate histograms were constructed for cancers and benign lesions. Separate histograms were also constructed for attending radiologists, residents, and all radiologists. The histograms were similar for the three observer groups; we report only the summary histogram of all radiologists combined.
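The agreement histogram described above can be sketched as a simple tally. The code assumes each case is represented by a list of per-observer recommendations already reduced to biopsy (True) versus follow-up (False); the function name and the example data are hypothetical.

```python
from collections import Counter


def agreement_histogram(recommendations):
    """Count cases by the number of observers recommending biopsy.

    recommendations: one list per case, holding one boolean per observer
    (True = biopsy, False = follow-up).
    Returns a list of length n_observers + 1; entry k is the number of
    cases in which exactly k observers recommended biopsy.
    """
    n_observers = len(recommendations[0])
    tally = Counter(sum(case) for case in recommendations)
    return [tally.get(k, 0) for k in range(n_observers + 1)]


# Four hypothetical cases read by three observers.
example = [
    [True, True, True],     # complete agreement on biopsy
    [True, False, False],   # one biopsy, two follow-up
    [False, False, False],  # complete agreement on follow-up
    [True, True, True],
]
histogram = agreement_histogram(example)
```

For 10 observers the returned list has 11 entries, one for each agreement pattern listed above. Running the tally separately on cancers and benign lesions would give the two histograms the text describes.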

The κ statistic is widely used as a measure of agreement (16). It reflects the proportion of agreement after the proportion of agreement that can be attributed to chance alone is subtracted (17). κ equals 1 for perfect agreement, and κ equals 0 when the agreement can be attributed to chance alone. We computed the multireader κ value (18) and estimated the 95% CIs by using the bootstrap method.
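The exact multireader formula of reference 18 is not reproduced in this article. As a stand-in, the sketch below uses Fleiss' κ, a common multi-rater form, together with a percentile bootstrap over cases. Both choices, and all names and defaults in the code, are assumptions made for illustration and may differ from the original S-PLUS computation.

```python
import random


def fleiss_kappa(counts):
    """Fleiss' kappa. counts[i][j] is the number of readers who assigned
    case i to category j; every case must have the same number of readers."""
    n_cases = len(counts)
    n_readers = sum(counts[0])
    n_categories = len(counts[0])
    # Overall proportion of ratings falling in each category.
    p_cat = [sum(row[j] for row in counts) / (n_cases * n_readers)
             for j in range(n_categories)]
    # Observed agreement within each case, then averaged over cases.
    p_case = [(sum(c * c for c in row) - n_readers)
              / (n_readers * (n_readers - 1)) for row in counts]
    p_observed = sum(p_case) / n_cases
    p_chance = sum(p * p for p in p_cat)
    if p_chance == 1.0:
        return float("nan")  # undefined when only one category is used
    return (p_observed - p_chance) / (1.0 - p_chance)


def bootstrap_ci(counts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling cases."""
    rng = random.Random(seed)
    n = len(counts)
    values = []
    for _ in range(n_boot):
        sample = [counts[rng.randrange(n)] for _ in range(n)]
        k = fleiss_kappa(sample)
        if k == k:  # skip undefined (NaN) resamples
            values.append(k)
    values.sort()
    lower = values[int((alpha / 2) * len(values))]
    upper = values[int((1 - alpha / 2) * len(values)) - 1]
    return lower, upper
```

Each row of `counts` would hold, for one case, the number of observers recommending biopsy and the number recommending follow-up, so a row for 10 readers sums to 10.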

Using the definitions by Elmore et al (1), we defined substantial disagreement as a situation in which one radiologist recommended biopsy (option a or b in Table 1) and another recommended routine follow-up (option d in Table 1) in the same case (short-term follow-up was excluded from this particular analysis to emphasize extremes in decision making). Pairwise and per-patient frequencies of substantial disagreement were calculated. The pairwise frequency was the occurrence of substantial disagreement in all recommendation pairs (ie, recommendations made by two different observers in the same case). The total number of recommendation pairs was equal to the following: [number of cases × number of readers × (number of readers − 1)]/2. For 10 observers, there were a total of (104 × 10 × 9)/2, or 4,680, recommendation pairs. The per-patient frequency was the fraction of total cases (ie, 104 cases) in which different observers simultaneously recommended at least one biopsy procedure and at least one routine screening procedure. Because of the large differences in the denominators, the pairwise frequency tended to produce a low estimate, and the per-patient frequency tended to produce a high estimate of the substantial disagreement; neither was clinically accurate, because it was unlikely that 10 radiologists would have independently evaluated the case in clinical practice. Because the true frequency of substantial disagreement was expected to be between the pairwise and per-patient frequencies and because, to our knowledge, no single accurate measure is known, we report both pairwise and per-patient results, as Elmore et al (1) did.
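The two frequencies defined above can be sketched as follows. Recommendations are encoded with the option letters a to d referred to in the text (a or b for biopsy, c for short-term follow-up, d for routine follow-up); the function names and the two example cases are hypothetical.

```python
from itertools import combinations

BIOPSY = {"a", "b"}   # surgical biopsy or alternative tissue sampling
ROUTINE = {"d"}       # routine follow-up


def total_pairs(n_cases, n_readers):
    """Number of recommendation pairs over all cases."""
    return n_cases * n_readers * (n_readers - 1) // 2


def is_substantial(r1, r2):
    """True when one reader recommends biopsy and the other routine follow-up."""
    return (r1 in BIOPSY and r2 in ROUTINE) or (r2 in BIOPSY and r1 in ROUTINE)


def disagreement_frequencies(recommendations):
    """Return (pairwise frequency, per-patient frequency).

    recommendations: one list per case with one option letter per reader.
    """
    pair_hits = 0
    case_hits = 0
    n_pairs = 0
    for case in recommendations:
        hits = sum(is_substantial(x, y) for x, y in combinations(case, 2))
        n_pairs += len(case) * (len(case) - 1) // 2
        pair_hits += hits
        case_hits += 1 if hits > 0 else 0
    return pair_hits / n_pairs, case_hits / len(recommendations)


# Two hypothetical cases read by three observers.
pairwise, per_patient = disagreement_frequencies([["a", "d", "c"], ["a", "b", "c"]])
```

In this toy example one of six pairs is in substantial disagreement while one of two cases is affected, which illustrates why the pairwise figure tends to be the lower of the two. `total_pairs(104, 10)` gives the 4,680 pairs quoted above.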


    RESULTS
Effect of the Computer Aid on Sensitivity, Specificity, and ROC Curves
Sensitivity and specificity data and summary ROC curves are shown in Figure 1. The ranges and averages of the sensitivity, specificity, and positive predictive values are shown in Table 2. For the group of all observers (n = 10), without the computer aid there was a range of 35% in sensitivity and 44% in specificity. When the computer aid was used, the range in sensitivity was reduced to 26%, but the range in specificity was essentially unchanged at 45%. Results for the groups of attending radiologists (n = 5) and residents (n = 5) were similar (Table 2). The average sensitivity, specificity, and positive predictive values increased significantly with the computer aid (8). Table 3 lists the Az values. The SD of Az values was reduced from 0.056 to 0.030, or 46%, with the computer aid.
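The 46% figure is the relative reduction in the SD of Az across observers. A short check of that arithmetic, using only the two SD values stated above (the function name is illustrative):

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


# SD of Az across the 10 observers, without and with the computer aid.
reduction = relative_reduction(0.056, 0.030)
print(f"{reduction:.1%}")  # prints 46.4%
```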



Figure 1. ROC curves and sensitivity and specificity data obtained from the interpretation of 104 mammograms by 10 radiologists. A cluster of microcalcifications was present in all cases; 46 cancers and 58 benign lesions were confirmed at biopsy. The effect of a computer aid was tested; it provided an estimate of the likelihood that microcalcifications were due to a malignancy. Sensitivity and specificity results were based on the radiologists’ recommendations for biopsy or follow-up. The ROC curves were based on the radiologists’ diagnostic confidence.

 

TABLE 2. Effect of CAD on Sensitivity, Specificity, and Positive Predictive Values

 

TABLE 3. Effects of CAD on Az

 
Effect of Computer Aid on Agreement in Recommendations
The histogram of interobserver agreement (Fig 2) provides detailed information concerning the extent of agreement, for both cancers and benign lesions, and the changes as a result of the computer aid. With the computer aid, complete agreement among all 10 radiologists was achieved in 20 (43%) cancer cases. Agreement in benign cases had a broader distribution. Highlights of Figure 2 are summarized in Table 4. Without the computer aid, complete agreement by all observers on a correct recommendation (biopsy for cancers and follow-up for benign lesions) occurred in nine cases (nine malignant and no benign lesions). With computer aid, the complete agreement on a correct recommendation increased to 26 cases (20 malignant and six benign lesions). Conflicting recommendations in which the minority consisted of more than 20% (ie, three to five of 10 observers or two of five observers) of the total observers occurred in 43 cases without aid; this number was reduced to 28 cases with the computer aid. Use of the computer aid improved agreement and reduced the occurrence of conflicting recommendations in all data categories (P < .05 with one exception of P = .07 in decreasing conflicting recommendations among residents; McNemar χ² test).



Figure 2. Histograms show the effect of CAD on the agreement in the recommendations for clinical management that were made by the 10 radiologists. Recommendations for biopsy versus any type of follow-up were made after the radiologists independently interpreted mammograms that depicted clustered microcalcifications. The computer aid provided an estimate of the likelihood that the microcalcifications were due to a malignancy. Black bars = without the computer aid, white bars = with the computer aid, and * = ideal situation of complete agreement in the correct recommendation.

 

TABLE 4. Effect of CAD on Agreement of Clinical Recommendations

 
κ values are shown in Table 5. The results were consistent among the three observer groups (all radiologists, attending radiologists, and residents). Although the residents’ κ values were smaller than those of the attending radiologists, the differences were not statistically significant (P > .05). In all three observer groups, use of the computer aid improved agreement from fair to moderate (on the ordinal scale where a κ value of 0.21–0.40 represents fair agreement beyond chance, and 0.41–0.60 represents moderate agreement beyond chance [19]). All improvements were statistically significant (P < .05).


TABLE 5. Effect of CAD on Agreement

 
Effect of Computer Aid on Substantial Disagreement in Recommendations
Interobserver agreement implicitly quantifies disagreement, but it does not distinguish between minor disagreements and completely incompatible diagnoses. Substantial disagreements represent contradictory diagnoses that can potentially cause greater confusion for the referring physicians and patients. Figure 3 shows the pairwise and per-patient frequencies of substantial disagreements. For recommendations made by attending radiologists without aid, the pairwise frequency of contradiction was 7%, and the per-patient frequency of contradiction was 23%. The frequencies were higher among residents: The pairwise frequency was 19%, and the per-patient frequency was 51%. Use of the computer aid reduced all occurrences of substantial disagreements. The reductions averaged 63% among attending radiologists and 28% among residents. The reduction was statistically significant for all cases combined and for cancers alone (P < .04 and χ² > 4.33, with one exception of P = .052 and χ² = 3.77 for residents and cancers alone; McNemar χ² test [the McNemar χ² test always has 1 degree of freedom]). The reduction was not statistically significant for benign cases alone (P > .08, χ² < 3.00; McNemar χ² test).
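The χ² values and P values quoted above are linked through the χ² distribution with 1 degree of freedom, whose upper tail has the closed form P = erfc(√(χ²/2)). The sketch below shows that conversion together with the textbook McNemar statistic computed from the two discordant counts. Whether the authors applied a continuity correction is not stated, so it is left as an option; the function names and example counts are illustrative.

```python
import math


def mcnemar_chi2(b, c, continuity=False):
    """McNemar statistic from the two discordant counts b and c
    (e.g. disagreement present only without the aid, and only with it)."""
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)
    return diff * diff / (b + c)


def chi2_pvalue_df1(chi2):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


p_exception = chi2_pvalue_df1(3.77)   # approximately .052, as reported
p_threshold = chi2_pvalue_df1(4.33)   # below .04, as reported
```

The same conversion gives a P value above .08 for χ² = 3.00, in line with the nonsignificant result reported for benign cases alone.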



Figure 3. Histograms show the effect of CAD on substantial disagreements in clinical recommendations (ie, biopsy vs routine screening). Data shown are pairwise (top) and per-patient (bottom) frequencies. Pairwise frequencies were calculated from all pairs of recommendations made by two different radiologists. Per-patient frequencies were calculated from the total number of cases in which the recommendations were made by multiple radiologists (n = 5 for attending radiologists, n = 5 for residents, and n = 10 for all readers). Black bars = without the computer aid, and white bars = with the computer aid.

 

    DISCUSSION
Interpretation of Results
To our knowledge, there is no single measure of agreement that can be universally used to quantify interpretation variability. Although κ is widely used as a quantitative measure of agreement, it is not without limitations when the findings of different studies are compared (20). More important, there is no explicit relationship between the κ statistic and ROC analysis; the latter is often used to quantify diagnostic accuracy. Therefore, we extended our calculations beyond determining the κ value to three separate but related analyses. First, we calculated the variability that is evident in the ROC summary indices. This analysis could serve as a direct link between the calculation of variability and the calculation of diagnostic accuracy by means of ROC analysis. Second, we calculated the κ value and the pattern of agreement. Third, we assessed variability from the points of view of the referring physician and the patient by using a calculation in the literature (1). Each of the three analyses addressed a different aspect of variability, and together they helped to define its magnitude and the ability of CAD to help reduce the variability.

Our analysis revealed that there is considerable variability in the interpretation of mammograms by radiologists; this finding is consistent with that of other studies (1–4). We found similar or poorer agreement between radiologists, compared with the results of Elmore et al (1). Elmore et al reported a per-patient substantial-disagreement frequency of 25%, which is similar to our result of 23% for attending radiologists. However, our κ values were generally lower; these values suggested poorer agreement. This result may have been caused by differences in the calculation of κ values (ie, averaging two-reader κ values [1] vs calculating multireader κ values); also, our study (8) did not include cases that were not evaluated at biopsy. To increase the statistical power of the study by enhancing the proportion of cases that were difficult to diagnose, only abnormal cases that had biopsy confirmation were used in our study (11). These difficult cases can be presumed to generate more variability in interpretation. We found a range of 35% in sensitivity and a range of 44% in specificity. These results agree with the ranges of 53% in sensitivity and 45% in specificity reported by Beam et al (2), who studied the results of 108 radiologists.

Two sources can potentially generate variability in the interpretation of mammograms. First, variations in diagnostic accuracy (ie, variations in radiologists’ abilities to correctly diagnose cancerous and cancer-free lesions) may be a primary source of variability. Az values vary as a result of this variation. Second, a radiologist’s selection of a decision threshold that defines a positive diagnosis in his or her interpretation can also produce variability (6). A decision threshold is necessary in all binary diagnostic tasks, and its selection is influenced by a radiologist’s perception of disease prevalence and the benefits and costs associated with correctly diagnosing the disease (14). Although selection of different thresholds causes sensitivity and specificity to vary simultaneously and in opposite directions, such variations are not caused by and do not represent variations in diagnostic accuracy (6). Selection of the different thresholds does not cause Az values to vary because an ROC curve, for which Az is a summary index, depicts all of the tradeoffs available as the threshold is varied. Therefore, selection of the decision threshold is an issue that is separate from the variation in diagnostic accuracy as quantified with Az values.

Our results (Fig 1) showed that the sensitivity and specificity data points were on or near the average ROC curves; these results indicated that much of the variation in sensitivity and specificity was caused by the use of different decision thresholds during interpretation and not by variations in diagnostic accuracy. This is consistent with the interpretation of D’Orsi and Swets (6) of the results of Elmore et al (1). As one might expect, the similarity in the ranges of sensitivity and specificity with and without use of a computer aid indicated that CAD had little influence on the radiologists’ choices of decision thresholds, since CAD is not expected to influence the radiologists’ perception of disease prevalence and the benefits and costs associated with correctly diagnosing the disease. The improvement in accuracy achieved with CAD is a result of the radiologists being able to improve their performance, as reflected with a different (higher) ROC curve. Moreover, compared with the without-aid data, the decreased dispersion of the with-aid sensitivity and specificity data points from the average ROC curve shows that CAD helped the radiologists to interpret mammograms with a more uniform, as well as higher, level of accuracy. Therefore, although CAD caused little change in the ranges of sensitivity and specificity (which in our study appear to have been determined largely by the radiologists’ choice of decision thresholds), our results showed that CAD helped the radiologists to reduce variation in their diagnostic accuracy.

We analyzed data for attending radiologists and residents both in aggregate and in two separate groups, and we found that the results were similar except in the frequencies of substantial disagreement (Fig 3). We believe the residents’ data are clinically relevant because the majority of recent residents and fellows who go into private practice are assigned to reading screening mammograms. Although one may interpret data from attending radiologists and residents differently, inclusion or exclusion of the residents’ data did not alter the findings of this study.

Comparison with Other Observer Study Data
We compared our findings with those of eight other investigations (5,21–27) of the effects of CAD on observer performance. By re-analyzing the results of these studies, we deduced general conclusions that are not limited to a particular computer aid or imaging task, as these were different in each of the studies; rather, our conclusions pertain to CAD in general. We used the accuracy indices (Az in all studies except one) and corresponding SDs that were reported in the original investigations as measures of diagnostic accuracy and variability. The results (Table 6) showed that accuracy was always higher and that its SD was always smaller when a computer aid was used; these results indicated that accuracy was consistently improved and that variability was consistently reduced when a computer aid was used. Although these studies were not specifically designed to measure the effect of a computer aid on reader variability, the clear trend of a reduction in reader variability in all of these nonuniformly designed studies indicates that the reduction is likely a consequence of, rather than a coincidental finding with, use of a computer aid.


TABLE 6. Effect of CAD on Observer Variability

 
The ability of CAD to improve diagnostic accuracy is conceptually similar to double reading by two radiologists (28), in which gains in accuracy are expected if two radiologists are able to complement each other (29). Several investigators (5,8,21–27) suggest that the clinical role of CAD might be to serve as a less expensive alternative to double reading by radiologists. However, to our knowledge, the ability of CAD to reduce variability has not been previously investigated, and we present the first evidence. We believe that the computer aid provides a reference point, much as reading with a skilled partner does. In clinical practice, the variability of the second human reader is one of the major problems that prohibits widespread use of this technique, despite its promised advantages. Because the computer aid is used independently of the radiologists’ interpretations of the mammograms, it can serve as a reference reader that is completely immune to human variability. This could be a unique advantage of CAD when it is compared with other approaches for reducing variability that depend on radiologists’ interpretations, which are subject to the inherent variation in human perception and decision making. CAD also eliminates or reduces the need for arbitration or reconciliation between differing opinions when two human readers disagree, because the course of action is ultimately determined by the radiologist using the added opinion of the computer output. The final clinical decision remains in the hands of a single human reader, and studies (28–31) have consistently shown that computer-assisted readers perform at a higher level, with improvement comparable to or exceeding that seen in traditional double-reader studies.

Impediments to the clinical use of CAD include the radiologic community’s underestimation of the extent of individual variability in daily practice and the effects that missing important low-prevalence events or overreacting to common benign conditions has on screening. Recent studies have focused attention on these problems and have created an appreciation of the need for more standardization of the observer’s role in the screening process.

In summary, a CAD joint reading could promote agreement and eliminate some of the extreme or erroneous diagnostic opinions. Both of these outcomes are highly desirable in the medical, social, and economic contexts of breast cancer screening in an asymptomatic population.

Two major conclusions can be drawn from our data and our findings from analysis of nine independent observer-performance studies (5,8,21–27): CAD can improve diagnostic performance, and CAD can simultaneously reduce interpretation variability. These beneficial effects are possible because CAD can help radiologists to avoid performing biopsy in benign lesions, while it increases, rather than decreases, the number of correct diagnoses of cancers. The second capability is a substantial enhancement to the known potential of CAD, which has been demonstrated in several studies. Our findings suggest that if CAD is incorporated into clinical radiology, improvements in both accuracy and consistency in image interpretation can be expected. Patients and referring physicians would agree that both of these goals are highly desirable. These goals support the intention of the Breast Imaging Reporting and Data System lexicon introduced by the American College of Radiology and the Mammography Quality Standards Act, that is, to improve the daily practice and results of breast cancer screening by fostering more uniform interpretations.


    ACKNOWLEDGMENTS
 
We thank Charles E. Metz, PhD, for reviewing the manuscript and for use of his LABROC4 program that generated the fitted ROC curves. We thank Carl J. Vyborny, MD, PhD, for his insightful comments.


    FOOTNOTES
 
2 Current address: Dept of Radiology, New York Univ Medical Ctr, NY.

3 Current address: Ctr for Statistical Sciences, Brown Univ, Providence, RI.

This work was performed as part of the International Digital Mammography Development Group. The contents of this article are solely the responsibility of the authors and do not necessarily represent the official views of any of the supporting organizations.

Abbreviations: Az = area under the ROC curve, CAD = computer-aided diagnosis, ROC = receiver operating characteristic

Author contributions: Guarantor of integrity of entire study, Y.J.; study concepts, Y.J., R.M.N., R.A.S., K.D.; study design, Y.J., R.M.N., R.A.S.; literature research, Y.J., A.Y.T.; experimental studies, Y.J.; data acquisition, Y.J.; data analysis/interpretation, Y.J., A.Y.T.; statistical analysis, A.Y.T.; manuscript preparation and editing, Y.J.; manuscript definition of intellectual content, revision/review, and final version approval, all authors.


    REFERENCES

  1. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists’ interpretations of mammograms. N Engl J Med 1994; 331:1493-1499.
  2. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists: findings from a national sample. Arch Intern Med 1996; 156:209-213.
  3. Schmidt RA, Newstead GM, Linver MN, et al. Mammographic screening sensitivity of general radiologists. In: Karssemeijer N, Thijssen M, Hendriks J, van Erning L, eds. Digital mammography. Dordrecht, the Netherlands: Kluwer Academic, 1998; 383-388.
  4. Kerlikowske K, Grady D, Barclay J, et al. Variability and accuracy in mammographic interpretation using the American College of Radiology Breast Imaging Reporting and Data System. J Natl Cancer Inst 1998; 90:1801-1809.
  5. Getty DJ, Pickett RM, D’Orsi CJ, Swets JA. Enhanced interpretation of diagnostic images. Invest Radiol 1988; 23:240-252.
  6. D’Orsi CJ, Swets JA. Variability in the interpretation of mammograms (letter). N Engl J Med 1995; 332:1172.
  7. Doi K, MacMahon H, Katsuragawa S, Nishikawa RM, Jiang Y. Computer-aided diagnosis in radiology: potential and pitfalls. Eur J Radiol 1999; 31:97-109.
  8. Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K. Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 1999; 6:22-33.
  9. Huo Z, Giger ML, Vyborny CJ, Wolverton DE, Schmidt RA, Doi K. Automated computerized classification of malignant and benign masses on digitized mammograms. Acad Radiol 1998; 5:155-168.
  10. Sickles EA. Breast calcifications: mammographic evaluation. Radiology 1986; 160:289-293.
  11. Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989; 24:234-245.
  12. Swets JA, Pickett RM. Evaluation of diagnostic systems: methods from signal detection theory. New York, NY: Academic Press, 1982.
  13. Jiang Y, Nishikawa RM, Wolverton DE, et al. Malignant and benign clustered microcalcifications: automated feature analysis and classification. Radiology 1996; 198:671-678.
  14. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21:720-733.
  15. Swets JA. Measuring the accuracy of diagnostic systems. Science 1988; 240:1285-1293.
  16. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: Wiley, 1981.
  17. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960; 20:37-46.
  18. Fleiss JL. Measuring nominal scale agreement among many raters. Psych Bull 1971; 76:378-382.
  19. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33:159-174.
  20. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol 1987; 126:161-169.
  21. Chan HP, Doi K, Vyborny CJ, et al. Improvement in radiologists’ detection of clustered microcalcifications on mammograms: the potential of computer-aided diagnosis. Invest Radiol 1990; 25:1102-1110.
  22. Kegelmeyer WP, Jr, Pruneda JM, Bourland PD, Hillis A, Riggs MW, Nipper ML. Computer-aided mammographic screening for spiculated lesions. Radiology 1994; 191:331-337.
  23. Chan HP, Sahiner B, Helvie MA, et al. Improvement of radiologists’ characterization of mammographic masses by using computer-aided diagnosis: an ROC study. Radiology 1999; 212:817-827.
  24. Kobayashi T, Xu XW, MacMahon H, Metz CE, Doi K. Effect of a computer-aided diagnosis scheme on radiologists’ performance in detection of lung nodules on radiographs. Radiology 1996; 199:843-848.
  25. Difazio MC, MacMahon H, Xu XW, et al. Digital chest radiography: effect of temporal subtraction images on detection accuracy. Radiology 1997; 202:447-452.
  26. Monnier-Cholley L, MacMahon H, Katsuragawa S, Morishita J, Ishida T, Doi K. Computer-aided diagnosis for detection of interstitial opacities on chest radiographs. AJR Am J Roentgenol 1998; 171:1651-1656.
  27. Ashizawa K, MacMahon H, Ishida T, et al. Effect of an artificial neural network on radiologists’ performance in the differential diagnosis of interstitial lung disease using chest radiographs. AJR Am J Roentgenol 1999; 172:1311-1315.
  28. Thurfjell EL, Lernevall KA, Taube AA. Benefit of independent double reading in a population-based mammography screening program. Radiology 1994; 191:241-244.
  29. Metz CE, Shen JH. Gains in accuracy from replicated readings of diagnostic images: prediction and assessment in terms of ROC analysis. Med Decis Making 1992; 12:60-75.
  30. Beam CA, Sullivan DC, Layde PM. Effect of human variability on independent double reading in screening mammography. Acad Radiol 1996; 3:891-897.
  31. Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Doi K. Comparison of independent double reading and computer-aided diagnosis (CAD) for the diagnosis of breast lesions (abstr). Radiology 1999; 213(P):323.




