Radiology
(Radiology. 2000;217:456-459.)
© RSNA, 2000


Thoracic Imaging

Measuring Performance in Chest Radiography¹

E. James Potchen, MD, Thomas G. Cooper, MSEE, Arlene E. Sierra, MPA, Gerald R. Aben, MD, Michael J. Potchen, MD, Matthew G. Potter, BS and James E. Siebert, MS

1 From the Department of Radiology, Michigan State University, 164 Radiology Bldg, East Lansing, MI 48824. Received July 6, 1999; revision requested August 9; final revision received February 17, 2000; accepted February 23. Address correspondence to E.J.P. (e-mail: ejp@rad.msu.edu).


    ABSTRACT
 
PURPOSE: To use a standardized set of chest radiographs to quantify interobserver differences and to provide a basis for comparing the diagnostic performance of physicians.

MATERIALS AND METHODS: A standardized set of 60 chest radiographs was presented to 162 study participants. Each participant reviewed the radiographs and recorded his or her diagnostic impression by using a fixed five-point scale. These response data were used to generate receiver operating characteristic curves and to establish performance benchmarks. The variations in performance were tested for statistical significance.

RESULTS: Significant interobserver variability was identified during these assessments. The composite group of board-certified radiologists demonstrated performance superior to that of the radiology residents and nonradiologist physicians.

CONCLUSION: By using a receiver operating characteristic approach and a standardized set of chest radiographs, observer accuracy and variability are easily quantified. This approach provides a basis for comparing the diagnostic performance of physicians. When value is measured as a diminution in uncertainty, board-certified radiologists contribute substantial value to the diagnostic imaging system.

Index terms: Diagnostic radiology, observer performance • Receiver operating characteristic (ROC) curve


    INTRODUCTION
 
Historically, radiologic quality control emphasized the technical components of the diagnostic imaging system. The primary purpose of a quality control system is to improve the quality of the final product; for a diagnostic imaging system, that product is information. Quality control in radiology or in any other system of production requires a reduction in variability and an improvement in diagnostic accuracy. It stands to reason that a principal determinant of variability and accuracy in chest radiography is not the technical component but the performance of the physician (observer) interpreting the image. Any means used to decrease observer variability while maintaining or improving diagnostic accuracy will enhance the quality of chest radiography.

Quantifying the diagnostic performance of radiologists and assessing interobserver variability are necessary steps toward quality improvement. The receiver operating characteristic (ROC) curve offers one approach to evaluating the interpretive performance of physicians (1–5). ROC curves graphically depict the probability of a true-positive interpretation as a function of the probability of a false-positive interpretation. The trade-off between true- and false-positive findings in part represents the choices made by the observer at the threshold of uncertainty.

Historically, the area under this curve has been used to assess the diagnostic accuracy of a test (6,7) or to evaluate the marginal discrimination capacity of alternative imaging modalities (8–12). The first application involves the analysis of interobserver variability, while the second requires a paired analysis of intraobserver interpretations. The purpose of our study was to use a standardized set of chest radiographs to quantify interobserver differences and to provide a basis for comparing the diagnostic performance of physicians.


    MATERIALS AND METHODS
 
Study Participants
The participants for this study were recruited from seven private practice groups and six academic medical groups, 12 of them in North America and one in Great Britain. Radiology directors at each site were responsible for selecting the participants in this evaluation. In general, the participants were board-certified radiologists who were assigned to the practice group. In a limited number of instances, the radiology director also included residents and nonradiologist physicians.

The entire study included 111 board-certified radiologists, 29 radiology residents, and 22 nonradiologist physicians recruited from 10 geographically diverse sites. Mean radiologic experience of the board-certified radiologists was 17.8 years, with a range of 4–38 years. Mean training length of the radiology residents was 2.4 years, with a range of 1–4 years. Nonradiologists were physicians from the medical specialties of family practice (n = 6), emergency medicine (n = 3), radiation oncology (n = 6), pulmonology (n = 4), and internal medicine (n = 3). All participants provided informed written consent prior to enrollment in this study.

Image Test Set
A set of 60 posteroanterior screening chest radiographs was developed from various radiologic archives. Thirty of these chest radiographs contained one confirmed clinically important finding. A clinically important finding required an in-depth review of the patient’s medical records and independent confirmation by means of an alternative diagnostic test (eg, biopsy). The clinically important finding was selected to reflect the range of diagnostic subtleties encountered in a standard radiologic clinical practice. In general, this subset consisted of examples of infiltrates, pneumothoraces, cardiac abnormalities, metastases, and other masses.

The remaining 30 normal chest radiographs had been obtained in asymptomatic patients during a required annual physical examination. These chest radiographs were originally interpreted as normal. In addition, these same patients underwent a subsequent annual physical examination, including acquisition of a chest radiograph that was also interpreted as normal. The 30 patients remained asymptomatic for 2 years following the original chest radiograph. Figure 1 demonstrates examples of a clinically important finding and a normal chest radiograph used in this test set.



Figure 1. Example posteroanterior chest radiographs from the standardized set demonstrate (a) normal findings and (b) a right middle lobe carcinoma (arrow).

 
Only reproductions of the original radiographs were used for this study. The original radiographs were digitized on a Lumiscan 150 scanner (Lumisys, Sunnyvale, Calif) and stored in a dedicated Digital Imaging and Communications in Medicine, or DICOM, image database. Specially developed in-house software (FOLIOSHOP) was used to carefully select window and level settings for each image. Each image was printed on a laser imager (Dryview 8700; 3M, St Paul, Minn). This process permitted the images to be accurately reproduced for use if the original radiographs became worn or excessively soiled.

Study Design
Each study participant (observer) reviewed the set of 60 chest radiographs in a single viewing session. The observer recorded his or her diagnostic impression of each radiograph by using a five-point fixed scale that reflected the observer’s confidence in image interpretation (Table 1). The important characteristic of this scale was that higher numbers reflected an increased level of observer confidence that a radiograph contained abnormal findings. The observer also received instructions concerning the software interface and the recording of data. There was no time limit to complete the assessment.
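As a rough illustration of how five-point ratings become an ROC curve, the sketch below sweeps a cutoff across the scale and records the false-positive and true-positive fractions at each cutoff. It is not the maximum-likelihood fit used in this study: the trapezoidal area is a simpler, swapped-in estimate, and the function names and the ten example ratings are hypothetical.

```python
from typing import List, Tuple


def roc_points(ratings: List[int], truth: List[bool]) -> List[Tuple[float, float]]:
    """Return (FPF, TPF) pairs, one per cutoff on a 1-5 confidence scale.

    A radiograph counts as a positive call when its rating is >= cutoff.
    """
    n_abnormal = sum(truth)
    n_normal = len(truth) - n_abnormal
    points = [(0.0, 0.0)]
    for cutoff in range(5, 0, -1):  # strictest cutoff first
        tp = sum(1 for r, t in zip(ratings, truth) if t and r >= cutoff)
        fp = sum(1 for r, t in zip(ratings, truth) if not t and r >= cutoff)
        points.append((fp / n_normal, tp / n_abnormal))
    return points


def trapezoid_area(points: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear curve through the operating points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical observer: five abnormal radiographs followed by five normal ones.
ratings = [5, 4, 4, 3, 2, 1, 1, 2, 3, 4]
truth = [True] * 5 + [False] * 5
points = roc_points(ratings, truth)
area = trapezoid_area(points)
```

Because higher ratings mean greater confidence in abnormality, lowering the cutoff can only add positive calls, so the points move monotonically from (0, 0) to (1, 1).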



 
TABLE 1. Five-Point Diagnostic Impression Scale Used to Rate Chest Radiographs
 
On completion of the viewing session, the data files were downloaded from laptop computers to a specially designed SQL database (Sybase, Emeryville, Calif) on a Unix workstation for data analysis. Although observers remained anonymous, each was given feedback on his or her performance compared with that of the practice group and all other participants in the study. This feedback was presented in a table expressing the individual’s performance (in terms of SDs from the mean, or z score) for diagnostic accuracy (area under the ROC curve) and for false-positive, false-negative, and ambiguity rates (the proportion of radiographs rated 3 on the five-point scale) (Table 1).
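The two feedback quantities described above can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not say whether a sample or population SD was used (the sample SD is assumed here), the function names are hypothetical, and the example numbers are invented.

```python
from statistics import mean, stdev
from typing import List


def z_score(value: float, reference: List[float]) -> float:
    """Number of sample SDs by which value differs from the reference mean."""
    return (value - mean(reference)) / stdev(reference)


def ambiguity_rate(ratings: List[int]) -> float:
    """Fraction of radiographs given the middle rating (3) on the 1-5 scale."""
    return sum(1 for r in ratings if r == 3) / len(ratings)


# Hypothetical areas under the ROC curve for a five-member practice group.
group_auc = [0.80, 0.85, 0.90, 0.95, 1.00]
z = z_score(0.95, group_auc)  # one observer's area relative to the group

# Hypothetical ratings from one observer on ten radiographs.
amb = ambiguity_rate([1, 3, 3, 5, 2, 4, 3, 1, 5, 2])
```

The same z score calculation would apply to the false-positive, false-negative, and ambiguity rates, with the reference list holding the corresponding rates for the comparison group.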

Numeric Analysis
An ROC curve was generated, and the area under this curve was calculated for each observer by using the maximum-likelihood parameter estimation technique of Dorfman and Alf (13,14). The primary assumption of this approach was that the probability function describing a particular radiologist’s confidence in a positive diagnosis based on an interpretation of a chest radiograph was described by two overlapping normal (binormal) distributions. These distributions are presumed to have independent means (mean1 and mean2) and variances (variance1 and variance2). The intrinsic diagnostic accuracy of a radiologist is determined by the variances of these distributions and their separation (difference between means). In addition, each radiologist exhibits a positivity threshold above which the radiologist will interpret a radiograph as containing abnormal findings. This positivity threshold may be modified to adjust the balance of false-positive and false-negative interpretations, but it will not affect the radiologist’s ability to distinguish between normal and abnormal radiographs. Figure 2 is a graphic depiction of the binormal distribution with an arbitrary positivity threshold.
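The role of the positivity threshold can be illustrated with a short sketch. It assumes that distribution 1 describes normal radiographs and distribution 2 abnormal ones, and the parameter values are purely illustrative rather than fitted to any study data; the function name is hypothetical.

```python
from statistics import NormalDist
from typing import Tuple


def operating_point(threshold: float, mean1: float, sd1: float,
                    mean2: float, sd2: float) -> Tuple[float, float]:
    """Return (FPF, TPF) produced by one positivity threshold.

    Assumption: distribution 1 is for normal radiographs and
    distribution 2 is for abnormal radiographs.
    """
    fpf = 1.0 - NormalDist(mean1, sd1).cdf(threshold)  # normals called positive
    tpf = 1.0 - NormalDist(mean2, sd2).cdf(threshold)  # abnormals called positive
    return fpf, tpf


# Illustrative parameters only: unit variances, means two SDs apart.
lenient = operating_point(0.5, 0.0, 1.0, 2.0, 1.0)
strict = operating_point(1.5, 0.0, 1.0, 2.0, 1.0)
```

Raising the threshold lowers both fractions, which moves the observer along the same ROC curve. The curve itself, and therefore the area under it, depends only on the two distributions.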



 
Figure 2. Graph depicts the binormal probability distribution underlying ROC computation. Observers were able to modify the decision threshold without affecting the ROC curve or diagnostic accuracy. µ = mean, σ = standard deviation.

 
The maximum-likelihood technique is used to map the binormal distributions into a linear normal-deviate space and to determine a best-fit line. The following two parameters characterize this line in the transform space: a slope, which is the ratio of the standard deviations (SD1/SD2), and an intercept, which is the difference in the means of the distributions divided by the standard deviation of the positive radiograph subset ([mean2 - mean1]/SD2). In general, as the intercept of the line increases, the computed diagnostic accuracy increases, which reflects a larger separation between the two normal distributions and a greater capacity to discriminate normal from abnormal radiographs.
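To make the link between the fitted line and diagnostic accuracy concrete, the sketch below computes the slope and intercept as defined above and then applies the standard closed form for the area under a binormal ROC curve, Φ(intercept / √(1 + slope²)), where Φ is the standard normal cumulative distribution function. The text does not state how the area was derived from the fitted parameters, so treating this closed form as the computation is an assumption; the function names and parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def binormal_line(mean1: float, sd1: float,
                  mean2: float, sd2: float) -> Tuple[float, float]:
    """Slope and intercept of the ROC line in normal-deviate space.

    Distribution 2 is the positive (abnormal) radiograph subset.
    """
    slope = sd1 / sd2
    intercept = (mean2 - mean1) / sd2
    return slope, intercept


def binormal_auc(slope: float, intercept: float) -> float:
    """Area under the binormal ROC curve (standard closed form)."""
    return NormalDist().cdf(intercept / sqrt(1.0 + slope ** 2))


# Illustrative parameters only: unit variances, means two SDs apart.
slope, intercept = binormal_line(0.0, 1.0, 2.0, 1.0)
auc = binormal_auc(slope, intercept)
```

With the slope held fixed, a larger intercept gives a larger area, which matches the statement that accuracy rises with the separation between the two distributions. An intercept of zero gives an area of 0.5, the value expected from chance alone.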

Five groups were established for the participants in this study: top 20 board-certified radiologists, bottom 20 board-certified radiologists (whose performance served as the threshold performance by radiologists compared with that of other physicians), all board-certified radiologists, radiology residents, and other physicians. The groups of top 20 board-certified radiologists and bottom 20 board-certified radiologists were segmented on the basis of the area under their respective ROC curves. Composite ROC curves and the area under the ROC curve were computed for each participant group. Differences in the area under the ROC curve were tested for statistical significance.


    RESULTS
 
Figure 3 illustrates the ROC curves for each of the five groups of study participants. The top 20 radiologists were selected from the composite group of radiologist participants on the basis of superior performance; this group exhibited the best ROC curve, which represents a benchmark of quality that radiologists can attain. Among the remaining four groups, the composite group of all board-certified radiologists had the highest performance. The group of radiology residents outperformed both the group of bottom 20 board-certified radiologists and the group of nonradiologist physicians. Nonradiologist physicians exhibited the poorest performance.



 
Figure 3. Graph depicts the composite ROC curves of five participant groups. Top and bottom 20 radiologists were categorized on the basis of the area under the ROC curve.

 
These ROC curves graphically depict decision performance trade-offs for various levels of observer confidence but do not by themselves establish statistically significant variations in performance. Variations were demonstrated by using a statistical analysis of the area under each curve; this technique was more reflective of the diagnostic accuracy of each group. These data are summarized in Table 2, which presents the five groups of observers, the number of observers in each group, and the area under each group’s ROC curve (mean ± SD). Again, these data demonstrated the performance rankings among the five groups included in this observer performance assessment. The differences between groups were all statistically significant (P < .001).



 
TABLE 2. Computed Areas under the ROC Curves (Diagnostic Accuracy) for Five Groups of Participants
 
As stated previously, the underlying assumption of the numeric analysis was that two overlapping normal distributions characterize a radiologist’s ability to discriminate between positive and negative radiographs. We also evaluated the implied variance of these distributions to characterize the consistency of positive and negative interpretations. The top 20 board-certified radiologists demonstrated significantly less variability in the interpretation of normal radiographs than did any of the remaining groups. This result suggests that one underlying factor contributing to the superior performance of these radiologists may have been their ability to identify normality more consistently than did their peers (confidence parameter). It is impossible, however, to make absolute comparisons between the groups since the variance in the distributions of normal and abnormal radiographs is not known explicitly.

The top 20 board-certified radiologists also had a significantly larger separation between the implied means of the two normal distributions (relative to the variance of the normal radiographs) than did the other participants (discrimination parameter). Table 3 summarizes the results for the confidence and discrimination parameters (mean ± SD).



 
TABLE 3. Computed Confidence and Discrimination Parameters for Study Participants
 

    DISCUSSION
 
In a clinical milieu, numerous factors contribute to interobserver variability and to the diagnostic accuracy of physicians who interpret radiographs. Some of these factors include individual medical training, the image-viewing environment, and access to previous radiographs and/or pertinent clinical information. To assess the potential value that a physician adds to the diagnostic imaging process, it is reasonable to assume that an evaluation must be unbiased and must isolate the native performance of the physician from these confounding factors. The presentation of a standardized set of chest radiographs viewed in a consistent environment affords a uniform basis for such an evaluation. The use of the ROC curve provides an unbiased approach to evaluate the accuracy of the physician interpreting this set of chest radiographs. This approach allows each observer to compare his or her inherent performance with that of others who observed the same set of images.

The data presented in this study clearly demonstrate substantial variability in the performance of radiologists interpreting a standardized set of chest radiographs. Despite this wide range of diagnostic performance, two key elements emerge that characterize the top-performing radiologists. First, these highest-performing individuals demonstrated less variability in the interpretation of normal radiographs relative to abnormal radiographs than did their counterparts. This finding suggests that an important component of self-improvement is studying and understanding the range of normalcy in chest radiography. Second, the top-performing radiologists were more confident in their interpretations than were their peers. Both of these parameters interplay to produce ROC curves that reflect high diagnostic accuracy.

With respect to all study participants, board-certified radiologists as a group demonstrated a higher level of diagnostic accuracy than did either radiology residents or nonradiologist physicians. This finding clearly demonstrates the value that radiologists add to the diagnostic imaging system. Presumably, the improved performance of board-certified radiologists relative to that of radiology residents is due to increased education, training, and experience. Perhaps repeated assessments during residency could be used to provide valuable feedback to radiology residents and could be used as a means to quantify improvements in performance during medical training.

Like most professionals, physicians earn their living by making decisions under conditions of uncertainty. All decisions made under these conditions have error rates. At the threshold of uncertainty, an individual can err on the side of making false-positive (risk-averse individual) or false-negative (risk-taking individual) decisions. While the risk preference of an individual may influence his or her personal threshold for rendering positive findings, it does not influence his or her performance in distinguishing normal chest radiographs from abnormal chest radiographs (15).

The assessment of multiple abnormalities presented in this study merely documents the performance of the physician interpreting the chest radiographs. It is still unclear if the feedback afforded by this assessment will lead to improved individual diagnostic performance. However, a set of standardized radiographs viewed in a well-controlled setting can be used to distinguish variation in the performance of individual observers and groups of observers. The measurement and documentation of diagnostic performance is a necessary step to quality improvement.


    FOOTNOTES
 
Abbreviation: ROC = receiver operating characteristic

Author contributions: Guarantors of integrity of entire study, E.J.P., A.E.S.; study concepts and design, all authors; definition of intellectual content, E.J.P.; literature research, E.J.P., T.G.C.; clinical studies, E.J.P., A.E.S., G.R.A., M.J.P.; data acquisition, T.G.C., A.E.S., G.R.A., M.J.P., M.G.P., J.E.S.; data analysis, E.J.P., T.G.C.; statistical analysis, T.G.C.; manuscript preparation and editing, E.J.P., T.G.C.; manuscript review, all authors.


    REFERENCES
 

  1. Kundel HL. Medical image perception. Acad Radiol 1995; 2(suppl 2):S108-S110.
  2. Hanley JA, McNeil BJ. The meaning and use of the area under the ROC curve. Radiology 1982; 143:29-36.
  3. Kundel HL. Images, image quality and observer performance: New Horizons in Radiology Lecture. Radiology 1979; 132:265-271.
  4. Berbaum KS, Dorfman DD, Franken EA. Measuring observer performance by ROC analysis: indications and complications. Invest Radiol 1989; 24:228-233.
  5. Slasky BS, Gur D, Good WF, et al. Receiver operating characteristic analysis of chest image interpretation with conventional, laser-printed, and high-resolution workstation images. Radiology 1990; 174:775-780.
  6. Winer-Muram HT, Arheart KL, Jennings SG, Rubin SA, Kauffman WM, Slobod KS. Pulmonary complications in children with hematologic malignancies: accuracy of diagnosis with chest radiography and CT. Radiology 1997; 204:643-649.
  7. Lee MG, Baker ME, Sostman HD, et al. The diagnostic accuracy/efficacy of MRI in differentiating hepatic hemangiomas from metastatic colorectal/breast carcinoma: a multiple reader ROC analysis using a jackknife technique. J Comput Assist Tomogr 1996; 20:905-913.
  8. Elam EA, Rehm K, Hillman B, Maloney K, Fajardo LL, McNeill K. Efficacy of digital radiography for the detection of pneumothorax: comparison with conventional chest radiography. AJR Am J Roentgenol 1992; 158:509-514.
  9. Scott WW Jr, Rosenbaum JE, Ackerman SJ, et al. Subtle orthopedic fractures: teleradiology workstation versus film interpretation. Radiology 1993; 187:811-815.
  10. Chan H, Vyborny CJ, Macmahon H, Metz CE, Doi K, Sickles EA. Digital mammography ROC studies of the effects of pixel size and unsharp-mask filtering on the detection of subtle microcalcifications. Invest Radiol 1987; 22:581-589.
  11. Lams OM, Cocklin ML. Spatial resolution requirements for digital chest radiographs: an ROC study of observer performance in selected cases. Radiology 1986; 158:11-19.
  12. Razavi M, Sayre JW, Taira RK, et al. Receiver-operating-characteristic study of chest radiographs in children: digital hard-copy film vs 2K x 2K soft-copy images. AJR Am J Roentgenol 1992; 158:443-448.
  13. Dorfman DD, Alf E Jr. Maximum likelihood estimation of parameters of signal detection theory: a direct solution. Psychometrika 1968; 33:117-124.
  14. Dorfman DD, Alf E Jr. Maximum-likelihood estimation of parameters of signal detection theory and determination of confidence intervals: rating-method data. J Math Psych 1969; 6:487-496.
  15. Metz CE. ROC methodology in radiological imaging. Invest Radiol 1986; 21:234-245.



