Thoracic Imaging
1 From the Department of Radiology, Michigan State University, 164 Radiology Bldg, East Lansing, MI 48824. Received July 6, 1999; revision requested August 9; final revision received February 17, 2000; accepted February 23. Address correspondence to E.J.P. (e-mail: ejp@rad.msu.edu).
| ABSTRACT |
|---|
MATERIALS AND METHODS: A standardized set of 60 chest radiographs was presented to 162 study participants. Each participant reviewed the radiographs and recorded his or her diagnostic impression by using a fixed five-point scale. These response data were used to generate receiver operating characteristic curves and to establish performance benchmarks. The variations in performance were tested for statistical significance.
RESULTS: Significant interobserver variability was identified during these assessments. The composite group of board-certified radiologists demonstrated performance superior to that of the radiology residents and nonradiologist physicians.
CONCLUSION: By using a receiver operating characteristic approach and a standardized set of chest radiographs, observer accuracy and variability are easily quantified. This approach provides a basis for comparing the diagnostic performance of physicians. When value is measured as a diminution in uncertainty, board-certified radiologists contribute substantial value to the diagnostic imaging system.
Index terms: Diagnostic radiology, observer performance; Receiver operating characteristic (ROC) curve
| INTRODUCTION |
|---|
Quantifying the diagnostic performance of radiologists and assessing interobserver variability are necessary steps toward quality improvement. The receiver operating characteristic (ROC) is used in one approach to evaluate the interpretive performance of physicians (1–5). ROC curves graphically depict the probability of a true-positive interpretation as a function of the probability of a false-positive interpretation. The trade-off between true- and false-positive findings in part represents the choices made by the observer at the threshold of uncertainty.
Historically, the area under this curve has been used to assess the diagnostic accuracy of a test (6,7) or to evaluate the marginal discrimination capacity of alternative imaging modalities (8–12). The first application involves the analysis of interobserver variability, while the second requires a paired analysis of intraobserver interpretations. The purpose of our study was to use a standardized set of chest radiographs to quantify interobserver differences and to provide a basis for comparing the diagnostic performance of physicians.
| MATERIALS AND METHODS |
|---|
The entire study included 111 board-certified radiologists, 29 radiology residents, and 22 nonradiologist physicians recruited from 10 geographically diverse sites. Mean radiologic experience of the board-certified radiologists was 17.8 years, with a range of 4–38 years. Mean training length of the radiology residents was 2.4 years, with a range of 1–4 years. Nonradiologists were physicians from the medical specialties of family practice (n = 6), emergency medicine (n = 3), radiation oncology (n = 6), pulmonology (n = 4), and internal medicine (n = 3). All participants provided informed written consent prior to enrollment in this study.
Image Test Set
A set of 60 posteroanterior screening chest radiographs was developed from various radiologic archives. Thirty of these chest radiographs contained one confirmed clinically important finding. A clinically important finding required an in-depth review of the patient's medical records and independent confirmation by means of an alternative diagnostic test (eg, biopsy). The clinically important findings were selected to reflect the range of diagnostic subtleties encountered in a standard radiologic clinical practice. In general, this subset consisted of examples of infiltrates, pneumothoraces, cardiac abnormalities, metastases, and other masses.
The remaining 30 normal chest radiographs had been obtained in asymptomatic patients during a required annual physical examination. These chest radiographs were originally interpreted as normal. In addition, these same patients underwent a subsequent annual physical examination, including acquisition of a chest radiograph that was also interpreted as normal. The 30 patients remained asymptomatic for 2 years following the original chest radiograph. Figure 1 demonstrates examples of a clinically important finding and a normal chest radiograph used in this test set.
Study Design
Each study participant (observer) reviewed the set of 60 chest radiographs in a single viewing session. The observer recorded his or her diagnostic impression of each radiograph by using a five-point fixed scale that reflected the observer's confidence in image interpretation (Table 1). The important characteristic of this scale was that higher numbers reflected an increased level of observer confidence that a radiograph contained abnormal findings. The observer also received instructions concerning the software interface and recording of data. There was no time limit to complete the assessment.
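To illustrate how ratings on such a scale translate into points on an ROC curve, the following minimal Python sketch treats each rating level as a candidate cutoff and counts true- and false-positive interpretations at that cutoff. It is an illustration only: the integer coding 1 to 5, the function names, and the ten example ratings are assumptions, and the trapezoidal area used here is a simpler stand-in for the maximum-likelihood fit applied in this study.

```python
def roc_points(ratings, truth):
    """Return (FPR, TPR) pairs, one per rating cutoff, from 5-point ratings.

    ratings: ints 1..5, higher = more confident the radiograph is abnormal
             (assumed coding; the wording of the scale is not reproduced here).
    truth:   bools, True if the radiograph contains a confirmed finding.
    """
    n_abnormal = sum(truth)
    n_normal = len(truth) - n_abnormal
    points = [(0.0, 0.0)]
    # Lower the cutoff from 5 to 1: "positive" means rating >= cutoff.
    for cutoff in range(5, 0, -1):
        tp = sum(1 for r, t in zip(ratings, truth) if t and r >= cutoff)
        fp = sum(1 for r, t in zip(ratings, truth) if not t and r >= cutoff)
        points.append((fp / n_normal, tp / n_abnormal))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical observer: 5 normal and 5 abnormal radiographs (not study data).
truth = [False] * 5 + [True] * 5
ratings = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
points = roc_points(ratings, truth)
auc = trapezoid_auc(points)
```

With only five rating levels, an observer contributes at most five operating points besides the origin, which is one reason a fitted smooth curve is often preferred to the empirical one.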
Numeric Analysis
An ROC curve was generated, and the area under this curve was calculated for each observer by using the maximum-likelihood parameter estimation technique of Dorfman and Alf (13,14). The primary assumption of this approach was that a particular radiologist's confidence in a positive diagnosis, based on an interpretation of a chest radiograph, could be represented by two overlapping normal (binormal) distributions. These distributions are presumed to have independent means (mean1 and mean2) and variances (variance1 and variance2). The intrinsic diagnostic accuracy of a radiologist is determined by the variances of these distributions and their separation (difference between means). In addition, each radiologist exhibits a positivity threshold above which the radiologist will interpret a radiograph as containing abnormal findings. This positivity threshold may be modified to adjust false-positive and false-negative interpretations, but it will not affect the radiologist's ability to distinguish between normal and abnormal radiographs. Figure 2 is a graphic depiction of the binormal distribution with an arbitrary positivity threshold.
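The binormal assumption can be made concrete with a short Python sketch. It assumes that mean1 and variance1 describe the confidence distribution for normal radiographs and mean2 and variance2 the distribution for abnormal radiographs (the text does not say which is which), and it uses made-up parameter values rather than anything fitted to the study data. It shows the closed-form area under a binormal ROC curve and how moving the positivity threshold changes the operating point without changing that area. It does not reproduce the Dorfman and Alf estimation procedure itself.

```python
from math import erf, sqrt


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def binormal_auc(mean1, var1, mean2, var2):
    """Area under the binormal ROC curve.

    Equals the probability that a random abnormal case (mean2, var2)
    receives a higher confidence value than a random normal case
    (mean1, var1). The positivity threshold does not appear in it.
    """
    return phi((mean2 - mean1) / sqrt(var1 + var2))


def operating_point(threshold, mean1, var1, mean2, var2):
    """(false-positive fraction, true-positive fraction) at a threshold."""
    fpf = 1.0 - phi((threshold - mean1) / sqrt(var1))
    tpf = 1.0 - phi((threshold - mean2) / sqrt(var2))
    return fpf, tpf


# Hypothetical parameters chosen only for illustration (not study results).
params = dict(mean1=0.0, var1=1.0, mean2=2.0, var2=1.5)
auc = binormal_auc(**params)
low_threshold = operating_point(0.5, **params)   # more false positives
high_threshold = operating_point(1.5, **params)  # more false negatives
```

Both operating points lie on the same curve: raising the threshold lowers the false-positive and true-positive fractions together, while the area depends only on the means and variances.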
Five groups were established for the participants in this study: top 20 board-certified radiologists, bottom 20 board-certified radiologists (whose performance served as the threshold performance by radiologists compared with that of other physicians), all board-certified radiologists, radiology residents, and other physicians. The groups of top 20 board-certified radiologists and bottom 20 board-certified radiologists were segmented on the basis of the area under their respective ROC curves. Composite ROC curves and the area under the ROC curve were computed for each participant group. Differences in the area under the ROC curve were tested for statistical significance.
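The segmentation into top and bottom groups amounts to ranking observers by their individual areas under the ROC curve. A minimal Python sketch of that ranking step follows; the observer identifiers, the area values, and the group size of 2 are hypothetical, and the mean area is shown only as a simple descriptive summary, not as the composite-curve computation or the significance test used in this study.

```python
from statistics import mean


def split_by_auc(auc_by_observer, k=20):
    """Rank observers by AUC and return (top_k, bottom_k) lists of ids."""
    ranked = sorted(auc_by_observer, key=auc_by_observer.get, reverse=True)
    if len(ranked) < 2 * k:
        raise ValueError("need at least 2*k observers for disjoint groups")
    return ranked[:k], ranked[-k:]


# Hypothetical per-observer areas (not study data).
aucs = {"r1": 0.91, "r2": 0.78, "r3": 0.85,
        "r4": 0.95, "r5": 0.70, "r6": 0.88}
top, bottom = split_by_auc(aucs, k=2)
top_mean = mean(aucs[o] for o in top)
bottom_mean = mean(aucs[o] for o in bottom)
```

Because the groups are selected on the same quantity that is later compared, a difference between the top and bottom groups is expected by construction; the informative comparisons are those against the independently defined groups of residents and nonradiologist physicians.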
| RESULTS |
|---|
The top 20 board-certified radiologists also had a significantly larger separation between the implied means of the two normal distributions (relative to the variance of the normal radiographs) than did the other participants (discrimination parameter). Table 3 summarizes the results for the confidence and discrimination parameters (mean ± SD).
| DISCUSSION |
|---|
The data presented in this study clearly demonstrate substantial variability in the performance of radiologists interpreting a standardized set of chest radiographs. Despite this wide range of diagnostic performance, two key elements emerge that characterize the top-performing radiologists. First, these highest-performing individuals demonstrated less variability in the interpretation of normal radiographs relative to abnormal radiographs than did their counterparts. This finding suggests that an important component of self-improvement is studying and understanding the range of normalcy in chest radiography. Second, the top-performing radiologists were more confident in their interpretations than were their peers. Both of these parameters interplay to produce ROC curves that reflect high diagnostic accuracy.
With respect to all study participants, board-certified radiologists as a group demonstrated a higher level of diagnostic accuracy than did either radiology residents or nonradiologist physicians. This finding clearly demonstrates the value that radiologists add to the diagnostic imaging system. Presumably, the improved performance of board-certified radiologists relative to that of radiology residents is due to increased education, training, and experience. Perhaps repeated assessments during residency could be used to provide valuable feedback to radiology residents and could be used as a means to quantify improvements in performance during medical training.
Like most professionals, physicians earn their living by making decisions under conditions of uncertainty. All decisions made under these conditions have error rates. At the threshold of uncertainty, an individual can err on the side of making false-positive (risk-averse individual) or false-negative (risk-taking individual) decisions. While the risk preference of an individual may influence his or her personal threshold for rendering positive findings, it does not influence his or her performance in distinguishing normal chest radiographs from abnormal chest radiographs (15).
The assessment of multiple abnormalities presented in this study merely documents the performance of the physician interpreting the chest radiographs. It is still unclear if the feedback afforded by this assessment will lead to improved individual diagnostic performance. However, a set of standardized radiographs viewed in a well-controlled setting can be used to distinguish variation in the performance of individual observers and groups of observers. The measurement and documentation of diagnostic performance is a necessary step to quality improvement.
| FOOTNOTES |
|---|
Author contributions: Guarantors of integrity of entire study, E.J.P., A.E.S.; study concepts and design, all authors; definition of intellectual content, E.J.P.; literature research, E.J.P., T.G.C.; clinical studies, E.J.P., A.E.S., G.R.A., M.J.P.; data acquisition, T.G.C., A.E.S., G.R.A., M.J.P., M.G.P., J.E.S.; data analysis, E.J.P., T.G.C.; statistical analysis, T.G.C.; manuscript preparation and editing, E.J.P., T.G.C.; manuscript review, all authors.