Thoracic Imaging |
1 From the Departments of Radiology (L.M.C., J.M.T., L.A.) and Public Health (F.C.), Hôpital Saint-Antoine, 184 rue du Faubourg Saint-Antoine, 75012 Paris, France; and Department of Anesthesiology, Hôpital Lariboisière, Paris, France (B.P.C.). Received September 15, 2003; revision requested November 11; final revision received April 29, 2004; accepted May 26. Address correspondence to L.M.C. (e-mail: laurence.monnier-cholley@sat.ap-hop-paris.fr).
| ABSTRACT |
|---|
MATERIALS AND METHODS: The study was approved by the institutional review board, and informed consent was not required or obtained for review of radiographs. A set of 60 posteroanterior chest radiographs was presented to 36 observers: 12 radiologists, 12 pulmonologists, and 12 anesthesiologists. Each of these three observer categories included six residents and six staff. Thirty of the radiographs each depicted one lung cancer that was overlooked at prospective image interpretation; the other 30 were normal radiographs matched for age and smoking history. Observers were asked to rate their degree of suspicion concerning the presence of lung cancer by using a visual analog scale and to point out the zone of suspicion on a schematic of the lung. These data were used to generate combined ROC–localization ROC curves and to assess performance. Intraobserver consistency was evaluated by using intraclass correlation coefficients and weighted κ statistics.
RESULTS: Areas under the ROC curves indicated better performance for radiologists and pulmonologists compared with anesthesiologists (P < .002) and for staff compared with residents (P < .022). Performance was lower for all categories of observers when localization ROC curves were used. Radiologists and staff pulmonologists showed a higher degree of confidence in the assessment of normality than did other categories of physicians. Intraobserver consistency was poor.
CONCLUSION: Experienced readers showed better ability to distinguish normality from abnormality. Combined ROC and localization ROC analyses gave a more reliable quantification of observer performance than did ROC analysis alone.
© RSNA, 2004
Index terms: Cancer screening, 68.11; Diagnostic radiology, observer performance; Images, interpretation; Lung neoplasms, 60.321; Receiver operating characteristic (ROC) curve
| INTRODUCTION |
|---|
ROC analysis alone, however, has a number of limitations that may result in gross overestimation of performance. Indeed, ROC analysis takes into consideration only the relative likelihood that a given image is abnormal, and not the location of a suspicious finding. Thus, a reader might appropriately classify a radiograph as abnormal but, when asked to show the lesion, might point out a normal feature while overlooking the true abnormality. For this reason, it has been suggested that the location of suspicious lesions should be taken into account to obtain a better estimation of performance accuracy (7). To our knowledge, however, localization ROC analysis has not previously been applied to performance evaluation in the radiographic detection of lung cancer.
The purpose of our study was to compare and quantify, by means of ROC and localization ROC analyses, the performance of radiologists, pulmonologists, and anesthesiologists (residents and staff) in the detection of missed lung cancer.
| MATERIALS AND METHODS |
|---|
Image Test Set
A set of 60 posteroanterior chest radiographs was used in the study. Thirty of these radiographs each showed one lung cancer that was missed at initial (prospective) image interpretation. Characteristics of this set of images have previously been described in detail (8) and are summarized here. The cases were gathered during weekly lung conferences. They derived mostly from community practice and were therefore interpreted initially by general radiologists. Patients included 21 men and nine women with a mean age (±standard deviation) of 65 years ± 11 (range, 42–86 years) and a smoking history of 41.1 pack-years ± 24.9. Each case included only one primary lung carcinoma, which was histologically proved. The visibility of the lesion was confirmed retrospectively in consensus by two independent radiologists with extensive experience in chest radiology (L.A., with 18 years of experience; and another staff member, with 20 years of experience). Median diameter of the lesions was 2 cm (25th percentile, 1.5 cm; 75th percentile, 2.9 cm; range, 1–7 cm). Overlying anatomy was complex in 26 (87%) of 30 cases. Complex overlying structures included at least two of the following: bone structure, hilum, vessel, and mediastinum. Distracting features (apical scar, granuloma, pleural disease, sternotomy wire, previous lobectomy, bronchiectasis) were present on 19 (63%) of 30 images. On the basis of nodule contrast, overlying anatomy, and distracting structures, the two radiologists in consensus estimated the overall difficulty of lesion detection to be difficult on 12 images, moderately difficult on 11 images, and easy on seven images.
The remaining 30 radiographs were considered normal and had been obtained from asymptomatic patients selected from the outpatient chest clinic who were matched for age and smoking history with the lung cancer patients. Normality was established by the same two radiologists on the basis of the absence of a known malignancy and the stability of chest radiographic findings for 2 years. Radiographs that showed granulomas, benign nodular lesions, and metastases were excluded from the study because they were considered equivocal. Nevertheless, radiographs that showed abnormalities that were stable over time and did not look nodular (old scars, peribronchial thickening, enlarged pulmonary arteries, emphysema) were included in the group of normal radiographs. The overall image set was therefore considered challenging.
Original radiographs that had been obtained with various standard screen-film systems were used for interpretation. For designation of the zone of suspicion, a schematic of the lungs was used on which the lungs were divided into 12 zones (Fig 1) such that each lung comprised three parts of equal height (upper, middle, and lower) and each part was divided into lateral and medial zones of equal width.
Observer Test Design
The observer test consisted of two parts (parts 1 and 2). In part 1, all 60 cases were read during two sessions of 30 minutes each to avoid observer fatigue. Normal and abnormal cases were intermixed in random order. During each session, the observers were shown 30 images and knew that they were to search for lung cancer that had been initially missed. The time limit was 1 minute per case, for consistency in temporal conditions between readings and to avoid a situation in which a reader might find a nodule after looking at a radiograph for an unusually long time. Readers were told that each abnormal radiograph showed only one lung cancer, but they did not know the percentage of abnormal radiographs and were not given any clinical information (eg, age or smoking history). For each case, the observers were asked to mark only one location and to rate their degree of suspicion concerning the presence of a primary lung cancer by using a visual analog scale. On this continuous scale from 0 to 100, a score of 0 corresponded to no suspicion (ie, certainty that the radiograph was normal), and that of 100, to no doubt that lung cancer was present. A score in the middle range (>40 to 60) was therefore considered ambiguous. The observers also were asked to indicate the zone of suspicion on a schematic of the lung. For radiographs regarded as normal (ie, scored 0), the observers were instructed to point out one zone and were asked to choose (forced to localize) the most suspicious zone (9).
Part 2 of the study was performed to analyze intraobserver consistency in reading chest radiographs. In this part of the study, half of the radiographs were reviewed during a single session by half of the observers randomly selected (three residents and three staff in each specialty), at least 6 weeks after completion of part 1 of the study. The 6-week interval was imposed to avoid recognition bias. The observers again were asked to rate their degree of suspicion with the same scoring system as was used in part 1 and to indicate the zone of suspicion on the schematic of the lung.
Statistical Analysis
Combined ROC and localization ROC analyses were performed as described by Swensson (9). First, the true-positive rate (ROC curve) and the true-positive rate with correct localization of lesions (localization ROC curve) were plotted as functions of the corresponding false-positive rate for each range of scores (0–20, >20 to 40, >40 to 60, >60 to 80, and >80 to 100) on the visual analog scale used to rate level of suspicion. Second, a distribution-free model (the so-called concurrent fit model described by Swensson) was simultaneously fitted to the two curves. Third, ROC Az and localization ROC Az values were estimated as measures of global performance. In addition, the percentage of correct localizations (PCL) among true-positive cases was calculated.
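The first step above can be illustrated with a minimal Python sketch that computes empirical operating points only. It is not the concurrent fit model of Swensson. The function name `roc_lroc_points`, the input layout (one score, one truth flag, and one localization flag per radiograph), and the rule that a case is called positive when its score exceeds a cut point are assumptions made for illustration.

```python
# Cut points separating the five score ranges on the 0-100 visual analog
# scale (0-20, >20-40, >40-60, >60-80, >80-100), as listed in the text.
CUT_POINTS = [20, 40, 60, 80]


def roc_lroc_points(scores, is_abnormal, localized_ok):
    """Return one (FPR, TPR, TPR with correct localization) tuple per cut point.

    scores       -- suspicion scores (0-100), one per radiograph
    is_abnormal  -- True where the radiograph shows a lung cancer
    localized_ok -- True where the marked zone contains the lesion
                    (only meaningful for abnormal radiographs)
    Assumption: a case counts as positive when its score exceeds the cut point.
    """
    n_abnormal = sum(is_abnormal)
    n_normal = len(is_abnormal) - n_abnormal
    points = []
    for cut in CUT_POINTS:
        fp = tp = tp_loc = 0
        for score, abnormal, ok in zip(scores, is_abnormal, localized_ok):
            if score <= cut:
                continue  # case is called negative at this cut point
            if not abnormal:
                fp += 1
            else:
                tp += 1
                if ok:
                    tp_loc += 1  # detected and correctly localized
        points.append((fp / n_normal, tp / n_abnormal, tp_loc / n_abnormal))
    return points
```

Because a correctly localized detection is also a detection, the localization ROC ordinate can never exceed the ROC ordinate at the same false-positive rate, which is why the localization ROC curve lies on or below the ROC curve.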
We also calculated maximum likelihood estimates of the intercept a and the slope b of the ROC curve by using the standard binormal statistical ROC model, with values plotted on a normal deviate axis, as follows: a = (m2 − m1)/d2, and b = d1/d2, where m1 and d1 represent the mean and standard deviation, respectively, of the score attributed to normal radiographs, and m2 and d2 represent the mean and standard deviation, respectively, of the score attributed to abnormal radiographs. The values of m1, m2, and a were used to evaluate the ability of observers in each category to discriminate normal radiographs from abnormal radiographs. A lower score attributed to normal cases (low m1) corresponded to a higher degree of confidence in detection of normality, and a higher score attributed to abnormal cases (high m2) corresponded to a higher degree of confidence in detection of abnormality. Thus, the better the observers' ability to discriminate between normal and abnormal radiographs, the higher the value of a (1).
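The definitions of a and b above can be sketched in Python as follows. This is a simplification: it plugs sample means and standard deviations into the formulas rather than computing maximum likelihood estimates, and the function name `binormal_params` is hypothetical. The area formula Az = Φ(a / √(1 + b²)) is the standard result for the binormal model and is included only for orientation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def binormal_params(normal_scores, abnormal_scores):
    """Return (a, b, Az) from scores given to normal and abnormal radiographs.

    Assumption: sample moments stand in for the maximum likelihood
    estimates described in the text.
    """
    m1, d1 = mean(normal_scores), stdev(normal_scores)      # normal cases
    m2, d2 = mean(abnormal_scores), stdev(abnormal_scores)  # abnormal cases
    a = (m2 - m1) / d2  # intercept: separation of the two score distributions
    b = d1 / d2         # slope: ratio of the two standard deviations
    az = NormalDist().cdf(a / sqrt(1 + b * b))  # binormal area under the curve
    return a, b, az
```

A low m1 or a high m2 both increase a, so an observer group can reach a higher a either by scoring normal radiographs more confidently as normal or by scoring abnormal radiographs more confidently as abnormal.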
ROC Az, localization ROC Az, and PCL values were compared by using pseudovalues with the jackknife method developed by Dorfman et al (10). This method, which permits generalization both for the population of readers and the sample of patients and which is applicable to all of the usual ROC indexes, was used as follows: For each reader, maximum likelihood estimates of ROC Az, localization ROC Az, and PCL were obtained by fitting the concurrent fit model 60 times, each time to a subsample of 59 radiographs. Each subsample was obtained by omitting one radiograph (i) in turn from the whole sample. Pseudovalues were then obtained by subtracting the estimates weighted with a scaling factor of 59/60 from the maximum likelihood estimates obtained for the whole sample of radiographs, as follows: pseudo Az(i) = Az(w) − [(59/60) · Az(i)], where pseudo Az(i) is the pseudo Az value for radiograph i, Az(w) is the maximum likelihood estimate of the Az value for the whole sample, and Az(i) is the estimated Az value for the subsample in which radiograph i is not included. A total of 2160 (36 × 60) pseudovalues were generated.
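The leave-one-out procedure can be sketched for a single reader as shown below. To keep the example self-contained, the empirical (Mann-Whitney) area under the ROC curve is swapped in for the concurrent fit estimate used in the study; the function names are hypothetical. The pseudovalue follows the formula as stated in the text. Multiplying it by the number of cases n gives the conventional jackknife form n · Az(w) − (n − 1) · Az(i).

```python
def empirical_auc(scores, is_abnormal):
    """Mann-Whitney estimate of the area under the ROC curve.

    Assumption: this nonparametric estimate replaces the concurrent fit
    model, which is not reproduced here.
    """
    pos = [s for s, a in zip(scores, is_abnormal) if a]
    neg = [s for s, a in zip(scores, is_abnormal) if not a]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def jackknife_pseudovalues(scores, is_abnormal):
    """Return one pseudovalue per radiograph for a single reader."""
    n = len(scores)
    az_whole = empirical_auc(scores, is_abnormal)
    pseudovalues = []
    for i in range(n):
        # Omit radiograph i and re-estimate the index on the remaining n - 1.
        sub_scores = scores[:i] + scores[i + 1:]
        sub_truth = is_abnormal[:i] + is_abnormal[i + 1:]
        az_sub = empirical_auc(sub_scores, sub_truth)
        # Formula as stated in the text: Az(w) - ((n - 1) / n) * Az(i).
        pseudovalues.append(az_whole - (n - 1) / n * az_sub)
    return pseudovalues
```

Applied to 36 readers and 60 radiographs, this would yield the 36 × 60 = 2160 pseudovalues mentioned above, one per reader and case.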
A linear mixed analysis-of-variance model was developed in which the response variable was the pseudovalue, fixed effects were observer specialty and rank (years of experience), and random effects were patients (cases) and readers. Interaction between fixed effects was not statistically significant and therefore was omitted from the final analysis. The F test was used to assess the significance of fixed effects. Adjustments for multiple comparisons were performed by using the Scheffé test.
The algorithm used for ROC and localization ROC data fitting in our distribution-free model was developed by using commercially available software (Mathematica, version 3.0; Wolfram Research, Champaign, Ill). Linear mixed analysis of variance also was performed by using software (SAS, version 8.2; SAS Institute, Cary, NC).
The distribution of false-positive localizations was evaluated only on abnormal radiographs because their distribution on normal radiographs was biased because of forced localization (ie, because observers were required to indicate a zone of suspicion even if they had no doubt that the radiograph was normal).
Intraobserver consistency was analyzed by comparing confidence scores and suspicious zones between parts 1 and 2 of the study and deriving intraclass and Pearson correlation coefficients. Since confidence scores were obtained by using a visual analog scale ranging from 0 to 100, there was a very low probability of an observer assigning exactly the same score twice to the same radiograph. To obtain a better measurement of agreement between the two readings, we looked for concordance between scores, which was indicated by a difference of 10 points or fewer. Categorical data for zones of suspicion in parts 1 and 2 of the study were compared by using weighted κ statistics.
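The score-concordance criterion and the Pearson coefficient can be sketched as follows. The function names and the paired-list input (one score per radiograph from part 1 and from part 2, in the same order) are assumptions; the intraclass coefficient and the weighted κ for zones are not reproduced here.

```python
from math import sqrt


def score_concordance(first, second, tolerance=10):
    """Fraction of radiographs whose two scores differ by `tolerance` or less."""
    agree = sum(1 for x, y in zip(first, second) if abs(x - y) <= tolerance)
    return agree / len(first)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

For example, scores of 50 and 65 for the same radiograph count as discordant, whereas scores of 90 and 80 count as concordant because the difference is exactly 10 points.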
| RESULTS |
|---|
κ = 0.60; 95% confidence interval: 0.51, 0.66), and again no significant difference was noted between observer categories according to rank or specialty.
| DISCUSSION |
|---|
Missing a lung cancer has always been a fear of every radiologist in daily practice. Several investigators have studied this problem and described the characteristics of missed lung cancers (2,3,5,8). ROC analysis has been used to evaluate observer performance in reading chest radiographs and other radiographs in previous studies (1,6,11), but we did not find any reported study with a focus on challenging lung cancer cases. Investigators in these studies reported a range of ROC Az values of 0.65–0.95. Radiologists' performance (Az = 0.75–0.95) was always better than that of other physicians (Az = 0.65–0.70) (1,11). The great variability in results can be explained by several factors related to the selection of cases and observers.
Concerning case selection, Shiraishi et al (6) have shown that observer performance is related to the degree of subtlety of cases. ROC Az of 0.75 was previously reported for radiologists searching for very subtle lung nodules. In the present study, the relatively poor performance of radiologists (ROC Az = 0.770) can therefore be explained in part by the difficulty of the cases (all had been missed at initial interpretation, and normal radiographs were considered challenging). A second important characteristic of our study is that asymptomatic patients were matched for age and smoking history with lung cancer patients. As a consequence, their radiographs were not strictly normal, and assessment of normality was very challenging in most cases. In previous studies, the normal population either was not described (6,11) or was composed of randomly selected asymptomatic patients undergoing systematic screening (1). In comparison with the radiographs of lung cancer patients, who are usually heavy smokers, strictly normal radiographs are quite easy to identify. Such a bias increases ROC Az and improves observer performance.
Concerning observer selection, Potchen et al (1) demonstrated a correlation between the level of training and the performance of observers. This finding is in agreement with the results of our study, in which staff performed better than residents (ROC Az of 0.740 vs 0.707). Although it is impossible to compare observer experience between different studies, we can assume that the difference in performance might be partly attributable to observer selection.
The relatively poor performance of observers in our study also could be explained in part by the limitation of reading time to 1 minute. In previous studies, reading time was either unspecified or unlimited (1,6,11). An important and striking finding was that observer performance dropped dramatically when lesion localization was considered. To our knowledge, observer performance has never been studied by using localization ROC curves for detection of real nodular lesions. Performance accuracy, however, is better estimated with combined ROC–localization ROC curves than with ROC curves alone (7,9,12). In Swensson's study (7), lung nodules were artificially generated by placing synthetic phantoms over a normal chest radiograph, and detection was easier because the overall appearance of the chest radiograph was normal, compared with that of a radiograph obtained in a smoker. Goin et al (13) evaluated the usefulness of adding localization data in mammographic interpretation and found that the localization ROC curve lay considerably below the ROC curve. Another advantage of using localization data is that the precision of estimates of ROC Az values is greatly improved. This improvement in precision is similar to that resulting from a two- to fourfold increase in the number of cases and/or observers (7).
In our study, a higher level of confidence in interpretation of normality appeared to be more discriminant for performance comparison among observer categories than did the ability to detect a suspicious lung nodule. Radiologists and staff pulmonologists showed less variability in the interpretation of normal radiographs than did other physicians. Potchen et al (1) described similar results for a study that involved 20 board-certified radiologists, and Monnier-Cholley et al (14) came to the same conclusion after an observer test that included normal chest radiographs and radiographs with interstitial lung disease. In the present study, assessment of normality was more difficult because of the fact that normal cases included those of smokers. Finally, the artificiality of the test, in which the observers knew what they were looking for, was probably responsible for a bias toward overdiagnosis (false-positive results) due to the fear of missing a lung cancer (15).
Our localization ROC analysis provided a precise map of regions of concern. In normal cases, the high percentage of hilar localizations identified by observers was probably due to the fact that the observers were required to point out the most suspicious zone (forced to indicate localization). On normal radiographs on which no obvious lesion was visible in the lungs, the hilar regions, which are among the most complex regions in the lung, were often designated. Even on abnormal radiographs, the hila were suspicious to the observer in more than one-third of false-positive cases. Lack of familiarity with hilar anatomy can partly explain this proportion, as previously discussed by Woodring (16). Further analysis of this map could help physicians to focus on areas of doubt and stimulate them to specific study.
Overall consistency in observer detection of lung nodules was poor. This was true both for confidence scoring and for localization. On abnormal radiographs, only 64% of nodule localizations were concordant between study parts 1 and 2. Several investigators have reported variability in radiologists' interpretation of chest radiographs in pneumoconiosis or pneumonia (17,18). More recently, Quekel et al (19) assessed observer consistency in the detection of nodular lesions and found inter- and intraobserver κ values of less than 0.6, comparable with our findings.
The results of our study do not characterize performance in diagnosis of lung cancer on chest radiographs in general, but they do give an idea of observer performance as measured with a test of a specific design. In our study, which included only cases of initially missed lung cancers and challenging normal cases and in which a reading time limit was imposed, the level of observer performance was low among trained physicians. Performance levels increased with experience, but the difference between observers was mainly due to a greater expressed certainty that the case was normal.
Our study had several limitations: Selection of lung cancer cases was not exhaustive during the study period, and a selection bias thus may have been introduced. The prevalence of abnormal radiographs was high (ratio, 2:1) compared with that in clinical practice, and this might have caused a contextual bias (15). In a recent study, however, Gur et al (20) demonstrated that if a laboratory prevalence effect existed, it was quite small. The fact that observers knew what they were looking for might have affected overall performance. The observers were not allowed to mark more than one location per case, a fact that influenced the measurement of localization ROC Az values. In addition, the time limit imposed for homogeneity of conditions between readings is not relevant to the clinical environment. Finally, the number of cases was low; this limitation, however, was alleviated in part by the relatively high number of observers. These limitations are largely inherent in all observer performance testing because test conditions are not representative of conditions in clinical practice.
The combination of ROC curves with localization ROC curves added interesting information about actual individual performance and distribution of suspected regions; intraobserver consistency was poor both for the degree of suspicion and the localization of abnormality. Chest radiographs should be read by trained physicians who are more capable of distinguishing normality from abnormality.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Authors stated no financial relationship to disclose.
Author contributions: Guarantor of integrity of entire study, L.M.C.; study concepts and design, L.M.C., L.A.; literature research, L.M.C., L.A.; clinical studies, L.A., L.M.C.; data acquisition, L.M.C.; data analysis/interpretation, L.M.C., L.A., F.C., B.P.C.; statistical analysis, F.C., B.P.C.; manuscript preparation and editing, L.M.C., L.A., B.P.C.; manuscript definition of intellectual content, L.M.C., L.A., J.M.T.; manuscript revision/review, L.M.C., L.A., B.P.C., F.C.; manuscript final version approval, all authors
| REFERENCES |
|---|