Gastrointestinal Imaging
1 From the Diagnostic Radiology Department, Clinical Center, National Institutes of Health, Bldg 10, Room 1C351, Bethesda, MD 20892-1182 (M.H., R.M.S., S.C.Y., L.B., A.L.); National Institute of Biomedical Imaging and Bioengineering (NIBIB)/Center for Devices and Radiological Health Joint Laboratory for the Assessment of Medical Imaging Systems, U.S. Food and Drug Administration (FDA), Rockville, Md (N.P.); Department of Radiology, Georgetown University School of Medicine, Washington, DC (E.M.I.); Department of Radiology, Walter Reed Army Medical Center, Washington, DC (J.R.C.); and Department of Radiology, University of Wisconsin Medical School, Madison, Wis (P.J.P.). From the 2005 RSNA Annual Meeting. Received December 19, 2006; revision requested February 16, 2007; revision received April 4; final version accepted May 4. Supported in part by the intramural research program of the National Institutes of Health Clinical Center (NIBIB). No FDA endorsement of any product or company mentioned in this manuscript should be inferred. Address correspondence to R.M.S. (e-mail: rms{at}nih.gov).
| ABSTRACT |
|---|
Materials and Methods: This HIPAA-compliant study was IRB-approved with written informed consent. Four board-certified radiologists analyzed 60 CT examinations with a commercially available review system. Two-dimensional transverse views were used for initial polyp detection, while three-dimensional (3D) endoluminal and 2D multiplanar views were available for problem solving. After initial review without CAD, the reader was shown CAD-identified polyp candidates. The readers were then allowed to add to or modify their original diagnoses. Polyp location, CT Colonography Reporting and Data System categorization, and reader confidence as to the likelihood of a candidate being a polyp were recorded before and after CAD reading. The area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, and specificity were estimated for CT examinations with and without CAD readings by using multireader multicase analysis.
Results: Use of CAD led to nonsignificant average reader AUC increases of 0.03, 0.03, and 0.04 for patients with adenomatous polyps 6 mm or larger, 6–9 mm, and 10 mm or larger, respectively (P ≥ .25); likewise, CAD increased average reader sensitivity by 0.15, 0.16, and 0.14 for those respective groups, with a corresponding decrease in specificity of 0.14. These changes achieved significance for the 6 mm or larger group (P < .01), 6–9 mm group (P < .02), and for specificity (P < .01), but not for the 10 mm or larger group (P > .16). The average reading time was 5.1 minutes ± 3.4 (standard deviation) without CAD. CAD added an average of 3.1 minutes ± 4.3 (62%) to each reading (supine and prone positions combined); average total reading time, 8.2 minutes ± 5.8.
Conclusion: Use of CAD led to a significant increase in sensitivity for detecting polyps in the 6 mm or larger and 6–9 mm groups at the expense of a similar significant reduction in specificity.
© RSNA, 2007
| INTRODUCTION |
|---|
Researchers in a recent reader study investigated whether three-dimensional (3D) viewing improves radiologists' accuracy in classifying computer-identified polyp candidates (5). A second reader study compared the use of a two-dimensional (2D) reading supplemented by concurrent CAD with a primary 3D reading (6). The purpose of our study was to evaluate the effect of CAD as a second reader on radiologists' diagnostic performance in interpreting CT colonographic examinations by using a primary 2D approach, with segmental, unblinded optical colonoscopy as the reference standard.
| MATERIALS AND METHODS |
|---|
Original CT Colonography, Optical Colonoscopy, and Reference Standard
The patient data set utilized for our study was a subset of patients in the original prospective study, wherein patients underwent same-day CT colonography and optical colonoscopy. That study included 1233 patient data sets (each data set consisted of supine and prone scans).
Patients underwent standard 24-hour colonic preparation with the oral administration of 90 mL sodium phosphate (Fleet 1 preparation; Fleet Pharmaceuticals, Lynchburg, Va) and 10 mg bisacodyl. As part of a liquid diet, patients also consumed 500 mL barium, 2.1% by weight (Scan C; Lafayette Pharmaceuticals, Lafayette, In) for solid-stool tagging and 120 mL diatrizoate meglumine and diatrizoate sodium (Gastrografin; Bracco Diagnostics, Princeton, NJ) for the opacification of luminal fluid (7).
Distention of the colon was achieved via patient-controlled insufflation of room air immediately before scanning. CT scanning with a four– or eight–detector CT scanner (LightSpeed or LightSpeed Ultra; GE Healthcare, Milwaukee, Wis) was performed while the patient held his or her breath in the supine and prone positions. Scan parameters were collimation, 1.25–2.5 mm; table speed, 15 mm per second; reconstruction interval, 1 mm; and scanner setting, 100 mAs and 120 kVp.
Both same-day and observer-performance study image interpretations were performed with a commercially available CT colonography system (V3D Colon, version 1.2; Viatronix, Stony Brook, NY) to include multiplanar 2D views (transverse, coronal, and sagittal) and a 3D endoluminal view.
The reference standard for the original study (1) was established by using segmental, unblinded optical colonoscopy. Segmental unblinding allows for the assessment of false-negative results at optical colonoscopy that would otherwise have been recorded as false-positive results at CT colonography (1). Polyps that were found with this method and located at same-day CT colonography were manually segmented. CAD performance was determined by comparing each CAD reading with the manual segmentations to determine if it was a true- or false-positive detection (see below).
Observer Study Cohort
Sixty of 1233 original patients were included in our study. Prestudy power analysis was not possible because of the multireader multicase (MRMC) receiver operating characteristic (ROC) reading paradigm, which requires both preliminary data utilizing our CAD algorithm and specialized analytic software (8,9). Moreover, no preliminary data were available for power assessment. These 60 were included on the basis of an expected reading time of 20 minutes per patient, requiring 20 total hours of reading for each radiologist.
Patients were randomly selected to achieve a 1:1 ratio of patients with and without identified polyps 6 mm or smaller (Fig 1). To prevent a patient with a large number of polyps from dominating the overall polyp distribution, random samples of patients from the with-polyps groups were taken until an average of about three polyps per patient was achieved (M.H.). Subsequently, one patient, originally identified as having a polyp 6 mm or smaller, was found to be mislabeled in our database. The polyp information for this patient was later corrected, leading to a final ratio of 31:29 for patients without polyps versus patients with polyps. A total of 81 polyps were identified (average size, 6.3 mm; range, 2–14 mm; up to nine polyps per patient), where size was based on the measured optical colonoscopy diameter of the polyps.
The per-polyp sensitivity of the CAD algorithm on our set of 60 patients was 42% (10 of 24), 26% (five of 19), 67% (six of nine), and 100% (five of five) for the detection of adenomatous polyps 6 mm or larger, 6–9 mm, 8 mm or larger, and 10 mm or larger, respectively. The CAD algorithm identified eight adenomatous polyps on both supine and prone views. The results suggest that the 60 cases used in our study were somewhat more difficult for the CAD algorithm to evaluate than were those of the general data set from which the cases were selected.
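The percentages quoted above follow directly from the detection counts. As a quick arithmetic check, the following minimal Python sketch recomputes them; the dictionary and function names are illustrative and are not taken from the study software.

```python
# Per-polyp CAD detections quoted in the text: (detected, total) per size group.
cad_detections = {
    ">=6 mm": (10, 24),
    "6-9 mm": (5, 19),
    ">=8 mm": (6, 9),
    ">=10 mm": (5, 5),
}


def sensitivity_percent(detected, total):
    """Return per-polyp sensitivity rounded to a whole-number percentage."""
    return round(100 * detected / total)


# Recompute the sensitivity for every size group.
cad_sensitivity = {
    group: sensitivity_percent(detected, total)
    for group, (detected, total) in cad_detections.items()
}
```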
Observer Performance Study Design
Four board-certified radiologists independently read the 60 CT colonography studies (reader 1, R.M.S.; reader 2, L.B.; reader 3, A.L.; and reader 4, E.M.I.; with 7, 2, 0, and 0 years of experience reading CT colonography, respectively). Readers 1–3 had experience with CAD software, but only reader 1 had experience with the review software. Before interpreting the test patient results, all radiologists received training involving the reading of a minimum of 25 CT colonographic examinations (not taken from the test patient population).
The expected CAD algorithm sensitivity and false-positive rate (2) were provided to all readers before the reading sessions began. The training consisted of five initial CT colonographic examinations where all the intricacies of the colonography and data recording programs were demonstrated. Twenty subsequent examinations were read by each reader utilizing the same protocol used for the test patients. In these training examinations, correct polyp locations were revealed after an initial blinded read to help identify potential reading problems and rectify possible causes of these errors. Readers were also shown pictures of polyps with different shapes (pedunculated, sessile, and flat) and examples of common false-negative and false-positive results.
Our study utilized a sequential read design (13,14). First, a reader had to locate and score all potential polyps without CAD on the supine and prone scans by using a 2D transverse view; 3D endoluminal and 2D reformatted planar views were available for problem solving. Our CAD outputs were incorporated into the software system for easy viewing during the CT colonography reading process. Readers were instructed to identify all polyp candidates 6 mm or larger and record their confidence that the candidate was a true polyp on a scale of 1–100 (1 = definitely not a polyp, 50 = uncertain, and 100 = definitely a polyp). The 100-point quasi-continuous rating scale was selected to help improve the precision of performance estimates, as suggested by Wagner et al (15).
Potential polyps were measured by the readers with 3D electronic calipers. Readers also recorded the view (supine or prone) on which the polyp candidate was seen, its shape (pedunculated, sessile, or flat), its location (rectum, sigmoid colon, descending colon, splenic flexure, transverse colon, hepatic flexure, ascending colon, or cecum), and whether the polyp candidate was on a fold. Treatment recommendations were also recorded for all patients and each identified polyp according to the proposed CT Colonography Reporting and Data System guidelines (16) used in our study: C1, normal colon or benign lesion, continue routine screening; C2, intermediate 6–9-mm polyp or indeterminate finding, surveillance or colonoscopy recommended; C3, polyp, possibly advanced adenoma, follow-up colonoscopy recommended; or C4, colonic mass, surgical consultation recommended.
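The items recorded for each polyp candidate can be pictured as a single structured record. The Python sketch below is only an illustration of that idea; the class name, field names, and validation rules are hypothetical and do not describe the data recording program actually used in the study.

```python
from dataclasses import dataclass

# Allowed values, as listed in the text.
VIEWS = {"supine", "prone"}
SHAPES = {"pedunculated", "sessile", "flat"}
SEGMENTS = {
    "rectum", "sigmoid colon", "descending colon", "splenic flexure",
    "transverse colon", "hepatic flexure", "ascending colon", "cecum",
}
C_RADS = {"C1", "C2", "C3", "C4"}


@dataclass(frozen=True)
class PolypCandidate:
    """One reader-marked polyp candidate (hypothetical field names)."""
    view: str
    shape: str
    segment: str
    on_fold: bool
    size_mm: float
    confidence: int  # 1 = definitely not a polyp, 100 = definitely a polyp
    c_rads: str

    def __post_init__(self):
        # Reject values outside the categories described in the text.
        if self.view not in VIEWS:
            raise ValueError(f"unknown view: {self.view}")
        if self.shape not in SHAPES:
            raise ValueError(f"unknown shape: {self.shape}")
        if self.segment not in SEGMENTS:
            raise ValueError(f"unknown segment: {self.segment}")
        if self.c_rads not in C_RADS:
            raise ValueError(f"unknown category: {self.c_rads}")
        if not 1 <= self.confidence <= 100:
            raise ValueError("confidence must be on the 1-100 scale")
```

A record such as `PolypCandidate("supine", "sessile", "cecum", False, 7.0, 80, "C2")` would then describe a 7-mm sessile candidate in the cecum rated at a confidence of 80.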
The initial reading was referred to as without CAD. Once that reading was completed, CAD prompts were immediately turned on and CAD readings for supine and prone positions were evaluated by the reader. The reader was then allowed to add and score any newly identified polyps. Readers were free to update their confidence scores for any previously identified lesions, their overall CT Colonography Reporting and Data System classification, and suggested follow-up for the patient on the basis of the CAD prompts. The second reading is referred to as with CAD.
The CT colonographic images with and without polyps were intermixed and shown in random order to each reader. The radiologists were limited to 20 minutes per examination to mimic a clinical practice situation. A dual-monitor workstation was used, with the scans displayed on one screen and the reader study controls displayed on the other. The amount of time the radiologist used for the readings with and without CAD was recorded.
Statistical Analysis
The readings with and without CAD and the differences in sensitivity and specificity for each reader and for the reader average were determined by using the CT Colonography Reporting and Data System categorization given for each patient. Any recommendation other than normal 10-year follow-up for a patient was considered as a positive recommendation in the analysis of sensitivity and specificity. Readings with and without CAD and the differences in area under the ROC curve (AUC) for individual readers, as well as for the average reader, were also determined (N.P.).
The highest confidence score for readings with and without CAD was used as the reader's confidence level for the patient and was subsequently used in the ROC performance estimates. If the reader did not identify any polyp candidates in a patient, the confidence rating was set to a score of 0.
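The two patient-level rules described above (a positive call for any recommendation other than routine follow-up, and a patient rating equal to the highest candidate score, or 0 when nothing was marked) can be sketched in Python as follows. The function names are illustrative, and treating C1 as the only negative category is our reading of the text rather than a documented detail of the analysis code.

```python
def patient_confidence(candidate_scores):
    """Patient-level rating: highest candidate score, or 0 if none were marked."""
    return max(candidate_scores, default=0)


def is_positive(c_rads_category):
    """Assumption: only C1 (continue routine screening) counts as negative."""
    return c_rads_category != "C1"


def sensitivity_specificity(categories, has_polyp):
    """Per-patient sensitivity and specificity from category calls and truth."""
    tp = fn = tn = fp = 0
    for category, truth in zip(categories, has_polyp):
        called = is_positive(category)
        if truth and called:
            tp += 1
        elif truth:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)
```

The same functions would be applied twice per reader, once to the readings without CAD and once to the readings with CAD, and the two results subtracted to obtain the differences reported in the study.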
MRMC ROC analysis is the preferred method for measuring the clinical utility of CAD algorithms with AUC as a common ROC summary performance measurement (17). MRMC analysis is preferred because conclusions generalize not only to a new set of cases, but also to a new set of readers. The ROC paradigm allows one to distinguish the intrinsic disease-detection performance of the test from the level of aggressiveness of the reader or observer in setting a threshold for action (8). AUC, defined as the expected sensitivity averaged across all specificities, was also utilized because it often eliminates the ambiguity of trading specificity for an increase in sensitivity, which is common when comparing sensitivity and specificity operating points for readings with and without CAD.
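For an empirical (nonparametric) ROC curve, the AUC equals the probability that a randomly chosen patient with disease receives a higher rating than a randomly chosen patient without disease, with ties counted as one half. The short Python sketch below illustrates this for a single reader; it is a generic illustration with a hypothetical function name, not the MRMC software used in the study.

```python
def empirical_auc(diseased_scores, healthy_scores):
    """Empirical AUC: fraction of (diseased, healthy) pairs ranked correctly.

    Ties between a diseased and a healthy score contribute one half.
    """
    wins = 0.0
    for d in diseased_scores:
        for h in healthy_scores:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased_scores) * len(healthy_scores))
```

Because the value depends only on how the ratings are ordered, it does not change when a reader shifts the threshold for action, which is the property the ROC paradigm relies on.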
The radiologist false-positive results were reviewed by reader 1 after completion of the entire study. The false-positives were classified as being most likely a result of normal haustral folds, stool, or ileocecal valves.
We reported 95% confidence intervals for the difference between readings with and without CAD for individual radiologists and the average reader. We also reported significant differences in performance for each radiologist and the average reader by using a P value of less than .05 to indicate a significant difference. Our statistical assessment is based on an MRMC methodology that included both reader and case variability in the analysis (18,19). This more general analysis paradigm usually produces wider error bars, compared with only considering case variability.
The advantage of MRMC is that the reported results can be generalized to the populations of readers and cases. Therefore, any conclusions can apply not only to our four radiologists but also to a population of similar radiologists reading CT colonographic images. The individual reader and MRMC confidence intervals for sensitivity and specificity were calculated on the basis of bootstrap resampling (19). The sensitivity and specificity for readings with and without CAD were correlated with each other because the assessments immediately followed one another. This correlation precludes the use of the most common binomial methodology for assessing sensitivity, which assumes uncorrelated estimates (20).
Patient-based bootstrap resampling was utilized to properly account for this correlation. The 95% confidence intervals and significance of AUC for each reader and for the readers as a whole were obtained by using a modified version of the University of Chicago MRMC software, LABMRMC (version 0.9B) (21), which is based on the Dorfman-Berbaum-Metz algorithm (18). LABMRMC was modified to use the AUC for our analysis in place of its original summary measure, the area under a fitted binormal ROC curve, or Az; all other components of the LABMRMC software were the same. Reading times before and after CAD were also tracked and recorded. Statistical comparisons of reading times were made with the paired t test, and a P value of less than .05 indicated a significant difference.
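The idea behind patient-based resampling is that each bootstrap draw keeps a patient's readings with and without CAD together, so their correlation is preserved. The Python sketch below shows a simplified single-reader version that returns a percentile 95% confidence interval for the change in sensitivity. It is an assumption-laden illustration: the function name, the number of replicates, and the percentile method are our choices, and the study's MRMC bootstrap additionally accounted for reader variability, which is not modeled here.

```python
import random


def paired_sensitivity_bootstrap(calls_without, calls_with, n_boot=2000, seed=0):
    """Percentile 95% CI for the with-minus-without change in sensitivity.

    Both inputs hold 0/1 calls for the same diseased patients in the same
    order, so every resampled patient contributes both of its readings.
    """
    rng = random.Random(seed)
    n = len(calls_without)
    diffs = []
    for _ in range(n_boot):
        # Resample patients (not individual readings) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sens_without = sum(calls_without[i] for i in idx) / n
        sens_with = sum(calls_with[i] for i in idx) / n
        diffs.append(sens_with - sens_without)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper
```

An interval that excludes zero would correspond to a significant change at the .05 level under this simplified scheme.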
| RESULTS |
|---|
ROC Analyses
The individual and average reader ROC results show improvements in AUC for all three size groups (Table 3). While none of the results achieved significance, there was a trend toward improved AUC in each group.
Reading Times
The average time to read each case without CAD was 5.1 minutes ± 3.4 (standard deviation) (Table 4). Readings with CAD added 3.1 minutes ± 4.3 (62%), yielding an average read time of 8.2 minutes ± 5.8 for each patient. The radiologists were classified in two groups: Two readers had total read times of less than 6 minutes and two readers had total read times averaging about 12 minutes.
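As stated in the statistical methods, reading times were compared with a paired t test. The Python sketch below shows how the mean added time and the paired t statistic could be computed from per-case times. The input values are hypothetical and are not study data, and the function name is our own.

```python
import math
import statistics


def paired_t_statistic(times_without, times_with):
    """Return (mean difference, t statistic, degrees of freedom) for paired data."""
    diffs = [w - wo for wo, w in zip(times_without, times_with)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference uses the sample standard deviation.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / std_err, n - 1


# Hypothetical per-case reading times in minutes (not study data).
without_cad = [4.0, 5.0, 6.0, 5.0]
with_cad = [6.0, 8.0, 9.0, 9.0]
mean_added, t_stat, dof = paired_t_statistic(without_cad, with_cad)
```

The P value would then be read from a t distribution with `dof` degrees of freedom; where SciPy is available, `scipy.stats.ttest_rel` returns the statistic and the P value together.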
| DISCUSSION |
|---|
All four radiologists achieved higher ROC performance with CAD than without for patients with a maximum adenomatous polyp size in the 6 mm or larger and 6–9-mm groups, whereas for the 10 mm or larger group only the two least experienced radiologists performed better with CAD than without. None of the individual or average radiologist AUC differences achieved significance for patients with a maximum adenomatous polyp in any group (P > .05). Possible reasons for this include the small number of readers and cases and the lower sensitivity of the CAD algorithm on the selected cases.
These improvements in sensitivity and ROC performance strongly suggest that CAD as a second reader improved the identification of patients with smaller adenomatous polyps. In addition, these data support the conclusion that the trade-off of increasing sensitivity at the expense of specificity is worthwhile because it results in higher ROC performance.
As one might expect, the least experienced readers tended to have the strongest improvement in ROC performance. Reader 3 (with no CT colonography experience, but some with CAD) saw strong improvement across the 6–9-mm and the 10 mm or larger groups. Reader 4 (no CT colonography or CAD experience) had only a slight (0.01) increase in ROC performance for patients with 6–9-mm polyps, but a large (0.11) improvement for patients with polyps 10 mm or larger. The more experienced readers showed strong improvement for the 6–9-mm group but a reduction in performance for the 10 mm or larger group. Because their sensitivity in this group was already perfect, any false-positive result prompted by CAD, even one smaller than 10 mm, could only hurt performance.
The sensitivity improvement tended to mirror the ROC improvements for each reader, although it is interesting to note that reader 2 experienced the largest increase in sensitivity and ROC performance, as well as the largest decrease in specificity. This suggests that reader 2 was more willing to utilize the CAD information than were the other readers. CAD may assist less experienced readers in detecting larger adenomatous polyps, but the overall benefits of CAD are likely also tied to how an individual reader interacts with the CAD program.
Our sensitivities for readings with CAD of 61% and 95% for patients with polyps 6 mm or larger and 10 mm or larger, respectively, are comparable to the 63% and 83% sensitivities reported by Taylor et al (6). Both studies utilized primary 2D image analysis. However, we limited our study to adenomatous polyps while Taylor et al evaluated all polyps. The other main differences between the studies are the CAD algorithm and its application. In particular, Taylor et al applied CAD concurrently while our CAD algorithm was used as a second reader.
We found the performance for readings without CAD to be poor compared with the results reported in the original trial that included the same cases (1). There are several possible explanations for this difference. First, we used a primary 2D image interpretation paradigm in our study compared with the primary 3D fly-through approach used in the original trial.
Second, the case selection consisted of a much smaller cohort of 60 patient data sets, with more difficult polyps and a substantially smaller fraction of healthy patients. The use of difficult cases has been encouraged by Metz et al (22) to maximize the differences between readings with and without CAD, thereby minimizing the number of cases and readers required for our study. That the polyps may have been more difficult to detect is supported by the lower detection performance for CAD on these cases. For example, the CAD algorithm performance on polyps 8 mm or larger was 80.8% in the original trial (2), compared with only 67% for the subset of cases used in our reader study.
Third, the retrospective reading paradigm used in our study may have led to less vigilance than the prospective paradigm used in the original trial in which the radiologists were, in a sense, competing nearly in real time against colonoscopy.
Finally, the radiologists participating in our study had different backgrounds and skill sets compared with the radiologists in the original trial.
Developing study designs that accurately measure the influence of CAD is critical. Simply relying on measuring an increase in sensitivity is not enough because this is almost always accompanied by a decrease in specificity, as observed in our study. The appropriate trade-off between increased sensitivity and decreased specificity is difficult to quantify. This limitation is the primary rationale for conducting ROC experiments (17). However, one should be aware that simply asking the reader to record his or her confidence is not enough to guarantee useful data.
In our study, all of the readers had some trouble using the 100-point scale, especially when spreading their scores out across the suspicion scale. Reader 4, in particular, had very little spread in confidence scores between patients so that a binormal ROC curve could not be fit to this reader's data. This precluded the use of parametric binormal ROC curve fitting, the standard in MRMC software, and led to our reliance on empirical ROC analysis.
In addition, the sensitivity and specificity results, on the basis of a CT Colonography Reporting and Data System treatment decision for the patient, did not always trend the same way as the ROC performance curves, which were based on the highest rated polyp in the patient. Reader 2, for example, had large increases in sensitivity and ROC performance for the 6–9-mm group but also a large reduction in specificity.
In contrast, reader 3 had a small increase in sensitivity and a small specificity penalty but a substantial increase in ROC performance for the 6–9-mm group. These mixed results suggest that more training and additional monitoring are necessary, focusing not just on the use of the CT colonography or the CAD software but also on familiarizing the reader with the scoring mode utilized in ROC experiments.
Reading times were not strongly correlated with overall performance or with the CAD benefit in our study. We did find that reading times (5.1 minutes ± 3.4) were much shorter than anticipated for all readers; in fact, they were significantly shorter (P < .001) than the prospective reading times of 19.8 minutes ± 7.5 for the same 60 cases in the original study (1).
One reason for the sizeable disparity is that the original reviewers had to assess not only the colon but also the extracolonic regions in the scans, which would add time to their interpretations. As part of the original study for assessing interobserver agreement, a second radiologist at a different center retrospectively reviewed only the colon in 100 randomly selected cases (1). The average read time for these cases was 8.0 minutes, suggesting that a portion of our 14.7-minute difference could be associated with the evaluation of extracolonic findings.
Another possible reason for the difference is the use of primary 2D reading in our study and primary 3D reading in the prior study. Taylor et al (6) found that 3D endoluminal interpretation increased reading time about 30% over their 2D analysis with CAD. Part of the time difference between our study and the original may be associated with our retrospective reading design and/or insufficient reader training, resulting in our readers not approaching every case with the same rigor as they might have during actual clinical practice.
The reading times showed that CAD adds about 62% on average to the overall reading time for a CT colonography patient. This increase appears similar regardless of how quickly a reader evaluated the cases, with the fastest readers finding about the same increase as the slower readers. If this trend holds true for the expected longer clinical reading times associated with primary 3D CT colonography reading, the impact of CAD on patient throughput could be substantial. The increased reading time associated with some CT colonography CAD systems as second readers may limit CAD implementation because of its negative impact on patient throughput. Further prospective studies are necessary to measure the full impact of CT colonography CAD on current clinical workflow.
Our study had limitations. First, the number of patient cases utilized in our study was modest, especially for patients with polyps 10 mm or larger, which only included five patients. This limited our ability to see significant changes in the AUC analysis. Second, the radiologists conducted primary 2D reading without and then with CAD. Our reported results would not be expected to generalize to a primary 3D fly-through reading paradigm. Third, the case set was difficult for both the radiologist and the CAD algorithm. While this "stress test" should be more efficient in identifying the benefits of CAD, the reported benefits may not match what is found in routine clinical practice.
Fourth, the sensitivity and specificity confidence intervals and significance are calculated on the basis of bootstrap analysis methods. The bootstrap was necessary to correctly account for reader and case variability; however, the bootstrap does not always effectively sample the tails of the difference distribution, leading to additional uncertainty in confidence interval and significance estimates. Fifth, the readers had some trouble spreading their scores out across the suspicion scale, complicating the ROC analysis and interpretation.
In conclusion, we found that this CAD system, when used by radiologists as a second reader, led to increased sensitivity and decreased specificity for detecting polyps 6 mm or larger and 6–9 mm in size.
| ADVANCES IN KNOWLEDGE |
|---|
| IMPLICATION FOR PATIENT CARE |
|---|
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Abbreviations: AUC = area under the ROC curve, CAD = computer-aided detection, MRMC = multireader multicase analysis, ROC = receiver operating characteristic, 3D = three-dimensional, 2D = two-dimensional
Guarantor of integrity of entire study, N.P.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; literature research, M.H., R.M.S.; clinical studies, N.P., L.B., E.M.I.; experimental studies, N.P., M.H., R.M.S., S.C.Y.; statistical analysis, N.P.; and manuscript editing, N.P., M.H., R.M.S., S.C.Y., L.B., E.M.I., A.L.
| References |
|---|