Gastrointestinal Imaging
1 From the Department of Specialist X-Ray, University College Hospital, 2F Podium, 235 Euston Rd, London NW1 2BU, England (S.A.T., R.G., S.H.); Department of Intestinal Imaging, St Mark's Hospital, Harrow, England (R.I., E.T., V.A.S., D.B., P.B.); Department of Radiology, Beijing Friendship Hospital, Beijing, China (J.Z.); and Abdominal Imaging Section, University of Wisconsin Medical School, Madison, Wis (P.J.P.). Received May 9, 2007; revision requested July 13; revision received August 17; accepted September 19; final version accepted October 12. Supported in part by the Department of Health's NIHR Biomedical Research Centres funding scheme. Address correspondence to S.A.T. (e-mail: csytaylor{at}yahoo.co.uk).
| ABSTRACT |
|---|
Materials and Methods: Ethics committee approval and informed consent were obtained for this HIPAA-compliant study. Four readers each read 48 data sets (26 men, 22 women; mean age, 57 years) from a screening population (three containing polyps) without CAD application, followed by review of the CAD output and recorded findings and diagnostic confidence. The 45 data sets that were designated as normal were chosen such that 22 generated 15 or fewer FP CAD marks and 23 generated more than 15 FP CAD marks. Sensitivity, specificity, and receiver operating characteristic (ROC) curves were calculated with and without CAD. The relationships between the number of CAD FP marks and reader confidence, reporting times, and correct data set classification were analyzed by using linear and logistic regression.
Results: Across all readers, CAD resulted in four additional FP detections. Overall reader sensitivity and specificity (6-mm polyp threshold) before and after CAD application were 0.75 (95% confidence interval [CI]: 0.43, 0.95) versus 0.83 (95% CI: 0.52, 0.98) and 0.96 (95% CI: 0.91, 0.98) versus 0.93 (95% CI: 0.88, 0.96), respectively. The area under the ROC curve increased from 0.57 (95% CI: 0.34, 0.80) to 0.61 (95% CI: 0.42, 0.80). There was no correlation between an increasing number of CAD FP marks and reader confidence (P = .71) or correct study classification (P = .23), but there was a positive correlation with CAD-assisted reading times (0.06 [95% CI: 0.02, 0.10], P = .002).
Conclusion: Increasing numbers of CAD FP marks did not adversely influence correct reader study classification or diagnostic confidence, although reporting times did increase.
© RSNA, 2008
Supplemental material: http://radiology.rsnajnls.org/cgi/content/full/2471070816/DC1
| INTRODUCTION |
|---|
For computed tomographic (CT) colonographic screening to be cost effective, unnecessary colonoscopy precipitated by false-positive (FP) CT colonographic interpretations must be kept to a minimum (10,11). All colon CAD systems generate FP marks, and there is potential for these to adversely influence reader specificity and/or efficiency in a low-prevalence screening setting, an observation well described for screening mammography (12). However, it is unknown whether the actual number of CAD FP marks matters or whether an increase in the number of marks influences the effectiveness of CAD. Thus, the purpose of our study was to retrospectively evaluate the effect of increasing numbers of CAD-generated FP marks on reader specificity and reporting times by using CT colonography in a low-prevalence screening population.
| MATERIALS AND METHODS |
|---|
Ethics committee approval and patient consent were obtained from the donor institution for the CT colonographic data sets used in this Health Insurance Portability and Accountability Act–compliant study.
Data Set Preparation
A colon CAD system (ColonCAD API, version 2.0; Medicsight) was applied to a database of CT colonography studies collected from an ongoing single-site screening program (University of Wisconsin Hospitals and Clinics, Madison, Wis) (Fig 1). Studies had the following characteristics: bowel preparation, 45 mL sodium phosphasoda (18 hours prior to CT colonography; Fleet Pharmaceuticals, Lynchburg, Va) with 2% barium suspension (250 mL, 15 hours prior to CT colonography; Scan C, Lafayette Pharmaceuticals, Lafayette, Ind) and diatrizoate meglumine and diatrizoate sodium (60 mL, 12 hours prior to CT colonography; Gastrografin, Bracco Diagnostics, Princeton, NJ). Scan parameters for the 16-section CT scanner (LightSpeed; GE Healthcare, Milwaukee, Wis) were 1.25-mm collimation, 1-mm reconstruction interval, 120 kVp, and 50–75 mAs. By using a software harness, the total number of CAD marks (supine and prone combined) was counted for each data set (patient) that was originally reported as normal (ie, containing no polyps ≥6 mm) by five experienced program radiologists (P.J.P. and four nonauthors, with experience with 500–1000 endoscopically verified data sets each).
These 48 data sets were then reread by two additional radiologists (D.B., S.A.T., experienced with 800 and 900 endoscopically verified CT colonography studies, respectively) with and without CAD application (see below) to confirm complete radiologic normality and that all CAD marks were FPs in the 45 negative studies and to locate the known polyp by using colonic segment and section numbers in the three positive studies. The readers also assessed the quality of the bowel preparation and distention for each data set, noting whether areas of colonic mucosa remained unseen owing to fecal residue and/or untagged fluid or collapse, and classified each study by consensus as either diagnostically acceptable or not (ie, necessitating a repeat study owing to inability to exclude a ≥6-mm polyp).
Power calculation.—Given the potential statistical deficiencies of powering a study by using continuous data (in this case, CAD FP marks) with a binary outcome (reader study classification as FP or true-negative), the power calculation was performed by using a binary cutoff. Given previous data (7), it was assumed that observers using CAD would generate FP detections in 10% of normal studies with 15 or fewer FP marks. We hypothesized that this would increase to 30% with more than 15 FP marks per study. To detect this difference (at 5% significance and 80% power), 72 readings in each group (144 total) were required. This was increased to 180 readings to allow for some nonindependence of data arising from the use of more than one reader.
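The article does not state which sample-size formula was used. As a hedged illustration only, the standard two-proportion formula with a Fleiss continuity correction (an assumption on our part, as are the function and parameter names) reproduces the figure of 72 readings per group from the stated inputs of 10% versus 30%, 5% two-sided significance, and 80% power.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80, continuity=True):
    """Readings needed per group to compare two proportions.

    Assumed method (not confirmed by the article): normal-approximation
    formula with pooled variance under the null hypothesis, optionally
    followed by the Fleiss continuity correction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    diff = abs(p1 - p2)
    n = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / diff ** 2
    if continuity:
        n = n / 4 * (1 + sqrt(1 + 4 / (n * diff))) ** 2
    return ceil(n)


# FP detection rates assumed in the text: 10% (15 or fewer marks) vs 30% (more than 15)
print(n_per_group(0.10, 0.30))                    # 72
print(n_per_group(0.10, 0.30, continuity=False))  # 62
```

Without the continuity correction the same inputs give 62 per group, so the correction is one plausible explanation for the published figure rather than a confirmed one.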
Reader selection and CAD workstation integration.—Four radiologists (E.T., R.G., R.I., V.A.S.) took part in the study. All had been at a 2-day dedicated CT colonography workshop within 6 months preceding the study, were familiar with the workstation (Vitrea 3.8; Vital images), had read at least 75 endoscopically validated data sets (range, 75–120), and were reporting findings at unaided CT colonography in daily clinical practice.
Before the study, readers were provided with historical data relating to CAD performance (13). In brief, the data described external validation of the CAD software, providing expected sensitivity and FP rates in similar CT colonographic data sets, at the settings used for the present study. Readers were also given a 1-hour tutorial on the specific integration in the workstation software. The workstation integration used for the study has been described elsewhere (6,7). In brief, the software segments the colon included in the CT data set and determines the inherent sphericity of all objects projecting into the colonic lumen. Detections with a sphericity above a predetermined threshold level are then prompted visually to the observer by using small red dots superimposed over the region of interest on two-dimensional (2D) transverse and three-dimensional (3D) endoluminal views. The CAD iteration used analyzes each CT scan acquisition (supine or prone) independently, and, as such, CAD marks are not matched between the supine and prone positions.
Reading sessions.—To mimic clinical practice, readers were informed that studies were acquired from an asymptomatic screening population but were given no other information about the prevalence of abnormality or the aims of the study. Each reader was provided with a list of study numbers in randomized order and was instructed to independently read the studies (see below) over a period of 2 weeks (to mimic normal reporting volumes).
Reading paradigms.—Readers were free to use the full functionality of the workstation (ie, 2D transverse, multiplanar reformations, 3D cube, and full endoluminal fly-through images), mirroring normal clinical practice, and were instructed to first analyze each case (ie, prone and supine data sets) without CAD, as per their usual clinical practice (unassisted read). Readers were specifically told to follow CT Colonography Reporting and Data System (CRADS) guidelines (14) (ie, only studies containing a polyp measuring ≥6 mm were considered abnormal; polyps were measured as per CRADS guidelines). Readers were free to measure polyps by using either 2D multiplanar reformation or 3D endoluminal views, according to their usual practice. Readers noted interpretation time (defined as time taken to read the data set once opened on the workstation) on a study sheet along with each perceived abnormality noting colonic segment, 2D transverse section number, lesion size (in millimeters), and overall diagnostic confidence that the case was normal (scored from 1 [least confident] to 100 [most confident]). Readers were not told to use a particular confidence score to indicate if they would recommend colonoscopy in clinical practice.
Once this initial read was complete, readers immediately applied the preprocessed CAD and reviewed the case again. There was no software functionality to move automatically from CAD mark to CAD mark (eg, by hitting a specific keyboard key) and readers assessed each CAD mark by scrolling through the data set. Readers documented any additional findings seen with CAD and were permitted to discard any of their unassisted findings. Readers then revised overall case confidence in light of CAD and recorded the additional time taken. At the end of the study, the preferred primary reading method (2D or primary 3D endoluminal fly-through) was documented for each reader.
Case marking.—A radiologist (J.Z., experienced with 300 endoscopically validated CT colonographic data sets) who did not take part in the main study reviewed the reader report forms, documenting reader performance against known patient status. In consensus with two other radiologists (D.B., S.A.T., experienced with 800 and 900 endoscopically verified studies, respectively), the causes of all reader FP marks were classified in seven categories as follows: (a) bulbous fold, a prominent fold in an otherwise well distended segment; (b) segment under distension; (c) fecal residue and/or residual fluid; (d) normal colonic anatomy (eg, ileocecal valve, redundant mucosa, internal hemorrhoid); (e) extracolonic; (f) diminutive (≤5 mm) polyp; and (g) unexplained. Finally, all normal data sets were reviewed and all CAD FP marks classified in the same way. Because the particular iteration of the CAD system utilized does not provide sizes of lesions detected with CAD, instead marking a region of interest for radiologist review, it was not possible to rank CAD FP marks according to size.
Statistical Analysis
Effect of CAD.—Per-case sensitivity and specificity were calculated with and without CAD. The effect of CAD on reader confidence was assessed by using a paired t test. Receiver operating characteristic (ROC) curves were generated for each reader with and without CAD on the basis of the association between confidence level and correct case classification (normal vs abnormal).
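The article does not say how the confidence intervals around sensitivity were derived. As a sketch under stated assumptions, an exact (Clopper-Pearson) binomial interval applied to 9 of 12 and 10 of 12 correct positive readings reproduces the intervals quoted in the abstract. The counts are inferred from three positive data sets read by four readers and the reported sensitivities of 0.75 and 0.83, and the function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_decreasing(f, lo=0.0, hi=1.0, iters=100):
    """Bisection for a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for k successes in n trials."""
    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root_of_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
    )
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root_of_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2
    )
    return lower, upper


# Assumed counts: 3 positive cases x 4 readers = 12 positive readings
for label, hits in (("without CAD", 9), ("with CAD", 10)):
    lo, hi = clopper_pearson(hits, 12)
    print(f"{label}: {hits / 12:.2f} (95% CI: {lo:.2f}, {hi:.2f})")
```

With only 12 positive readings the intervals are necessarily wide, which is the point the Discussion makes about the sensitivity data.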
Effect of CAD FP marks.—The distribution of the causes of CAD FP marks was compared between those data sets generating 15 or fewer CAD FP marks and those generating more than 15 CAD FP marks by using a χ² test, and the number of reader FP detections before and after CAD was calculated. Linear regression was used to examine the relationship between the number of CAD FP marks and both reader case confidence and reporting times. The effect of CAD FP numbers on correct reader classification of normal data sets was also examined by using logistic regression. Robust standard errors (by using the Huber-White sandwich estimator of variance) (15) were used for all regression analyses to account for the fact that each case was included in the analysis four times (once for each observer).
| RESULTS |
|---|
CAD Performance
CAD correctly detected all three polyps in the abnormal studies, generating 13, 18, and 25 FP marks per whole data set. In the normal studies, the contribution of each cause of CAD FP marks (Table 1) was not significantly different between those data sets with 15 FP marks or fewer and those with more than 15 FP marks (P = .69).
ROC Curves
The area under the ROC curve (AUC) increased with CAD for two of four readers, was unchanged for one, and marginally decreased for one (Table 4).
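To make the link between confidence scores and AUC concrete, the following minimal sketch computes AUC as the probability that a normal case receives a higher "confidence that the case is normal" score than an abnormal case, with ties counted as one half. This is a generic nonparametric estimate, not necessarily the method the authors used, and the scores below are hypothetical values invented purely for illustration.

```python
def auc_from_confidence(normal_scores, abnormal_scores):
    """Nonparametric AUC for scores expressing confidence that a case is normal.

    Each (normal, abnormal) pair scores 1 if the normal case has the higher
    confidence, 0.5 if the two scores are tied, and 0 otherwise.
    """
    total = 0.0
    for n_score in normal_scores:
        for a_score in abnormal_scores:
            if n_score > a_score:
                total += 1.0
            elif n_score == a_score:
                total += 0.5
    return total / (len(normal_scores) * len(abnormal_scores))


# Hypothetical scores on the 1-100 scale; NOT data from the study
normal = [90, 80, 80, 60]
abnormal = [80, 30]
print(auc_from_confidence(normal, abnormal))  # 0.75
```

Because readers tend to reuse a small set of confidence values, ties are common, which is why the tie handling matters for this kind of estimate.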
Reporting times.—The mean reporting time across all four readers was 8.6 minutes (standard deviation, 3.6) for the unassisted read, with an additional 3.6 minutes (standard deviation, 1.5) for the CAD read (42% increase). The regression coefficient relating the number of CAD FP marks to CAD reporting time was 0.06 (95% CI: 0.02, 0.10) (P = .002), indicating a small but significant positive correlation between increasing CAD FP marks and reading time. The additional time for review of the CAD output was 3.3 minutes (standard deviation, 1.4) for studies with 15 or fewer FP marks and 3.9 minutes (standard deviation, 1.6) for studies with more than 15 FP marks.
Correct case classification.—For each additional CAD FP mark, the odds ratio for readers correctly classifying a normal case was 1.14 (95% CI: 0.92, 1.40) (P = .23), indicating no significant detrimental effect of increasing numbers of CAD FP marks on correct reader case classification.
| DISCUSSION |
|---|
By keeping CAD settings constant throughout, we ensured that the contribution of each cause of CAD FP marks was constant among all data sets. We did not adjust the CAD output in any way to artificially change the number or type of CAD marks. The main causes of FP marks were fecal and/or fluid residue and normal colon anatomy. The CAD system we used had been tested on data sets by using oral tagging agents (19), and our data again support the concept that, in general, CAD FP marks are easily dismissed by trained radiologists (20). The plausibility of CAD FP marks is perhaps more important than the actual number.
Clearly, plausible CAD marks require greater radiologist work-up than do easily dismissed detections and are more likely to produce an actual radiologist report with FP findings. Indeed, it could be argued that if CAD produces more than 15 marks per data set, many are likely to be dismissed with relative ease. However, it is difficult to define a plausible CAD mark, and numbers will differ between individual data sets. For example, in our study, CAD FP marks resulting from fecal residue and bulbous folds were clearly deemed plausible by some readers (resulting in FP detections) and dismissed by others. All CAD FP marks were dismissed as such by our experienced radiologists, but this was not the case for the less experienced readers. Because of this subjectivity, we did not attempt to grade plausibility of CAD FP marks, although we indirectly assessed this by recording reader confidence levels after CAD. This concept clearly requires further study.
It could be argued that our data support the notion that as many as 25 CAD FP marks do not affect radiologist performance. While this seems to be the case for specificity (our main study aim), with only three positive data sets, we do not know whether this also holds true for reader sensitivity. The small number of positive studies is reflected by very large confidence limits in our sensitivity data. We also did not include 6- to 9-mm polyps, which was also an artificial stipulation for the study. However, our main aim was to test specificity in a low-prevalence population. We did include some abnormal studies so readers would not assume that the whole data set was normal, thus skewing their interpretation, but did not attempt to provide robust sensitivity data across a range of polyp sizes. Many data sets will be required to provide adequate power to test this while maintaining the low prevalence of abnormality in a screening population, which most CAD studies to date lack. It also follows that our data may not be directly applicable to other CAD systems with differing spectrums of FP marks.
We did demonstrate a positive correlation between increasing numbers of CAD FP marks and reporting times, although arguably this may have limited clinical significance: every additional CAD FP mark added 0.06 minutes (just under 4 seconds). This is probably inevitable, since time must be taken to analyze each CAD mark, although it seems likely that the benefit of CAD will outweigh the relatively minimal increase in reading time.
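The arithmetic behind these figures can be checked with a short sketch. It uses only the values reported in Results (regression coefficient and confidence interval in minutes per mark, and the mean unassisted and additional CAD reading times); the variable names are ours.

```python
# Values reported in Results (minutes)
coef_per_mark = 0.06            # regression coefficient, minutes per CAD FP mark
coef_ci = (0.02, 0.10)          # 95% confidence interval for the coefficient
unassisted_mean = 8.6           # mean unassisted reading time
cad_additional_mean = 3.6       # mean additional time for the CAD read


def to_seconds(minutes):
    """Convert minutes to seconds, rounded to one decimal place."""
    return round(minutes * 60, 1)


seconds_per_mark = to_seconds(coef_per_mark)
seconds_ci = tuple(to_seconds(limit) for limit in coef_ci)
percent_increase = round(100 * cad_additional_mean / unassisted_mean)

print(f"{seconds_per_mark} s per mark (95% CI: {seconds_ci[0]}, {seconds_ci[1]})")
print(f"CAD read adds {percent_increase}% to the unassisted reading time")
```

The coefficient corresponds to 3.6 seconds per additional mark, with a confidence interval of 1.2 to 6.0 seconds, and the additional CAD reading time is about 42% of the unassisted time.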
Even though all three polyps were correctly detected by using CAD, two of four readers failed to detect one 10-mm polyp, again emphasizing that observers may reject bona fide CAD prompts. The large polyp marked by CAD but missed by readers was not particularly subtle (although coated with tagged fluid). Why correct CAD detections are dismissed is unclear but is likely dependent in part on reader experience.
CAD also had a clinically unimportant (albeit significant) positive effect on reader confidence, suggesting readers were at least reassured by CAD that they had not missed anything. The use of a 100-point confidence scale (as opposed to defined categories) has been recommended for studies using ROC curves and was indeed the initial recommendation of the Breast Imaging Reporting and Data System (21). By its nature, this scale does not include actionable threshold levels (eg, the level required to trigger colonoscopy) but is a measure of a reader's certainty that a case is normal. However, even by using this 100-point scale, readers chose a finite number of confidence levels, explaining the number of data points on the ROC curves. We accept that a category-based scale would have been just as effective.
Our study had limitations. Our normal studies did not have colonoscopic and/or histologic correlation, although all were deemed radiologically normal by three radiologists (with a combined experience of over 2000 validated data sets) by using CAD. Importantly, one of these radiologists had no previous contact with the readers (it could otherwise be argued that the study readers were influenced by prior instruction given by the experienced radiologists). Furthermore, all reader FP marks were reviewed and their cause confidently determined so that even if the CT colonography data sets did contain occult neoplasia, this would not have affected the results. Although endoscopically verified CT colonographic data sets are now altruistically provided on the Internet, we wanted to use representative data sets from an ongoing screening program rather than risk potentially using hand-picked data sets provided online. It should be noted that lack of an independent histologic reference standard is common in lung and mammographic studies. We deliberately rejected a paired study design whereby readers analyzed the same data sets with differing numbers of CAD FP marks. Such a design risks recall bias and would be difficult to perform without explicitly alerting the readers to the study purpose.
Furthermore, by not artificially increasing CAD FP marks (eg, by changing the CAD filter settings or manually adding new prompts), we ensured the causes of the FP detections were similar across high- and low-prevalence data sets. An alternative study design would have been a randomized study. It could be argued that studies with more than 15 CAD FP marks were somehow intrinsically different from those with fewer. However, data sets were acquired from the same source, generated equivalent numbers of unassisted reader FP marks regardless of the subsequent number of CAD marks, and had all been graded as clinically adequate by three experienced radiologists. Ultimately, if studies with more than 15 CAD FP marks were of inferior technical quality to the others, this would act in favor of our null hypothesis. One important consideration is our inability to ensure that readers questioned every CAD mark, and it is possible some may have been overlooked. Many CAD systems allow readers to move automatically from CAD mark to CAD mark via a mouse click or keyboard key, which may be a more robust method of ensuring all marks are reviewed. Finally, all four readers expressed a preference for primary 2D analysis, and it may not be possible to fully extrapolate the data to those preferring primary 3D endoluminal review.
In summary, we found no evidence that increasing numbers of CAD FP marks adversely influenced either correct reader case classification or diagnostic confidence, but they did prolong reporting times.
| ADVANCE IN KNOWLEDGE |
|---|
| FOOTNOTES |
|---|
Abbreviations: AUC = area under ROC curve, CAD = computer-aided detection, CI = confidence interval, FP = false-positive, ROC = receiver operating characteristic, 3D = three-dimensional, 2D = two-dimensional
See Materials and Methods for pertinent disclosures.
Author contributions: Guarantor of integrity of entire study, S.A.T.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, S.A.T., P.J.P.; clinical studies, S.A.T., R.G., R.I., E.T., V.A.S., D.B., J.Z., P.J.P., S.H.; statistical analysis, P.B.; and manuscript editing, all authors.