Evidence-based Practice
1 From the Department of Specialist Radiology, University College Hospital and University College London, 235 Euston Rd, Podium Level 2, London NW1 2BU, England (C.R., S.H., S.A.T.); and the National Health Service Centre for Statistics in Medicine, Wolfson College Annex, Oxford, England (D.G.A., S.M.). Received January 18, 2007; revision requested March 22; revision received May 24; final version accepted July 6. S.H. and S.A.T. are remunerated consultants for Medicsight, and Medicsight has contributed to costs of other related research at the Centre for Statistics in Medicine. Address correspondence to S.H. (e-mail: s.halligan{at}ucl.ac.uk).
| ABSTRACT |
|---|
Materials and Methods: MEDLINE was searched to identify study articles meeting the inclusion criteria for describing CAD for CT colonography in human subjects. Data were extracted from eligible articles, grouped into five domains: technical description of CAD algorithm, description of subjects, acquisition of data, evaluation strategy used, and presentation of results. Primary studies were scored for each domain and overall findings plotted as star plots.
Results: Although 21 (91%) of the 23 studies included presented technical details of the CAD algorithm, methodologic details used for model development and validity were generally poor. Investigators in six (26%) studies described the evaluation data set sufficiently for replication; investigators in eight (35%) studies described age and sex demographics for subjects in whom CAD was tested. Investigators in 11 (48%) studies presented polyps per subject. Investigators in 12 (52%) studies described the reference standard against which CAD was judged; 11 (48%) studies explicitly distinguished between development and evaluation data. In nine (39%) studies, the evaluation strategy used to test CAD could not be deduced at all. Description of subjects included for CAD development and evaluation was most poorly reported, with an average score per study of 33% in this domain.
Conclusion: The reporting quality for studies of CAD for CT colonography is highly variable; key methodologic details needed for informed assessment of the generalizability of results are frequently omitted. A minimum data set for reporting, based on these observations, is proposed.
© RSNA, 2008
| INTRODUCTION |
|---|
Computer-aided detection (CAD) is effective when radiologists must detect small lesions that occur infrequently, for example during mammography (12) or with pulmonary nodules (13). CAD systems for CT colonography have become commercially available, and such systems may accelerate interpretation time and improve sensitivity for inexperienced observers (14–16). Potential purchasers will be most interested in the performance characteristics of such systems—namely, sensitivity and specificity for polyp detection at different diameter thresholds. Sensitivity and specificity are well-established parameters for assessment of CAD software, and their documentation is required for regulatory approval (17). However, many evaluation strategies can be used to ascertain these parameters, and some strategies are more stringent than others (18,19). In particular, strategies that test CAD software by using data not encountered before—data obtained from different institutions (external validation)—are superior to strategies in which the software has encountered the data before (cross-validation) or in which the data have been obtained from the same institution contributing development data (internal validation) (18).
Our anecdotal experience suggested that research articles on studies in which CAD for CT colonography was evaluated frequently omit key methodologic details. Such omissions prevent fully informed interpretation of the data presented, a deficiency also noted by others (20). Thus, the purpose of our study was to determine objectively the current standard of reporting for studies of CAD for CT colonography by systematically reviewing published articles.
| MATERIALS AND METHODS |
|---|
Study Selection
The electronic abstract for each publication identified was scrutinized by the same researcher (C.R.), and articles of primary studies were marked as potentially eligible if they apparently described assessment of the performance characteristics of CAD for CT colonography. We did not use search limits. We defined CAD as a software algorithm that made decisions about the presence or absence of polyps and cancers independent of human observers and that indicated the spatial location of perceived abnormalities to the observer.
To be eligible, studies must have been performed in vivo in humans. The focus was detection of polyps and cancer by using the combination of CT colonography and CAD. Any study with artificially or computer-generated polyps was excluded. There was no eligibility requirement for the type of CT data acquired (eg, number of detector rows or collimation). A reference test to confirm or reject polyps was not required because this was an outcome of the review. If eligibility was doubtful, the article was marked as potentially eligible.
Article titles and abstracts for potentially eligible studies were printed and then assessed for eligibility by two researchers (S.H., S.A.T.), each of whom had considerable experience with CT colonography in both clinical and research settings (8 years of experience of peer-reviewed publication in CT colonography research and more than 1000 individual subject examinations reported each). They also had prior experience with data extraction for systematic review and meta-analysis. All abstracts were assessed by each researcher working independently. The two researchers then made a short list of potentially eligible studies in consensus, the full printed articles of which were subsequently obtained for data extraction.
Data Extraction
Two researchers (S.H., S.A.T.) read the full texts of potentially eligible studies and extracted data independently onto a data sheet. The two extractions were compared and uncertainty was resolved by means of consensus. Extracted data related to several key details (described later in the article) that were defined in advance of the extraction (at the study protocol stage). There were five broad groupings: (a) description of the CAD algorithm, (b) description of the subjects in whom CAD was developed and evaluated, (c) acquisition and nature of the data for CAD development and evaluation, (d) description of the evaluation strategy used to validate CAD performance, and (e) presentation of results.
In general, CAD software must first be trained to recognize abnormalities and reject normal regions. This ability is achieved by exposure to known normal and abnormal data during a training or development phase (17). Once satisfactory performance is achieved, performance is assessed subsequently during a test or evaluation phase. Because performance may be influenced substantially by the nature of the data used for development and evaluation, we wished to determine if authors described clearly and fully the origin of the data, the data characteristics, and the evaluation strategy used to assess CAD. For example, patients with symptoms have larger polyps than do patients without symptoms, and larger polyps are more likely than small polyps to be detected by means of CAD. Also, patient characteristics may differ between hospitals, depending on referral patterns, so authors should describe the precise origin and nature of data. For example, overly optimistic estimates of CAD performance may arise when CAD is tested on the same data used for development or if development and evaluation data sets have a common origin (18).
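The concern about shared development and evaluation data can be illustrated with a minimal sketch. It assumes, hypothetically, that each examination carries a subject identifier and that a single data set is partitioned at random at the subject level; the function names, the 50% split, and the seed are illustrative choices and are not taken from any of the reviewed studies.

```python
import random

def split_subjects(subject_ids, eval_fraction=0.5, seed=0):
    """Randomly partition one data set into development and evaluation subsets.

    Splitting by subject (not by polyp) keeps every polyp from one person
    on the same side of the partition.
    """
    ids = sorted(set(subject_ids))
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_eval = round(len(ids) * eval_fraction)
    evaluation = set(ids[:n_eval])
    development = set(ids[n_eval:])
    return development, evaluation

def shares_subjects(development, evaluation):
    """True if any subject appears in both subsets (a source of optimism)."""
    return bool(set(development) & set(evaluation))

# Hypothetical data set of 20 subjects numbered 1 to 20.
dev, ev = split_subjects(range(1, 21))
print(len(dev), len(ev), shares_subjects(dev, ev))
```

A split of this kind still draws both subsets from one institution, so it corresponds to internal validation only; it says nothing about performance on data from elsewhere.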
For each primary study, we noted if data origin was described in the article, whether age and sex demographics were described in the article, and whether data set composition could be replicated by other researchers (eg, was the proportion of subjects at screening or with symptoms stated? Were patients without polyps included?). Because the technical quality of data may affect CAD performance, we noted whether details of CT collimation and milliampere-second value were described and whether tagged bowel preparation was used. We noted if the reference test to validate polyps was described and whether polyp number and size were presented. We also noted whether per-subject details of polyp number and size were described because data clustering (ie, individual subjects with multiple polyps) could influence CAD performance, either positively or negatively, depending on the quality of the data.
We noted whether there was a comprehensive technical description of how the CAD software functioned (eg, how the classifier worked). We also extracted data relating to the evaluation strategy. We noted whether articles of primary studies were explicit regarding use of cross-validation, internal validation, or external validation (18,21). For studies in which cross-validation was apparently used (development and evaluation data were the same), we determined whether this information was explicit in the article. For studies in which other evaluation strategies were apparently used, we determined whether a clear distinction was made between development and evaluation data. We determined whether articles of primary studies were explicit regarding how a reference standard was achieved (to judge CAD against) and whether precise criteria for a true-positive CAD mark were described. We also noted whether human decision making was investigated.
Concerning data analysis and presentation, we noted if per-polyp and per-subject sensitivity and the false-positive rate for CAD were presented. We determined whether numerators and denominators were presented for per-polyp and per-subject analyses and whether receiver operating characteristic (ROC) graphs and analyses were available. Because ROC graphs incorporate sensitivity and specificity, we considered these estimates to have been reported whenever ROC graphs were encountered, but we made a separate assessment of whether individual numerators and denominators for sensitivity were presented.
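The difference between per-polyp and per-subject estimates, and the value of reporting numerators and denominators, can be shown with a short sketch. The records below are invented for illustration, and the per-subject rule used here (a subject with polyps counts as detected if CAD marks at least one of them) is an assumption, not a definition taken from the reviewed articles.

```python
# Hypothetical evaluation data: one record per subject.
subjects = [
    {"polyps": 3, "detected": 2, "false_positives": 4},
    {"polyps": 1, "detected": 1, "false_positives": 0},
    {"polyps": 0, "detected": 0, "false_positives": 2},  # subject without polyps
    {"polyps": 2, "detected": 0, "false_positives": 6},
]

def per_polyp_sensitivity(records):
    """Return (numerator, denominator): detected polyps over all polyps."""
    num = sum(r["detected"] for r in records)
    den = sum(r["polyps"] for r in records)
    return num, den

def per_subject_sensitivity(records):
    """Return (numerator, denominator) over subjects who have at least one polyp.

    Assumption: a subject is 'detected' if at least one polyp is marked.
    """
    positive = [r for r in records if r["polyps"] > 0]
    num = sum(1 for r in positive if r["detected"] > 0)
    return num, len(positive)

def mean_false_positives(records):
    """Mean number of false-positive CAD marks per subject."""
    return sum(r["false_positives"] for r in records) / len(records)

pp_num, pp_den = per_polyp_sensitivity(subjects)
ps_num, ps_den = per_subject_sensitivity(subjects)
print(f"per-polyp sensitivity:   {pp_num}/{pp_den}")
print(f"per-subject sensitivity: {ps_num}/{ps_den}")
print(f"false positives per subject: {mean_false_positives(subjects):.1f}")
```

With these invented records the two sensitivities differ (3/6 versus 2/3) because one subject contributes several polyps, which is the clustering effect that per-subject reporting makes visible.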
Statistical Analysis
Raw frequencies for each of the details investigated were calculated. To present results graphically, we grouped extracted data into the five domains noted and scored each domain individually so that an indication of study quality (defined as methodologic excellence and completeness of data reporting) could be ascertained for each domain. Scores for primary studies were scaled so that they were directly comparable between domains (0 being worst and 100 best). Star plots, with the length of each spoke (longer being better) representing the score achieved by that study for each domain, were plotted (22) by using commercially available software (Stata 8.0; StataCorp, College Station, Tex). There was no attempt at meta-analysis, nor was meta-analysis ever intended (see Discussion).
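As an illustration of the scaling step, the following sketch rescales raw domain scores to the 0 to 100 range and computes the spoke endpoints of a star plot. The plots in the review were produced with Stata; this Python version is only an approximation of the idea. The number of items per domain, the example study, and the even angular spacing of spokes are all assumptions made for the example.

```python
import math

def scale_score(raw, max_raw):
    """Rescale a raw domain score so that 0 is worst and 100 is best."""
    if max_raw <= 0:
        raise ValueError("max_raw must be positive")
    return 100.0 * raw / max_raw

def spoke_endpoints(scores):
    """Return (x, y) endpoints of evenly spaced spokes, one per domain.

    Spoke length equals the scaled score, so longer spokes mean better reporting.
    """
    n = len(scores)
    return [
        (s * math.cos(2 * math.pi * k / n), s * math.sin(2 * math.pi * k / n))
        for k, s in enumerate(scores)
    ]

# Hypothetical study: (items reported, items assessed) for each of five domains.
raw = {
    "CAD algorithm": (1, 1),
    "Subjects": (1, 4),
    "Data acquisition": (3, 4),
    "Evaluation strategy": (2, 5),
    "Results": (4, 5),
}

scaled = {domain: scale_score(r, m) for domain, (r, m) in raw.items()}
for domain, score in scaled.items():
    print(f"{domain}: {score:.0f}")

points = spoke_endpoints(list(scaled.values()))
```

Dividing by the maximum attainable score in each domain is what makes domains with different numbers of items directly comparable on one plot.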
| RESULTS |
|---|
Description of the CAD Algorithm
Authors of 21 (91%) studies provided a technical description of the CAD software. The star plot details how comprehensive this description was for each of those 21 primary studies (Fig 1a).
Cross-validation appeared to have been used in 12 (52%) studies, so development and evaluation data were necessarily the same. In two (9%) studies (16,30), authors indicated explicitly that internal validation was used, with randomization of a single data set used to create two separate development and evaluation data sets. In nine (39%) studies in which the evaluation strategy remained unclear (25–29,31–33,36), we could not make deductions about the nature of development data. Temporal validation might have been used in two studies (28,29), but in neither were the demographics of development data described; however, in one study (28), authors indicated explicitly that development data were "developed in an earlier study" and that "the patient population was different in that study." Overall, composition of the evaluation data set was described sufficiently well to be replicated by other researchers in only six (26%) studies (16,28,29,31,32,35).
Acquisition and Nature of Data for CAD Development and Evaluation
Both the CT detector-row collimation and milliampere-second values used to obtain evaluation data were described in 19 (83%) studies, with one of these parameters described in the four remaining studies (24,25,42,43).
Investigators in 22 (96%) studies stated that polyps or cancers were validated by means of endoscopy, while in the remaining study investigators indicated that polyps were "known" (42). Although the number or size range of polyps used for evaluation was indicated for 22 studies (96%), per-subject details were presented for only 11 (48%) (16,24,25,28,29,31,34,36,37,39,41).
Description of the Evaluation Strategy Used to Validate CAD Performance
Eleven (48%) primary studies clearly distinguished between development and evaluation data, either by means of an explicit statement that the same data were used for cross-validation (23,24,34,35,38–40,42,44) or by indicating explicitly that the data sets used for development and evaluation were different (16,30). The evaluation strategy could be deduced from the article in only 14 (61%) cases; cross-validation was used in 12 (52%) (23,24,34,35,37–44) and internal validation in two (9%) (16,30). In nine (39%) articles, the precise strategy used for evaluation was unclear (25–29,31–33,36).
The precise method used to establish a reference standard against which to judge CAD output was described for 12 (52%) studies (16,23,27–29,34–36,38,41,43,44). The conditions for a true-positive CAD prompt were described for nine (39%) studies (16,27–29,34–36,43,44) (eg, "detected polyps that were within 10 mm of their actual location were identified as true-positive" [36]).
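A distance-based true-positive criterion such as the 10-mm example quoted above can be sketched as follows. The sketch assumes, hypothetically, that CAD marks and reference polyp locations are available as (x, y, z) coordinates in millimeters in a common frame, and that several marks on one polyp count as a single detection; neither assumption comes from the cited study.

```python
import math

def is_true_positive(mark, polyp, max_distance_mm=10.0):
    """True if a CAD mark lies within max_distance_mm of a reference polyp."""
    return math.dist(mark, polyp) <= max_distance_mm

def classify_marks(marks, polyps, max_distance_mm=10.0):
    """Return (number of polyps detected, number of false-positive marks)."""
    detected = set()
    false_positives = 0
    for mark in marks:
        hits = [
            i for i, polyp in enumerate(polyps)
            if is_true_positive(mark, polyp, max_distance_mm)
        ]
        if hits:
            detected.update(hits)      # repeated marks on one polyp count once
        else:
            false_positives += 1
    return len(detected), false_positives

# Hypothetical coordinates in millimeters.
polyps = [(10.0, 10.0, 10.0), (50.0, 50.0, 50.0)]
marks = [(12.0, 11.0, 10.0), (100.0, 0.0, 0.0), (14.0, 10.0, 10.0)]

n_detected, n_false = classify_marks(marks, polyps)
print(n_detected, n_false)
```

Changing `max_distance_mm` changes both sensitivity and the false-positive count, which is why the criterion needs to be stated for results to be comparable between studies.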
Only two (9%) studies incorporated human observers: one (28) made an indirect assessment of CAD benefit by means of comparison with responses from unaided observers, and the other (29) assessed the effect of CAD on decision making directly. The articles for both these studies described the experience of the human observers adequately (28,29).
Presentation of Results
Only two (9%) articles (25,32) failed to present per-polyp sensitivity for CAD, but individual denominators and numerators were unavailable in a further five (22%) articles (23,24,27,38,43). Only eight (35%) articles presented per-subject sensitivity for CAD (16,28,29,34,36–39). Per-subject specificity for CAD was described in 21 (91%) articles; two had no estimate of the false-positive rate (31,32). A ROC analysis was presented in 13 (57%) articles (16,23,24,27–29,34,36–39,43,44).
Overall Summary
The star plots (Fig 1) indicate graphically the overall quality of the 23 primary studies. Overall quality varied widely. At least one primary study failed to score in each of the five domains assessed. The CAD description was the best-reported domain. Description of subjects was the domain with poorest reporting; only one-third of studies scored 50% or more, and the average score per study was 33%. Eight (35%) studies failed to score in this domain at all.
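The raw frequencies quoted in the Results can be checked for arithmetic consistency against the total of 23 studies. The sketch below uses only counts and percentages stated in the text; the labels and the rounding to the nearest whole percent are our own assumptions about how the figures were derived.

```python
N_STUDIES = 23

# (count, percentage as printed) for frequencies stated in the text.
reported = {
    "technical description of CAD": (21, 91),
    "evaluation data set replicable": (6, 26),
    "age and sex demographics": (8, 35),
    "per-subject polyp details": (11, 48),
    "reference standard described": (12, 52),
    "evaluation strategy unclear": (9, 39),
    "human observers included": (2, 9),
    "collimation and mAs described": (19, 83),
    "endoscopic validation": (22, 96),
    "evaluation strategy deducible": (14, 61),
    "numerators and denominators missing": (5, 22),
    "ROC analysis presented": (13, 57),
}

def percent(count, total=N_STUDIES):
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)

# Collect any entry whose recomputed percentage differs from the printed one.
mismatches = {
    label: (count, printed, percent(count))
    for label, (count, printed) in reported.items()
    if percent(count) != printed
}
print(mismatches)
```

An empty result means that every printed percentage agrees with its count under this rounding rule.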
| DISCUSSION |
|---|
CAD performance is influenced heavily by the type of subjects in whom the software is developed and evaluated, yet we found that the worst average domain score was for subject descriptions; eight (35%) primary studies failed to score at all. This finding is particularly important because software should be evaluated in subjects representative of those who might be expected to undergo the test in practice. Evaluating CAD in subjects with symptoms yet proposing a role in screening is an obvious example of failure to do so. The strategy used to validate the software will also influence results (eg, internal cross-validation is likely to deliver more optimistic estimates than is external validation), yet the precise evaluation strategy used could not be deduced for 39% of primary studies, and this domain achieved the second lowest score overall. Presentation of results was also frequently incomplete. Other authors have stated that such information is desirable (20), and, to our knowledge, our review quantifies these deficiencies objectively for the first time. The Standards for Reporting of Diagnostic Accuracy (46) initiative has stressed the importance of a full methodologic description so that readers can judge the potential for bias and appraise the generalizability of study findings. Likewise, the Quality Assessment of Studies of Diagnostic Accuracy Included in Systematic Reviews (47) initiative strives to assess the quality of primary studies included in systematic reviews of diagnostic tests.
This systematic review was precipitated by our experience that articles describing the performance of CAD software frequently omitted key methodologic details of the assessment and instead presented long and detailed explanations of the software algorithm (21 of the 23 articles we identified did so). However, details of the algorithm have no real effect on informed assessment of its clinical performance. Although an understanding of how the software works is interesting from a scientific perspective, we propose that whether it achieves its stated aim (ie, polyp detection with high sensitivity and specificity in day-to-day clinical practice) is more important to end users, most of whom will not be conversant with how the algorithms work or differ from one another. We therefore argue that full methodologic presentation best equips the reader to identify biases that could result in overly optimistic estimates of software performance. When we ignored technical descriptions of the software and focused on more relevant domains, only one study (16) achieved a maximum score. Although the study by Summers and co-workers (16) provided scant details of the algorithm, the descriptions of the data, evaluation strategy, analysis, and presentation were the highest achieved in our review, and this article is a good example of optimal study presentation. In the same way that Dachman and Zalis (48) proposed minimum standards for performing and reporting patient studies of CT colonography and Halligan and co-workers (49) proposed minimum standards for study-level reporting of CT colonography, we wish to propose, on the basis of this review, a minimum data set for study-level reporting of CAD in CT colonography (Fig 2).
In summary, results of our systematic review indicate that the reporting quality of studies of CAD for CT colonography is highly variable. Authors frequently omit key methodologic details needed for an informed assessment of the generalizability of results. More comprehensive data presentation is highly desirable, and we propose a minimum data set to help achieve this goal.
| FOOTNOTES |
|---|
Abbreviations: CAD = computer-aided detection, ROC = receiver operating characteristic
Author contributions: Guarantor of integrity of entire study, S.H.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, all authors; statistical analysis, S.H., S.M., D.G.A.; and manuscript editing, all authors.