Special Reports
1 From the Institute for Research in Extramural Medicine, VU University Medical Center, Van der Boechorststraat 7, 1081 BT Amsterdam, the Netherlands (N.S., D.A.W.M.v.d.W., R.W.J.G.O., L.M.B., H.C.W.d.V.); and Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, Amsterdam, the Netherlands (A.W.S.R., J.B.R., P.M.B.). Received March 12, 2004; revision requested May 21; revision received August 19; accepted October 1. Supported by grants from the Medical Sciences–Netherlands Organisation for Scientific Research (ZON-MW). Address correspondence to N.S. (e-mail: n.smidt@vumc.nl).
| ABSTRACT |
|---|
MATERIALS AND METHODS: English-language articles on primary diagnostic accuracy studies in 2000 were identified with validated search strategy in MEDLINE. Articles published in journals with impact factor of 4 or higher that regularly publish articles on diagnostic accuracy were selected. Two independent reviewers evaluated quality of reporting by using STARD statement, which consists of 25 items and encourages use of a flow diagram. Total STARD score for each article was calculated by summing number of reported items. Subgroup analyses were performed for study design (case-control or cohort study) by using Student t tests for continuous outcomes and χ² tests for dichotomous outcomes.
RESULTS: Included were 124 articles published in 2000 in 12 journals: 33 case-control and 91 cohort studies. Only 41% of articles (51 of 124) reported on more than 50% of STARD items, while no articles reported on more than 80%. A flow chart was presented in two articles. Assessment of reporting on individual items of STARD statement revealed wide variation, with some items described in 11% of articles and others in 92%. Mean STARD score (0–25 points available) was 11.9 (range, 3.5–19.5). Mean difference in STARD score between cohort studies and case-control studies was 1.53 (95% confidence interval: 0.24, 2.82).
CONCLUSION: Quality of reporting in diagnostic accuracy articles published in 2000 is less than optimal, even in journals with high impact factor. Authors, editors, and reviewers should pay more attention to reporting by checking STARD statement items and including a flow diagram to represent study design and patient flow.
Supplemental material: radiology.rsnajnls.org/cgi/content/full/2352040507/DC1
© RSNA, 2005
| INTRODUCTION |
|---|
In 1999, Lijmer et al (1) demonstrated that case-control studies with healthy control subjects led to overestimation of diagnostic accuracy, compared with that in cohort studies. Furthermore, knowledge of the results of the index test and the use of clinical information about the study population when interpreting the reference standard resulted in an overestimation of diagnostic accuracy (1). Therefore, complete and accurate reporting is essential to judge the potential for bias and to assess the generalizability of results.
The first checklist for reporting of diagnostic accuracy studies was published by Bruns et al (5) in October 2000. In January 2003, guidelines for reporting studies of diagnostic accuracy (the Standards for the Reporting of Diagnostic Accuracy, or STARD) were published simultaneously in eight medical journals (Radiology, American Journal of Clinical Pathology, Annals of Internal Medicine, British Medical Journal, Clinical Biochemistry, Clinical Chemistry, Clinical Chemistry and Laboratory Medicine, and Lancet) (6,7). Similar guidelines for the reporting of randomized controlled trials (the Consolidated Standards for Reporting of Trials, or CONSORT), systematic reviews (the Quality of Reporting of Meta-analyses, or QUOROM), and observational studies (the Meta-analysis of Observational Studies in Epidemiology, or MOOSE) already exist (8–10).
After publication of the CONSORT statement, Moher et al (11) evaluated the quality of reports of 211 randomized controlled trials published in British Medical Journal, JAMA, Lancet, and the New England Journal of Medicine by using the CONSORT checklist. They concluded that the use of the CONSORT statement is associated with improvements in the quality of reports of randomized controlled trials (11). The presentation of a flow diagram was also associated with improved quality of reporting of randomized controlled trials (12).
Although Reid et al (4) had pointed out the poor quality of reporting in the 1990s, it is possible that the reporting has improved in more recent articles. Therefore, this study was designed to evaluate the quality of reporting in articles on diagnostic accuracy published in 2000 in journals with an impact factor of at least 4 by using the items of the STARD statement published later in 2003.
| MATERIALS AND METHODS |
|---|
Study Selection
Articles were included if they reported on primary studies of diagnostic accuracy, in which the results of one or more tests were compared with the findings obtained with a reference standard in the same study population. Two reviewers (N.S., A.W.S.R.) independently assessed the title, abstract, and keywords of all eligible articles to determine whether they met the inclusion criteria. If there was any doubt, the full text of the article was retrieved and read by both reviewers. Disagreements were discussed and resolved in a consensus meeting.
Data Extraction
The STARD statement was used to assess the quality of reporting. The statement contains a list of 25 items and encourages the use of a flow diagram to represent the design of the study and the flow of patients through the study (6,7). For this assessment, the reviewers had to determine whether each item of the checklist was described adequately in the text. Note that the reviewers were not evaluating the likelihood of bias but only the quality of reporting. Two reviewers independently evaluated the quality of reporting in the included articles. One reviewer (N.S.) assessed all articles, and four other reviewers (A.W.S.R., H.C.W.d.V., D.A.W.M.v.d.W., R.W.J.G.O.) each evaluated a quarter of all the articles. Disagreements were discussed and resolved in a consensus meeting. If consensus could not be reached, a third reviewer made the final decision.
Statistical Analysis
For each item in the STARD statement, the total number of articles reporting the elements mentioned in that item is presented. A total STARD score for each article was calculated by summing the number of reported items (0–25 points available). Higher scores indicated better quality of reporting. Equal weights were applied to each of the items. Six items (items 8, 9, 10, 11, 13, and 24) concern the index tests, as well as the reference standard. Weights for these items were assigned to both the index test (0.5 point) and the reference standard (0.5 point) and evaluated separately. The overall mean and standard deviation of the total STARD scores are presented.
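The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stard_score`, the item labels ("8a" for the index-test half and "8b" for the reference-standard half of a split item), and the example article are all hypothetical and do not come from the study data.

```python
# Items that cover both the index test (part "a") and the reference
# standard (part "b"); each part is worth 0.5 point, as described in the text.
SPLIT_ITEMS = {8, 9, 10, 11, 13, 24}


def stard_score(reported):
    """Return the total STARD score (0-25) for one article.

    `reported` is a hypothetical mapping from item label to True/False.
    Undivided items use labels such as "1" or "25"; split items use
    labels such as "8a" (index test) and "8b" (reference standard).
    Labels missing from the mapping count as not reported.
    """
    score = 0.0
    for item in range(1, 26):
        if item in SPLIT_ITEMS:
            for part in ("a", "b"):
                if reported.get(f"{item}{part}", False):
                    score += 0.5
        elif reported.get(str(item), False):
            score += 1.0
    return score


# Made-up article: undivided items 1-7 and 12 reported, both halves of
# items 8-11 reported, and only the index-test half of item 13 reported.
example = {str(i): True for i in range(1, 13) if i not in SPLIT_ITEMS}
example.update({f"{i}{p}": True for i in (8, 9, 10, 11) for p in "ab"})
example["13a"] = True
print(stard_score(example))
```

Because every undivided item counts 1 point and every half of a split item counts 0.5 point, an article that reports everything reaches exactly 25 points.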
Subgroup analyses were performed to compare the quality of reporting among different journals and designs (case-control and cohort studies). Cohort studies are characterized by selection of subjects who underwent the index test, whereas in case-control studies, the subjects are selected on the basis of the results of the reference standard (14). Student t tests (independent samples) were used to calculate mean differences between the total STARD score of case-control and cohort studies. In addition, χ² tests were used to calculate differences between the number of articles reporting the items of the STARD statement in case-control and cohort studies. If the assumptions of the χ² tests were not met, the Fisher exact test was used. Differences in total STARD scores between the 12 journals were calculated by means of pairwise comparisons (Tukey honestly significant difference test). P values of less than .05 were considered to indicate a statistically significant difference. Statistical analysis was performed (N.S.) by using SPSS for Windows (release 11.0.1; SPSS, Chicago, Ill).
| RESULTS |
|---|
Quality of Reporting in Diagnostic Articles
Interrater agreement on the items of the STARD statement was good (overall agreement, 81.3%; κ statistic, 0.62). In six articles, disagreements between two reviewers could not be resolved, and the decision was made by one of the other reviewers. Most disagreements were caused by poor reporting of the design or doubts about the identity of the index and/or reference test. The time needed to perform the quality assessment was approximately 1 hour for each article.
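For readers less familiar with the κ statistic, the sketch below shows how overall agreement and κ relate for two raters making a yes/no judgment on each item. It assumes the usual Cohen formulation, which the text does not state explicitly; the function name `cohen_kappa` and the counts in the example are made up for illustration and are not the study data.

```python
def cohen_kappa(both_yes, a_only, b_only, both_no):
    """Return (observed agreement, Cohen's kappa) for two raters.

    The four arguments are the cells of the rater-by-rater table:
    both raters scored "reported", only rater A did, only rater B did,
    and both scored "not reported".
    """
    n = both_yes + a_only + b_only + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_only) / n   # proportion scored "reported" by rater A
    b_yes = (both_yes + b_only) / n   # proportion scored "reported" by rater B
    # Agreement expected by chance if the two raters judged independently.
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Made-up counts for 100 item judgments.
observed, kappa = cohen_kappa(both_yes=40, a_only=10, b_only=10, both_no=40)
print(round(observed, 2), round(kappa, 2))
```

With these hypothetical counts, 80% raw agreement corresponds to a κ of 0.6, which illustrates why κ is lower than the overall agreement percentage: half of the agreement here would be expected by chance alone.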
Overall, the items of the STARD statement were poorly reported. The mean STARD score of the 124 articles was 11.9 (standard deviation, 3.3). Only 41% (51 of 124) of the articles reported more than 50% of the items (STARD score ≥ 12.5), and none of them reported more than 80% (STARD score ≥ 20). A flow chart was reported in only two articles (2%). The quality of the reporting of the items of the STARD statement for each article separately is presented in the online Appendix E1 (radiology.rsnajnls.org/cgi/content/full/2352040507/DC1; for further information, contact N.S. at n.smidt@vumc.nl).
STARD Statement
The overall quality of the reporting of the items of the STARD statement in the articles is presented in Table 2. There is a broad variation in the quality of the reporting of these items (11%–92%). Poorly (<20%) reported items were (a) identification of the article as a study of diagnostic accuracy (item 1), (b) methods used for calculating or comparing measures of diagnostic accuracy (item 12), (c) methods used for calculating test reproducibility (item 13), (d) adverse events from performing the test(s) (item 20), and (e) estimates of test reproducibility of the reference standard (item 24b). The best reported item was discussion of the clinical applicability of the study findings (item 25). For each section (title, abstract, and keywords; introduction; methods; results; and discussion) of the STARD statement, the most remarkable findings are discussed as follows.
The STARD statement recommends the use of the Medical Subject Headings (MeSH) term "sensitivity and specificity." In this search, 686 (78%) of the 884 articles were identified by this MeSH term. However, only 100 of the 686 articles actually concerned a diagnostic accuracy study (positive predictive value, 15%). Nevertheless, the sensitivity of this search term was high, with 81% (100 of 124) of the included articles being identified correctly in MEDLINE.
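The two percentages in this paragraph follow directly from the counts given. As a minimal sketch (the function and variable names are illustrative, not part of the study), the positive predictive value and the sensitivity of the MeSH term can be recomputed as follows.

```python
def search_term_performance(retrieved, relevant_retrieved, relevant_total):
    """Return (positive predictive value, sensitivity) of a search term.

    retrieved          -- articles identified by the term
    relevant_retrieved -- identified articles that were true diagnostic accuracy studies
    relevant_total     -- all included diagnostic accuracy studies
    """
    ppv = relevant_retrieved / retrieved
    sensitivity = relevant_retrieved / relevant_total
    return ppv, sensitivity


# Counts taken from the paragraph above: 686 articles carried the MeSH term,
# 100 of them were included, and 124 articles were included in total.
ppv, sensitivity = search_term_performance(686, 100, 124)
print(f"PPV {ppv:.0%}, sensitivity {sensitivity:.0%}")
```

Note that the sensitivity here is relative to the 124 included articles, as in the text, not to all diagnostic accuracy studies in MEDLINE.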
Introduction (item 2). In 90% of all articles (112 of 124), the research question became clear after reading the abstract and introduction (item 2). However, information regarding the index tests, the reference standard, and the target condition was scattered throughout the text. Only 32% of the articles (40 of 124) mentioned the index test, the reference standard, and the target condition in their research question. In many articles, the reference standard was lacking in the formulation of the research question (64%, 79 of 124).
Methods (items 3–13). Only 28% of all articles (35 of 124) reported the inclusion and exclusion criteria, the setting, and the location where the data were collected (item 3). This low percentage was mainly due to the absence of exclusion criteria (69 of the 124 articles [56%]). The inclusion criteria were relatively well reported (108 of 124; 87%), but only 56% of the articles (70 of 124) reported how patients were selected (item 5). A consecutive series of patients was apparently included in 36% of the studies (45 of 124). The reference standard and its rationale were reported clearly in 57% of the articles (item 7). In 40% of the articles (50 of 124), the reference standard was reported but its rationale was not, while in four articles (3%), the identity of the reference standard remained unclear. Information concerning the index test was better reported than that for the reference standard (items 8–13 and 24). In particular, information regarding the number and training of the persons executing and evaluating the reference test(s) and the blinding of the readers to the tests was reported poorly (items 10 and 11).
Only 37% of the articles (46 of 124) clearly reported whether the results of the reference standard and clinical information about the study population were given to the readers of the index test (item 11a). In most articles (62%, 77 of 124), information regarding the revelation of clinical information about the study population to the readers of the index test was lacking. If it was reported clearly that the index test was performed before the reference test, we assumed that the readers of the index test had been blinded to the results of the reference test. Information regarding the revelation of the results of the index test, other tests, or clinical information about the study population to the readers of the reference standard was reported in only 18% of the articles (23 of 124) (item 11b).
The methods for calculating measures of diagnostic accuracy, such as sensitivity, specificity, likelihood ratios, diagnostic odds ratios, and receiver operating characteristic curves, were reported in 65% of the articles (81 of 124). Only 14% of the articles (17 of 124) adequately reported the statistical methods used to calculate measures of diagnostic accuracy, particularly with regard to the quantification of estimates of the diagnostic accuracy (eg, 95% confidence limits, item 12). Methods used to study the reproducibility of the index test and the reference standard were reported poorly, by only 16% (20 of 124) and 5% (six of 124) of the articles, respectively (item 13). Six articles (5%) referred to previous research on the reproducibility of the test(s).
Results (items 14–24). Clinical and demographic characteristics, such as age and sex of the study population and the spectrum of the symptoms at presentation, were reported clearly in 52% of the articles (65 of 124, item 15). Less frequently reported clinical characteristics were co-morbidity (20 of 124, 16%) and current treatments (33 of 124, 27%).
Eighty-three percent of the articles (103 of 124) reported the number of participants who met the inclusion criteria and those who did or did not undergo the index test and reference standard. Seventy-five (60%) articles explained why participants failed to undergo one or more of the tests (item 16). In 43 of the 75 articles, however, none of the participants failed to undergo the index test or reference standard. A flow diagram, describing the design of the study and the number of participants, was presented in only two articles (2%).
Information about the time interval between the index test and the reference standard and about the treatment administered between the tests was given in 33 (27%) articles (item 17). Twenty-two of these 33 articles did not report on the treatment between the tests, but the time interval between the tests was so small that treatment could not have affected the results of the second test.
Although 109 of 124 articles reported estimates of diagnostic accuracy (eg, sensitivity and specificity), 29 of these gave no information about the number of true-positive, true-negative, false-positive, and false-negative findings. Thirty-two percent of the articles (40 of 124) reported statistical uncertainty (ie, 95% confidence intervals) for the measures of diagnostic accuracy (item 21).
Discussion (item 25). Most articles (114 of 124, 92%) discussed the clinical applicability of the study findings. In addition to scoring the items of the STARD statement, the reviewers were asked to compose a 2 × 2 table for each article. This was possible for 73% of the articles (91 of 124). However, true-positive and true-negative findings often had to be deduced from the results of sensitivity and specificity, which implied that the number of indeterminate or missing results had to be ignored in the reconstruction of the 2 × 2 table.
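The kind of reconstruction described here can be sketched as follows, assuming an article reports sensitivity, specificity, and the numbers of participants with and without the target condition. The function name and the example figures are hypothetical; the sketch is not the reviewers' actual procedure.

```python
def reconstruct_two_by_two(sensitivity, specificity, n_diseased, n_non_diseased):
    """Deduce the cells of a 2 x 2 table from reported accuracy measures.

    Returns (true positives, false positives, false negatives, true negatives).
    Indeterminate or missing test results cannot be recovered this way and
    are implicitly ignored, which is the limitation noted in the text.
    """
    tp = round(sensitivity * n_diseased)
    fn = n_diseased - tp
    tn = round(specificity * n_non_diseased)
    fp = n_non_diseased - tn
    return tp, fp, fn, tn


# Made-up article: sensitivity 90%, specificity 80%,
# 50 participants with and 100 without the target condition.
print(reconstruct_two_by_two(0.90, 0.80, 50, 100))
```

Rounding is needed because published sensitivity and specificity are themselves rounded, so the deduced counts may differ slightly from the true ones.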
Subgroup Analysis
Results of subgroup analyses showed that the quality of reporting for case-control studies was not as good as that for cohort studies (Table 2). The mean STARD score ± standard deviation was 12.4 ± 3.0 for the 91 cohort studies and 10.8 ± 3.7 for the 33 case-control studies. The mean difference in STARD score between cohort studies and case-control studies was 1.5 (95% confidence interval: 0.2, 2.8). Large differences (≥15%) in the quality of reporting between cohort and case-control studies were found for the following items: (a) participant sampling (item 5); (b) definition of and rationale for the units, cutoffs, and/or categories of the results of the reference standard (item 9b); (c) the number, training, and expertise of the persons executing and evaluating the tests (items 10a and 10b); (d) recruitment period (item 14); (e) time interval between the index tests and the reference standard and any treatment administered between the tests (item 17); (f) adverse events of the tests (item 20); (g) how indeterminate results, missing data, and outliers of the index tests were handled (item 22); and (h) estimates of reproducibility of the index test (item 24a). Statistically significant differences (P < .05) between case-control and cohort studies were found for the following items: participant sampling (item 5); number, training, and expertise of the persons executing and evaluating the reference standard (item 10b); and the handling of indeterminate results (item 22). Only 27% of the case-control studies (nine of 33) adequately reported on at least 50% of the items, while 46% of the cohort studies (42 of 91) reported on more than 50% of the items.
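The mean difference and its confidence interval can be approximated from the summary statistics in this paragraph. The sketch below assumes the equal-variance (pooled) form of the independent-samples Student t test and uses 1.98 as an approximate table value for the two-sided 95% critical value with 122 degrees of freedom; both are assumptions, since the text does not give these details. Because the inputs are rounded means and standard deviations, the result is close to, but not identical with, the reported 1.5 (0.2, 2.8).

```python
import math


def mean_difference_ci(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Return (difference, lower, upper) for a pooled-variance t interval."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff, diff - t_crit * standard_error, diff + t_crit * standard_error


# Rounded summary statistics from the text: cohort studies (n = 91) versus
# case-control studies (n = 33). t_crit = 1.98 is an approximate table
# value for 122 degrees of freedom (assumption).
diff, lower, upper = mean_difference_ci(12.4, 3.0, 91, 10.8, 3.7, 33, t_crit=1.98)
print(f"{diff:.1f} ({lower:.2f}, {upper:.2f})")
```

The width of the interval obtained this way (about 2.6 points) matches the width of the reported interval, which suggests that the small shift comes from rounding of the group means.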
Mean STARD score and standard deviations are presented for each journal in Table 3. The mean STARD score varied from 9.8 in the British Medical Journal to 15.5 in JAMA. However, none of the pairwise comparisons were statistically significant.
| DISCUSSION |
|---|
First, we strongly recommend the use of a flow diagram, because for most of the articles, the reviewers had to spend a considerable amount of time identifying the index test and the reference standard, the sequences of performing these tests, and the number of patients who underwent each test. Second, accurate identification of articles on diagnostic accuracy in the literature is important; therefore, uniform terms (MeSH headings) should be used in keywords, titles, or abstracts. Just as clinical trials are labeled as a specific type of publication in MEDLINE (PubMed), studies on diagnostic accuracy should also be labeled as a specific type of publication. The STARD group proposed systematic use of the MeSH term "sensitivity and specificity," because this is indicative of a study on diagnostic accuracy and is a term that has been used frequently in the past. Moons and Harrell (15) suggested use of the term "posttest probability," because studies on diagnostic accuracy do not necessarily have to determine sensitivity and specificity. However, "posttest probability" is not yet registered as a MeSH term. We recommend that "diagnostic accuracy" be used as a publication type and that "posttest probability" be added as a new MeSH term, in addition to "sensitivity and specificity."
The STARD statement focuses on the quality of reporting, not the methodologic quality of a diagnostic study. For example, if the authors stated that the reviewers of the reference standard were not blinded to the results of the index test, we considered item 11 to be well reported, even though this indicates a potential methodologic shortcoming. We believe that there is a positive association between the methodologic quality of a study and the quality of reporting. It is easier to report on a well-performed study than on a study that was poorly designed or in which a large number of protocol deviations occurred. Moreover, in the latter case, the authors may be less inclined to report in detail what happened. Increased attention to the quality of reporting and strict requirements for reporting in journals might, in the long term, thus also improve the methodologic quality of diagnostic research.
Lijmer et al (1) showed that various methodologic characteristics of a diagnostic study might influence the results of diagnostic accuracy. Their analysis was hampered by the poor reporting in many studies. Improved reporting may lead to better estimation of the influence of methodologic characteristics on diagnostic accuracy. Moreover, better estimates of biases or sources of variation within diagnostic studies can be made if all STARD items are reported. The STARD guidelines are not the first to focus on the reporting of studies. CONSORT, QUOROM, and MOOSE have emphasized the importance of better reporting of other study designs (8–10).
The quality of reporting in articles on diagnostic accuracy is of great importance for assessing the generalizability of the results. It is also essential for the detection of methodologic flaws, the recalculation of sensitivity and specificity, repetition of the study, and application of the results in clinical practice. Fortunately, a number of journals have already changed their instructions to authors and require authors to complete the STARD checklist and to include a flow diagram that represents the design of the study and the flow of patients.
Our study has a few limitations. First, the identification of studies of diagnostic accuracy is difficult. We searched MEDLINE by using a validated search strategy to identify all studies on diagnostic accuracy published in 2000. However, the search strategy has a sensitivity of 80.0% and a specificity of 97.3% (13). Therefore, we may have missed studies on diagnostic accuracy that were not identified with our search strategy.
Second, the generalizability of the results of this study may be questioned. We evaluated the quality of reporting of studies on diagnostic accuracy published in 2000 in 12 journals. For this purpose, journals were selected if they occurred in the top-50 ranking of journals that frequently publish articles on diagnostic accuracy and if they had an impact factor of at least 4. However, it remains unclear whether results would be similar for journals that only rarely publish diagnostic accuracy studies or for journals with an impact factor of less than 4.
Furthermore, as almost 50% of all identified articles on diagnostic accuracy were published in Radiology, we decided to limit the number of articles published in Radiology to 25. As the quality of reporting could have improved during the year, we selected the first two articles of each month and the first three articles published in the December 2000 issue. In our opinion, the quality of reporting in the articles not selected for the review will be similar to that in the selected articles.
We strongly recommend that authors, editors, and reviewers use the STARD statement for preparing, writing, and reviewing articles on diagnostic accuracy. We also stress that special attention should be paid to the identification of the article as a work pertaining to diagnostic accuracy and that a flow diagram should be included to represent the design of the study and the flow of patients. Hopefully this will lead to an improvement in the quality of reporting in the near future.
| FOOTNOTES |
|---|
Authors stated no financial relationship to disclose.
Author contributions: Guarantors of integrity of entire study, H.C.W.d.V., L.M.B.; study concepts and design, H.C.W.d.V., P.M.B., L.M.B., J.B.R.; literature research, N.S., A.W.S.R.; data acquisition, N.S., A.W.S.R.; data analysis/interpretation, N.S., H.C.W.d.V., A.W.S.R., R.W.J.G.O., D.A.W.M.v.d.W.; statistical analysis, N.S.; manuscript preparation and definition of intellectual content, N.S., H.C.W.d.V.; manuscript editing, N.S.; manuscript revision/review and final version approval, all authors.