Special Report
1 From the Department of Orthopaedic Surgery, Great Ormond Street Hospital for Children, Institute of Child Health, University College London, Great Ormond St, London WC1N 3JH, England (A.R.); Population Health Sciences Research Institute (N.M.M., A.S.D.) and Department of Diagnostic Imaging (A.S.D.), the Hospital for Sick Children, Toronto, Ontario, Canada; the Hospital for Sick Children Library, Toronto, Ontario, Canada (E.U.); and Department of Medical Imaging, University of Toronto, Ontario, Canada (A.S.D.). Received August 15, 2005; revision requested October 19; revision received November 2; accepted December 1; final version accepted February 1, 2006. Address correspondence to A.R. (e-mail: a.roposch{at}ich.ucl.ac.uk).
| ABSTRACT |
|---|
Materials and Methods: A systematic review of the MEDLINE, EMBASE, DARE, and Cochrane Library databases was performed by using a validated search strategy. Two independent reviewers evaluated articles by using the Standards for Reporting of Diagnostic Accuracy (STARD) and Quality Assessment of Studies of Diagnostic Accuracy included in Systematic Reviews (QUADAS) statements. Items were reported individually for STARD and QUADAS because these instruments do not incorporate a summary score. A simple κ statistic with 95% confidence intervals was used to measure the level of agreement between the two reviewers.
Results: Ten studies were included. In three studies, reliability was investigated, and in seven studies elements of both validity and reliability were investigated. In no study did the authors adequately report more than 40% of the STARD items. The quality of methods that were used in the studies was poor. Only one (14%) of seven studies provided information on more than 50% of the QUADAS items. All studies included a good description of image acquisition, but data analysis was imperfect and lacked estimates of diagnostic accuracy and precision. Authors tended to overinterpret their results.
Conclusion: Overall, there was imperfect reporting of diagnostic accuracy in studies on the use of US for diagnosis of DDH.
Supplemental material: radiology.rsnajnls.org/cgi/content/full/2413051358/DC1
© RSNA, 2006
| INTRODUCTION |
|---|
To produce results that have validity (the degree to which a US scan corresponds to the true state of the hip) and reliability (the degree to which the same result is found when the hip is scanned on two different occasions), it is important that US methods for examination of the hip fulfill the basic diagnostic test standards (Fig 1). In recent decades, quality assessment of diagnostic tests has gained increasing interest. Complete and accurate reporting is essential to judge the generalizability of the results and the potential for bias. It has been suggested that the general quality of reporting in diagnostic test studies is poor (2–5), with an overestimation of diagnostic accuracy (2). In a recent systematic review, Woolacott et al (6) assessed the role of US for screening policies in patients with DDH; however, to our knowledge no previous systematic review has been conducted to investigate the current status of knowledge on measurement properties (ie, reproducibility and validity) for US of the hip. Thus, the purpose of our study was to systematically review the quality of diagnostic accuracy reporting in studies on the use of US for the diagnosis of DDH.
| MATERIALS AND METHODS |
|---|
Two reviewers (A.R., A.S.D.) independently assessed the titles and key words of all eligible citations to determine if the studies met our inclusion criteria. If the content of a study was not obvious from the title and key words, the abstract was retrieved and evaluated by both reviewers for eligibility. In the second step, all abstracts of articles that were found to be eligible for inclusion were reviewed independently by the same two reviewers. Finally, the original studies of the selected articles were evaluated independently (A.R., A.S.D.). At any stage, disagreements were discussed and resolved in a consensus meeting before the next step could be performed.
Inclusion Criteria
This systematic review included studies on the measurement properties (diagnostic accuracy) of any US methods that were reported for the diagnosis of DDH in neonates (age 0–4 weeks), infants (1–12 months), or older children (13–24 months). Specifically, we included studies in which at least one of the following criteria for single methods were described: process criteria (ie, how to perform the US examination) and, if applicable, conversion criteria (ie, how to interpret a US scan). Also included were studies in which any form of reliability (reproducibility) or validity was investigated. Reliability is defined as obtaining the same result when a phenomenon is measured by the same or different clinicians on the same or different occasions (7). Validity is the degree to which the result of a measurement corresponds to the true state of the phenomenon being measured (7).
Excluded were studies with inappropriate reference standards, such as studies on the correlation between clinical examination and US or diagnostic yield studies (eg, studies on US screening policies for DDH). Clinical examination cannot be considered a reference standard to determine the diagnostic accuracy of US because it does not provide an equivalent or superior amount of information compared with US. Articles written in languages other than English, French, German, Italian, Spanish, and Portuguese were excluded.
Data Extraction and Outcome Measures
The Standards for Reporting of Diagnostic Accuracy (STARD) statement (8), which was developed to improve the reporting of studies on diagnostic accuracy, was used to assess the quality of reporting. This evaluative tool contains 25 items. The reviewers independently evaluated each study to determine whether each item was adequately described according to the STARD guidelines (8); items were rated as adequately described, not described, or partially described. Disagreements were resolved by consensus. The STARD statement does not incorporate a quality score, and items are reported individually (8).
For each study, the quality of methods was assessed by using the Quality Assessment of Studies of Diagnostic Accuracy Included in Systematic Reviews (QUADAS) criteria, which is a 14-item instrument (9). For each item, the two reviewers independently assessed whether the elements that were mentioned in that item were adequately described (yes or no). If it was unclear from the information provided in the article as to whether an item had been addressed in a particular study, the item was rated as "unclear." Disagreements were discussed and resolved by consensus. Reliability studies were not assessed with the QUADAS criteria because only two items were applicable. Similar to the STARD tool, QUADAS does not incorporate a total quality score (10).
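Because neither instrument yields a summary score, a per-study tally of item ratings is one way to describe how many items were adequately reported. The following is a minimal sketch in Python, not the procedure used in this review; the function name `summarize_item_ratings`, the rating labels, and the example ratings are hypothetical and serve only as illustration.

```python
from collections import Counter


def summarize_item_ratings(item_ratings, adequate_label="adequate"):
    """Return (counts per rating, share of items rated adequate) for one study.

    item_ratings maps a checklist item number to the rating it received.
    """
    if not item_ratings:
        raise ValueError("at least one rated item is required")
    counts = Counter(item_ratings.values())
    share_adequate = counts[adequate_label] / len(item_ratings)
    return counts, share_adequate


# Hypothetical ratings for a single study on the 25 STARD items (illustrative only).
example_study = {item: "not described" for item in range(1, 26)}
example_study.update({1: "adequate", 2: "adequate", 8: "adequate", 9: "partial"})

counts, share = summarize_item_ratings(example_study)
print(counts["adequate"], counts["partial"], counts["not described"])
print(f"{share:.0%} of items adequately described")
```

The same function could be applied to QUADAS ratings by passing `adequate_label="yes"`, assuming the ratings are coded as "yes," "no," or "unclear."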
Statistical Analysis
The level of agreement between the two reviewers in scoring the STARD and QUADAS criteria was assessed by using a simple κ statistic with 95% confidence intervals (SAS, version 9.1; SAS Institute, Cary, NC).
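For readers unfamiliar with the simple (Cohen) κ statistic, the calculation can be sketched as follows. This is an illustrative Python sketch, not the SAS procedure used in the study; the function name `cohen_kappa`, the example ratings, and the choice of a simple large-sample standard error are assumptions, and statistical packages may use a different variance formula.

```python
import math
from collections import Counter


def cohen_kappa(ratings_a, ratings_b, z=1.96):
    """Return (kappa, lower, upper) for two raters' categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items on which both raters gave the same rating.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (p_o - p_e) / (1 - p_e)
    # Simple large-sample standard error (an approximation, assumed here).
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, kappa - z * se, kappa + z * se


# Hypothetical ratings of ten checklist items by two reviewers (illustrative only).
reviewer_1 = ["y", "y", "y", "y", "n", "n", "n", "n", "y", "n"]
reviewer_2 = ["y", "y", "y", "n", "n", "n", "n", "n", "y", "y"]
kappa, lower, upper = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f} (95% CI: {lower:.2f}, {upper:.2f})")
```

With so few items the interval is very wide and its upper limit can exceed 1, which is a limitation of this approximation rather than a meaningful value.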
| RESULTS |
|---|
κ = 0.83 [95% confidence interval: 0.77, 0.89]). Disagreements (n = 21) between the two reviewers were resolved in all cases and were the result of poor reporting of the study design and a vague description of key issues, such as the number of participants that satisfied the inclusion criteria (item 16) or the time interval between index and reference tests (item 17). Overall, all items in the STARD statement were poorly reported (Appendix E1 [http://radiology.rsnajnls.org/cgi/content/full/2413051358/DC1]). Only one (9%) of 11 studies (14) included the medical subject heading term "sensitivity and specificity" (item 1). The objective of the study was reported in all but one article. Study objectives were heterogeneous and included a description of US methods for examination of the hip, evaluation of the correlation between US of the hip and radiography, and assessment of reproducibility (item 2).
Participants. Only four (36%) of 11 articles included information on the study population; however, no article specifically provided the inclusion and exclusion criteria of the study. Only three (27%) of 11 studies included some information on the recruitment process. In three studies (27%), apparently consecutive cases were sampled, and in two studies (18%) a random sample was used. Data collection was performed prospectively in two studies (18%) and retrospectively in three studies (27%); data collection was not specified for the remaining studies.
Reference standard. In all seven validity studies, radiography was used as the reference standard. However, it was not specifically stated what kind of radiographic techniques were used or when the radiographs were obtained in relation to the US examination. Among several radiographic measures, the acetabular index was the most consistently used (four studies). However, the rationale for using this particular measure as a reference standard was not stated in any study.
Test methods (items 8–11). Process criteria were consistently well reported for all studies. However, the rationale for using specific units, cutoff values, or categories of US results was not stated in any study. Time intervals between US and radiography were not reported in any study, and the distribution of the severity of disease in patients with DDH was reported in only two (18%) of 11 studies. Only one (9%) of 11 studies provided explicit information on the number and expertise of the persons who executed and read the index and reference standard tests (12).
Statistics (items 12–13). In two (18%) of 11 studies, Pearson correlation coefficients for US to radiography were calculated as a measure of concurrent validity. In four (36%) of 11 studies, reproducibility was described in terms of the mean difference between single measurements, whereas in three studies (27%) standard deviations or confidence intervals were included. Sound statistical methods for the assessment of reproducibility were used in only three studies (27%) (11,12).
Time frame of study (item 14). Only four (36%) of 11 studies included information on when the study had been performed.
Characteristics of participants (item 15). Overall, there was insufficient reporting of the clinical and demographic characteristics of participants, with only four (36%) of 11 studies providing sufficient information (14–16,19).
Test results (items 17–20). In studies on diagnostic accuracy, none of the authors reported their results in a 2 × 2 table for the entire sample, nor did they define diseased versus nondiseased cases or true-positive versus true-negative findings. In one (14%) of seven studies, absolute numbers were not reported (15); in a second study (14%), only the results of a subsample were reported (16); and in a third study (14%), the definitions for single disease states (eg, residual dysplasia, dysplasia, and subluxation) were not specified, thereby making it unclear as to what the true-positive and false-positive rates were.
Estimates (items 21–24). Precision values, such as 95% confidence intervals, were not reported for the estimates of diagnostic accuracy in any study. Possible sources of heterogeneity in the results were not explored in any study. Interestingly, indeterminate results, such as hips that were dysplastic but not dislocated, were not reported in any study that assessed the validity of US. As for test reproducibility, Bar-On et al (11) reported an interrater reliability (mean κ) of 0.50 (95% confidence interval: 0.45, 0.55) and an intrarater reliability of 0.61 (95% confidence interval: 0.49, 0.69) for the Graf method of US of the hip. These estimates were based on the rating of US scans by three pediatric orthopedic surgeons. In the study by Ömeroglu et al (12), estimates were not significantly different among the four groups of raters with different levels of expertise according to the Graf method; the best interrater and intrarater reliability results (κ coefficient) were 0.36 ± 0.06 (standard deviation) and 0.62 ± 0.18, respectively.
Of the remaining seven studies, four (57%) contained inter- and intrarater errors for US methods, including means and standard deviations, and four (57%) did not include an evaluation of the reproducibility of methods. Dias et al (20) investigated the reliability of several US parameters and showed a wide range of intra- and interrater reliability, with κ coefficients ranging from 0.46 ± 0.24 to 0.68 ± 0.19 and from 0.09 ± 0.38 to 0.27 ± 0.25, respectively. They also reported intraclass correlation coefficients for α and β angles, which were better for intrarater reliability (0.69 and 0.78, respectively) than for interrater reliability (0.65 and 0.11, respectively).
Quality of Methods
Seven studies were assessed by using the QUADAS tool (Appendix E2 [http://radiology.rsnajnls.org/cgi/content/full/2413051358/DC1]). Approximately 45 minutes were required to evaluate each article with the QUADAS instrument. The QUADAS criteria were investigated after all studies had been evaluated with the STARD statement. Interrater agreement on the items of the QUADAS instrument was very good (κ coefficient, 0.87 [95% confidence interval: 0.78, 0.96]). Disagreements between the two reviewers could be resolved in all cases. Disagreements resulted from a vague description of key information regarding the sampling of the patients (item 1), the selection of patients who underwent verification with radiography (item 5), and the timing of US and radiography (item 7).
Overall, the quality of methods used in these studies was poor, although information on image acquisition (item 8) was reported consistently among all studies. Radiography was used as the reference standard in all studies (item 3) but was not consistently performed in all study subjects and did not result in the correct classification of the target condition in all studies.
In six studies, clinical data were available during the interpretation of the US examination because such data would also be available in clinical practice (item 12). In four studies (57%), the spectrum of patients was found to be representative of patients who would undergo the test in practice (item 1). In four studies (57%), it was clearly reported whether the whole sample or only a part of the sample underwent radiography (item 5). Selection criteria (item 2), the time period between US and radiography (item 4), missing information at radiograph acquisition (item 8), and no reporting on potential problems with the interpretation of US findings (item 13) were among the major methodologic flaws (Appendix E2, [http://radiology.rsnajnls.org/cgi/content/full/2413051358/DC1]).
| DISCUSSION |
|---|
Overall, the results of the present study indicate that diagnostic test standards for US of the hip in the context of DDH are poorly established. We identified only seven studies in which the authors addressed any aspect of validity for three different US methods and only four studies in which the authors assessed reliability. In none of the accuracy studies did the authors adequately report more than 40% of the STARD items. Also, the quality of methods in studies on the validity of US of the hip was poor. Only one (14%) of seven studies included information on more than 50% of the QUADAS items (14).
Studies on the validity of US of the hip had one positive commonality in that they provided an appropriate description of image acquisition. However, the formulation of the research question and the rationale for using radiography as a reference standard (criterion validity) were poor because the expected relationship between US and radiography (construct validity) was not formulated at all. Overall, the validity of single methods remains unclear. There was poor definition of diseased and nondiseased cases throughout the studies, and 2 x 2 tables were not reported. A consistent finding among all studies was that dislocation on US scans correlated well with radiographic results, as did normal hips. The area between these two extremes was considered dysplasia but was not well defined. As a result, measures of test accuracy, such as sensitivity and specificity, remained unclear. Interestingly, of the seven validation studies, only one (14%) was performed prospectively, which partially justifies the poor quality of methods for studies included in this review.
Better design and reporting were found in the reliability studies. In three studies, sound methods were applied to establish the reliability of US of the hip according to the Graf method (11,12). Reliability was found to be poor (20) to moderate (11,12), and further studies were recommended to improve it.
Reliability has been assessed for other US methods as well. However, substantial flaws in design and analysis were noted. There was a lack of information regarding the profession, training, and expertise of the raters; the time frame of the study; and the sample size calculation. Estimates of reproducibility were given as means and standard deviations for continuous variables, without accounting for chance agreement between the two ratings. Without the use of sound methods, the generalizability and interpretation of results remain unclear. For instance, Andersson (13) based his reliability assessment on only five cases per rater and recommended a "visual analysis with a more global approach" for US scans.
The STARD initiative recommended that authors of diagnostic studies should consistently use the medical subject heading term "sensitivity and specificity" to facilitate retrieval of their studies. We support this concern because our review confirmed that the exclusive use of these medical subject heading terms limits the search for studies (3,26). Only 47 (12%) of the 396 citations that were identified with our search strategy were obtained by using this medical subject heading term.
We agree with others (4,8) and recommend the use of flow diagrams to illustrate the methods used in diagnostic accuracy studies. In agreement with the findings of Smidt et al (4), the reviewers in the present study had to spend a considerable amount of time identifying the sampling frame, determining the sequence in which the tests were performed, and identifying the number of individuals who underwent these tests. Flow diagrams would have been useful to identify these issues more efficiently.
Several major methodologic flaws were identified by using the QUADAS instrument, including sampling bias, verification bias, and imperfect analysis, all of which were related to the use of inappropriate statistical methods. There was a tendency to overinterpret results. For example, the author of one study concluded that image acquisition and interpretation were easy to learn but did not provide any data to support this conclusion (13). In another study (16), normal and dislocated hips correlated well at US and radiography but dysplastic hips did not. Still, the authors concluded, "US was reliable in detecting hip pathology."
This study was not conducted to investigate the role of US for DDH screening. This issue falls within diagnostic yield studies, which were not the focus of the present systematic review. In a recently published systematic review, Woolacott et al (6) concluded that there is a lack of evidence either for or against US screening of newborns for DDH. Although we did not investigate the issue of screening, the results of our study correspond with those of Woolacott et al, which is plausible because diagnostic accuracy is a basic standard and prerequisite of any diagnostic test. If basic standards are not met, then the application of the test for screening purposes will likely demonstrate poor results as well.
A limitation of our study was the difficulty in identifying diagnostic accuracy studies of US methods for examination of the hip. We used a validated search strategy (3), which has a sensitivity of 80% and a specificity of 97%. Thus, we may have missed studies. However, we can be certain that for the three methods of US examination of the hip that were included (14,16,27), all of the eligible studies were included.
The results of our study raise interesting issues. Although basic diagnostic test standards are not met by commonly used methods of US of the hip, these methods have become standard in clinical care and have even been applied in population screening programs (28). Regarding US for the diagnosis of DDH, there is poor evidence for the diagnostic accuracy and benefit of the diagnostic test in terms of outcome (6).
Considering that there has been a shift toward evidence-based and cost-effective health care and that clinicians are ordering many more diagnostic tests now than in the past, physicians may be required to more rigorously justify the benefit of tests for individual patients. This may involve justifying how the test result may change clinical decision making, how it may improve the likelihood of a correct diagnosis, and how it may improve clinical outcomes.
An overuse of diagnostic investigations in general has been suggested (2,29), and a more efficient use was recommended. Thus, on the basis of the results of our study we see a clear need for further investigation of the diagnostic accuracy of US of the hip. Establishing sufficient diagnostic accuracy is the prerequisite for diagnostic yield studies, such as studies on the role of US for DDH screening.
We suggest that three main elements should be considered in future studies. First, a clear description of the participants and the sampling frame is essential. The lack of information on demographic characteristics and inclusion and exclusion criteria compromises the interpretation of the results.
Second, radiography seems to be an adequate reference standard. However, the rationale for choosing the reference standard has to be stated, as well as the kind of relationship the investigators expect a priori between US and the reference standard.
Third, the test results should be reported with regard to the a priori assumptions. Cutoff values to distinguish between diseased and nondiseased cases are essential to cross-tabulate the results and to calculate estimates of accuracy with confidence limits, where applicable.
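As an illustration of the cross-tabulation described above, the following Python sketch derives sensitivity and specificity with 95% confidence limits from a 2 × 2 table of US results against the reference standard. It is a sketch under stated assumptions: the function names, the use of the Wilson score interval, and the example counts are hypothetical and are not taken from any study in this review.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score confidence interval for a proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def accuracy_from_2x2(tp, fp, fn, tn):
    """Return sensitivity and specificity, each as (estimate, lower, upper).

    tp/fn: diseased hips (per reference standard) with positive/negative US.
    tn/fp: nondiseased hips with negative/positive US.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": (sensitivity, *wilson_interval(tp, tp + fn)),
        "specificity": (specificity, *wilson_interval(tn, tn + fp)),
    }


# Hypothetical counts (illustrative only).
result = accuracy_from_2x2(tp=40, fp=10, fn=10, tn=90)
for name, (estimate, lower, upper) in result.items():
    print(f"{name}: {estimate:.2f} (95% CI: {lower:.2f}, {upper:.2f})")
```

The counts can only be filled in once a cutoff value separating diseased from nondiseased hips has been fixed a priori, which is the point made in the preceding paragraph.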
We found the STARD criteria useful in the context of this systematic review. They provided a good framework to assess studies on diagnostic accuracy. We recommend use of the STARD criteria not only for evaluating but also for preparing and drafting studies on diagnostic accuracy.
| ADVANCE IN KNOWLEDGE |
|---|
| FOOTNOTES |
|---|
Abbreviations: DDH = developmental dysplasia of the hip, QUADAS = Quality Assessment of Studies of Diagnostic Accuracy Included in Systematic Reviews, STARD = Standards for Reporting of Diagnostic Accuracy
Authors stated no financial relationship to disclose.
Author contributions: Guarantor of integrity of entire study, A.R.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, all authors; experimental studies, A.R., A.S.D.; statistical analysis, A.R.; and manuscript editing, all authors.