Evidence-based Radiology Series |
1 From the Department of Specialist Radiology, University College Hospital, Podium Level 2, 235 Euston Rd, London NW1 2BU, England (S.H.); and Cancer Research UK/NHS Centre for Statistics in Medicine, Wolfson College Annexe, Oxford, England (D.G.A.). Received November 15, 2005; revision requested December 21; revision received February 23, 2006; final version accepted March 10. Address correspondence to S.H. (e-mail: s.halligan{at}ucl.ac.uk).
| ABSTRACT |
|---|
© RSNA, 2007
Evidence-based medicine is "the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients. The practice of evidence-based medicine means integrating individual clinical expertise with the best available external evidence from systematic research" (1). Evidence-based medicine has become something of a mantra over the past decade for a variety of reasons. First, there has been an explosion of "evidence": more medical research is being performed than ever before, and findings are accumulating rapidly and becoming increasingly available to both clinicians and their patients, not least because of the Internet. Patients and their advocates are increasingly well informed, and, whereas they once accepted their physician's advice unhesitatingly, there is an increasing awareness of choice, uncertainty, and accountability. At the same time, clinicians want the best for their patients and, furthermore, their actions are under ever-closer scrutiny. On a broader scale, health policymakers are reluctant to fund procedures without clear evidence of effectiveness. There is a lot of information "out there": some of it good, some of it not so good, and some of it downright bad. Doctors, patients, and policymakers all need to synthesize information in an objective manner so that it is comprehensible and useful. Systematic review is one way of achieving this. As part of this series on evidence-based practice and how it relates to radiology, we will explain the benefits of a systematic review and provide an overview of how to perform one, with particular reference to studies of diagnostic tests. We will illustrate our article with examples from our own collaboration, which is a systematic review of computed tomographic (CT) colonography (2).
| WHAT IS SYSTEMATIC REVIEW AND WHY DO WE NEED IT? |
|---|
Imagine a radiologist who is deciding whether to set up a CT colonography service to screen for colorectal cancer. First, he wants to establish how good the conventional radiologic test is (the barium enema examination) and searches for a narrative review. However, he finds that barium enema examination is either good or bad depending on whether the expert writing the review is a radiologist (8) or a gastroenterologist (9). Frustrated, our radiologist decides to seek expert opinion firsthand: he attends a CT colonography course in Rome and considers whether he could use CT colonography to examine patients with ulcerative colitis. However, when he asks the expert panel at the course if this is reasonable, they cannot agree (10). The more controversial the subject, the more likely it is that expert opinion will diverge. Indeed, the courtroom is perhaps the best example of how experts may disagree despite considering the same set of circumstances.
To get unbiased advice, perhaps our radiologist should look at information from research studies? The result from a well-designed and performed randomized trial is generally regarded as the best evidence and is held in higher esteem than "the clinical experience of respected authorities" (11). Our would-be colonographer looks for research to answer a seemingly simple question: "Should an intravenous spasmolytic agent be administered during CT colonography?" Perplexed, he discovers that some researchers find it beneficial (12) while others do not (13).
Unfortunately, too many research studies are poorly designed and executed (14). Journals must fill their pages to be profitable, and poor quality does not necessarily preclude publication. Small, underpowered studies from single centers are far more common than are high-quality trials (15), but these single-center studies are frequently too weak to enable authors to detect the difference being investigated or to exclude it. The field of radiology is especially prone to this trend because the pace of technologic change generates constant pressure to rapidly assess new equipment. Studies that have been carefully planned and performed are likely of higher quality than are those that have been executed rapidly. Also, technologic assessments are easier to perform than are studies to determine the therapeutic effect of radiologic procedures. However, "better" technology frequently fails to translate into better health outcomes for patients, for whom the difference between a four-detector row and a 64-detector row CT scanner is likely to be immaterial (16,17).
Single-center studies are also frequently performed in centers of excellence with specialized radiologists acting as observers. Results of these studies are often not generalizable as a consequence. It is also recognized that reports of trials that have positive findings are more likely to be published (and therefore accessible) than those that do not (18,19), a phenomenon known as "publication bias." This bias distorts the real-world performance of the procedure in question. Furthermore, as with narrative reviews, the way authors report and interpret the results of their own studies may be influenced heavily by the specialty of the principal investigator. For example, our radiologist looks for research evidence that CT colonography can depict colonic polyps and finds that articles from single-center studies led by radiologists say it can (20), whereas those from studies led by gastroenterologists say it cannot (21). Surely multi-center studies must be better? In general they are, but our radiologist finds that results also seem to vary with the specialty of the principal investigator: CT colonography is given a "thumbs-up" from radiologists (22) but not from gastroenterologists (23). Again, expert opinion cannot differentiate good from bad with certainty (24).
How can our radiologist rationalize this conflicting information? Increasingly, he would be advised to consult a systematic review. Systematic reviews are articles that summarize other articles (25,26) and are therefore sometimes called "secondary research" or "secondary evidence," as discussed earlier in this series (27). They describe the body of work available on a topic by identifying all relevant articles (known as the "primary studies"), extracting relevant information from them (about methods and findings), and summarizing their results (Fig 1). The difference between a properly conducted systematic review and a narrative review is that there is a formal, careful search procedure for the former so that all relevant research is identified (Fig 2). Also, an explicit selection process based on objective quality criteria determines which studies are included in the review rather than inclusion being based on the whims and biases of the authors. This sifting procedure introduces concepts of both inclusion and exclusion criteria that act as quality filters to let good studies in and keep bad ones out. This procedure also provides an estimate of the relative proportions of high- and low-quality research available on a topic.
| META-ANALYSIS, RANDOMIZED CONTROLLED TRIALS, AND STUDIES OF DIAGNOSTIC TESTS |
|---|
While it is quite possible to have randomized controlled trials of technology implementation, such trials are very unusual in radiology. Radiologic research is usually concerned with the accuracy of a diagnostic test rather than the effect of its implementation on patients' clinical outcome. Clinicians interpreting the results of imaging need accurate information regarding performance characteristics. Is the test accurate enough in the appropriate clinical scenario? This information is exactly what our fictitious CT colonographer wants. The reasons for undertaking a systematic review of a diagnostic test (be it an imaging test, blood test, or pathologic investigation) are exactly the same as for a therapeutic intervention; that is, to collect and, if possible, synthesize information to arrive at a performance estimate that is as precise as possible given the available data (31). Moreover, the process (ie, defining the question, searching the literature, extracting data) is identical to that used for systematic reviews of randomized controlled trials. However, analysis is different, because in good studies of diagnostic tests authors generally report pairs of summary statistics rather than a single overall effect (31). It is also fair to say that statistical methods for combined analysis of studies of diagnostic accuracy are less well established and mature than those for randomized controlled trials (32).
| HOW ARE SYSTEMATIC REVIEWS PERFORMED? |
|---|
1. WHAT IS THE RESEARCH QUESTION?
Each systematic review should start with a clear and specific question. It is also self-evident that the correct answer to the question should not be blindingly obvious. Otherwise, why bother? The research question must be defined with precision since this will determine what primary studies are searched for and what data are extracted from them. For example, compare the following questions: "Is CT scanning any good?" "How sensitive and specific is CT colonography for the identification of patients with and those without colon polyps?" Trying to decide which primary studies to select for the former review question would be almost impossible, whereas the second question already largely indicates the type of study required (ie, studies of CT colonography performed in humans with and without colonic polyps). Primary studies are selected on the basis of their relevance to the research question through the setting of objective inclusion and exclusion criteria.
2. INCLUSION AND EXCLUSION CRITERIA
Whether a primary study is selected for a systematic review or not revolves around two issues: relevance to the research question and methodologic quality. Relevance to the research question is most important because all potential primary studies that deal with the topic (even in some tangential way) must be identified; otherwise, the review is not "systematic." Thus, inclusion criteria revolve around the research question being asked.
Once inclusion criteria for primary studies have been determined, the next step is to define exclusion criteria that are based on methodologic quality. These criteria will dictate which of the primary studies identified by the search are ultimately discarded because their scientific methods are too poor for their results to be credible. Methodologic quality can be further broken down into two components: the technical methods used to perform the diagnostic test and the potential for bias because of the study design. For example, studies of CT colonography in which authors do not employ some form of three-dimensional rendering for image visualization might be excluded because the technology is too outdated. On the other hand, while the technical methods may be entirely acceptable, the study design may be unacceptable; for example, observers may be aware of the results from other tests when interpreting CT colonography images.
As we have said, the quality of research articles in radiology is inevitably variable and has been deficient. For example, in an evaluation of 54 articles published 4 years after the clinical introduction of magnetic resonance (MR) imaging, authors found the methods used were generally poor; for instance, in only 22% of the 54 studies were the results of MR imaging verified with an independent comparator (33). Synthesis and meta-analysis of poor studies, for which results are not credible, will be imprecise and of little value. Arriving at the right exclusion criteria is therefore crucial. If the bar is set too high then very few studies will meet the standard required and potentially useful data will be discarded. Conversely, if the bar is set too low then the fundamental principle of excluding poor-quality evidence will not apply. Systematic review cannot make a good study out of several bad ones: "Rubbish in equals rubbish out."
By way of example, how should our aspiring colonographer go about defining his inclusion and exclusion criteria for a systematic review of CT colonography? Restricting the review to primary studies of human subjects with real polyps is a reasonable starting point, since this relates directly to his research question ("How sensitive and specific is CT colonography for identification of patients with and those without polyps?"). This criterion allows narrative reviews, animal studies, phantom experiments, and studies with artificial polyps to be excluded. In regard to methodologic quality, it is firmly established that an independent comparator is required to validate the results of CT colonography (usually colonoscopy). Primary studies without a comparator should be excluded because their results will not be credible. It is also well established that the results of the test being assessed should not be biased by knowledge of results from the comparator. Thus, in good primary studies CT colonography will have been performed first and the results established before those from subsequent colonoscopy are known. It might be possible to include studies in which colonoscopy was performed first but only if it is absolutely clear that the CT colonographic images were interpreted without knowledge of endoscopic findings (ie, "blind" interpretation).
Reviewers also need to be aware of "spectrum bias." For example, the protocol for the primary study might stipulate that only those patients known to have polyps were recruited in order to increase the prevalence of abnormality in the data. In these circumstances, CT image observers will have a very high a priori expectation of finding a polyp even if the exact endoscopic findings are unknown to them; such studies should be excluded because bias is overwhelming.
It is also helpful to search for established consensus regarding technical performance of the test under investigation (and its comparator), which will exclude studies in which procedures were sloppy or outdated. For example, if our colonographer consulted the consensus statement from the Fourth International Symposium on Virtual Colonoscopy (34), he would find that full bowel preparation and prone and supine scanning are considered mandatory. Studies that do not employ these methods should be excluded.
Other factors, though, are not so clear-cut, and it is these factors that usually determine exactly where the bar is set. For example, although bowel preparation is generally considered mandatory for CT colonography, many different preparations are used. Insistence that all of the primary studies employ exactly the same preparation will result in good studies being rejected. Similarly, while insisting that studies employ multidetector row CT scanners might be reasonable, insisting that all use the same collimation, amperage, pitch, and reconstruction interval is not. While these factors might have some influence on the outcome of the individual study, they are unlikely to have as much influence as the methodologic flaws and biases already discussed.
Our view is that a broadly inclusive approach to primary studies should be adopted whenever there is uncertainty, because this approach provides more data for analysis than an exclusive stance. An inclusive approach also provides a broader overview of the methodologic quality of research in a given field (although it does mean more work!), which is especially valuable when an aim is to develop guidelines to improve future research methods. Some pilot searching may be necessary before inclusion and exclusion criteria are set definitively in order to get an idea of what data are available; it is no good to decide to perform a systematic review of randomized trials of CT colonography when none exist. Consequently, it is important to work hard at defining inclusion and exclusion criteria at the outset. Time spent at this stage will pay dividends later, not least because data extraction is arduous and tedious, and ill-defined research questions and inclusion and exclusion criteria will probably mean that extraction has to be repeated.
So, having thought hard about his research question and the type of primary study necessary to help answer it, our would-be colonographer might draw up a set of inclusion and exclusion criteria like those in Figure 3.
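The criteria discussed in this section could also be written down as an explicit checklist, which makes the reason for each exclusion easy to record. The following is only a minimal sketch in Python: it is not a reproduction of Figure 3, and the `PrimaryStudy` record, its field names, and the `screen` function are hypothetical names chosen to mirror the criteria described in the text.

```python
from dataclasses import dataclass


@dataclass
class PrimaryStudy:
    # Hypothetical fields, one per criterion discussed in the text.
    human_subjects: bool
    real_polyps: bool
    has_independent_comparator: bool   # e.g. colonoscopy
    blinded_interpretation: bool       # CT read without knowledge of the comparator
    full_bowel_preparation: bool
    prone_and_supine: bool
    recruited_known_polyps_only: bool  # a marker of spectrum bias


def screen(study):
    """Return (eligible, reasons); reasons lists every criterion that failed."""
    reasons = []
    if not (study.human_subjects and study.real_polyps):
        reasons.append("not a study of human subjects with real polyps")
    if not study.has_independent_comparator:
        reasons.append("no independent comparator")
    if not study.blinded_interpretation:
        reasons.append("CT interpretation not blinded to comparator")
    if study.recruited_known_polyps_only:
        reasons.append("spectrum bias: only patients with known polyps recruited")
    if not (study.full_bowel_preparation and study.prone_and_supine):
        reasons.append("technique below consensus standard")
    return (not reasons, reasons)


example = PrimaryStudy(True, True, False, True, True, True, True)
eligible, reasons = screen(example)
print(eligible, reasons)
```

Returning the full list of reasons, rather than stopping at the first failure, is a design choice that keeps a record of why each study was rejected.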
It is important that some of the data collected describe the methodologic quality of the primary studies, particularly when dealing with a new test relatively early in its evaluation and certainly if there is an intention to use the results of the review to inform and improve methods for future studies. The inclusion and exclusion criteria reflect the minimum acceptable standard for selection; the quality of included studies will inevitably vary above and beyond this and some will be better than others. This is especially so if a broadly inclusive approach to selection has been adopted (see section 2). Markers with which to determine the methodologic quality of selected studies are needed.
For example, our researcher has decided that he will exclude studies in which authors do not verify the results of CT colonography with colonoscopy. However, beyond this there are aspects related to colonoscopy that will act as markers for methodologic quality. For example, is the experience of the colonoscopists specified? Do the authors state the equipment used? Is the proportion of patients in whom there was incomplete colonoscopy reported? How were polyps detected at colonoscopy measured; that is, does the article merely state that "polyps were measured," or does it describe exactly how this was done? Such factors indicate quality of data reporting, and better-quality studies tend to have fuller descriptions of their methods. Also, it is recognized increasingly that colonoscopy is an imperfect reference test: polyps may be missed (35). In an attempt to combat this, some researchers have stipulated that colonic segments are re-examined by the colonoscopists if the CT report suggests that they have missed a polyp. This creates an "enhanced" reference standard that incorporates both CT and endoscopic assessments, against which images from CT and the initial colonoscopy examination can be compared independently (22). Excluding studies from the systematic review because they have not used this method is probably unreasonable (too many valuable studies would be discarded), but collecting data on whether it was employed may allow valuable subset analyses.
4. FINDING AND SELECTING PRIMARY STUDIES
So, our researcher is now armed with a research question, inclusion and exclusion criteria, and a set of primary and secondary questions that describe the data he needs to extract from the literature. A basic premise of good systematic review is that all available studies should be included if they make the grade; it is vital not to miss eligible studies. So, how does our researcher access all of the available literature?
Perhaps the best and easiest place to start is an electronic database, such as MEDLINE (36) or EMBASE (37). MEDLINE is compiled by the National Library of Medicine and is freely available through the National Institutes of Health Internet portal, PubMed (38). MEDLINE indexes several million articles, and approximately 400 000 more are added each year. In an earlier article in this series, Staunton (27) described how to perform a MEDLINE PubMed search; the bedrock is to use search terms that are most likely to capture the studies you are interested in. The terms used to index (and thus identify) articles vary despite articles being very similar. For example, CT colonography is also known as virtual colonoscopy, and some studies will be missed if this search term is omitted. Virtual endoscopy is another example. It is worth spending considerable time deciding on the search terms that will best capture all of the data before an electronic database is searched in earnest. CT colonography was first described in 1994 (39), so it is clear that the search period need not be earlier than this. However, depending on the research question, it may be necessary to search over several decades. Such decisions should be specified up front in the study protocol for the systematic review along with the search terms to be used for electronic databases.
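As an illustration of why the choice of search terms matters, the synonyms mentioned above can be combined into a single Boolean expression so that an article indexed under any one of them is captured. The sketch below is a minimal Python example; `build_query` is a hypothetical helper, and database-specific details such as field tags and date limits are deliberately left out because their syntax differs between databases.

```python
def build_query(synonyms):
    """Join synonyms with OR, quoting each phrase so it is matched as a unit."""
    quoted = ['"{}"'.format(term.strip()) for term in synonyms if term.strip()]
    return "(" + " OR ".join(quoted) + ")"


# Terms taken from the text; a real protocol would probably list more.
terms = ["CT colonography", "virtual colonoscopy", "virtual endoscopy"]
query = build_query(terms)
print(query)
# ("CT colonography" OR "virtual colonoscopy" OR "virtual endoscopy")
```

Keeping the term list in one place also makes it easy to reproduce in the study protocol exactly as it was run.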
MEDLINE, for example, indexes only approximately 30% of published medical articles (36), so other databases may be needed for coverage sufficient to identify all relevant research. Some databases are topic-specific, such as CancerLit (40), while others are method-specific, such as the Cochrane Collaboration (41), which indexes randomized controlled trials only. Thorough systematic reviews often supplement their electronic searches manually, for example, by hand-searching relevant journals. Like the search terms for the electronic search, the journals and the dates over which they are to be searched should be prespecified in the study protocol and broad enough to capture all of the information required.
A manual search for CT colonography articles should not be restricted to radiologic journals but should also encompass key surgical, gastroenterologic, and endoscopic journals, as well as major general journals (eg, Lancet, Journal of the American Medical Association, British Medical Journal, and New England Journal of Medicine). Decisions regarding the type of study should have been specified in the protocol. For example, are only peer-reviewed full publications acceptable or are abstracts also eligible? What about unpublished data? Abstracts and unpublished data pose considerable problems. There are a lot of meeting proceedings and a lot of abstracts, and peer review is nowhere near as rigorous for those as for a full publication. Perhaps more important, little methodologic information is available directly from an abstract because of size constraints. Information may be preliminary and possibly unreliable. However, if the area of research is very novel with few full articles published, then abstracts can be a source of important information.
Every study encountered during the search will need to be judged against the inclusion and exclusion criteria and a decision made whether to reject or retain the study for formal data extraction. Depending on the research question and the search terms used, hundreds or even thousands of articles may be identified, which is a daunting prospect. Luckily, electronic databases facilitate selection because the abstract of each article is usually readily available. On reading the abstract alone, it is often immediately apparent that the article should be rejected. For example, were humans investigated or not? The PubMed Web site also specifies if articles are reviews rather than original research. If there is any doubt whether an article should be selected or not, then its details should be noted along with those of articles that are definite candidates. It is worthwhile using more than one researcher for this initial sift, not because this saves time, but because the chance of articles being missed is reduced if each searches the database independently.
All articles that remain viable after assessment of their abstracts will need to be retrieved in full for a more informed judgment against the inclusion and exclusion criteria. This retrieval will entail reading the full text of the article if it is available online or photocopying the whole article from the journal. Articles in journals that are not carried by your local library will need to be ordered. It is worth pointing out that some articles will not be written in English although the abstract often is. Excluding articles merely because you cannot understand them will inevitably result in bias against non-English speaking researchers, which is something that should be avoided. Articles should be translated so that a proper judgment can be made. As for the initial sift, it is advisable that at least two researchers assess each full article against the inclusion and exclusion criteria independently. There will inevitably be situations where inclusion is borderline or controversial. This can be resolved by face-to-face discussion, perhaps by using others to arbitrate. A full list of excluded articles should be kept, along with the reason(s) for exclusion, so that this information can be presented in the final systematic review.
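When two researchers assess articles independently, their decisions need to be compared so that disagreements can be identified for discussion. A minimal Python sketch of that comparison follows; the article identifiers and the `reconcile` function are hypothetical and serve only to illustrate the bookkeeping.

```python
def reconcile(reviewer_a, reviewer_b):
    """Compare two reviewers' independent 'retain' decisions (sets of article IDs).

    Returns the articles both retained and the articles only one retained;
    the latter need discussion or arbitration.
    """
    agreed = reviewer_a & reviewer_b
    disputed = reviewer_a ^ reviewer_b
    return agreed, disputed


# Hypothetical article identifiers.
a = {"art01", "art02", "art05"}
b = {"art02", "art05", "art09"}
agreed, disputed = reconcile(a, b)
print(sorted(agreed), sorted(disputed))
```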
5. EXTRACTING THE DATA
Hopefully the procedures described above will have resulted in hundreds of potential articles being whittled down to those that definitely qualify for the systematic review. For example, our colonography researcher is relieved to find that approximately 75% of the indexed literature relating to CT colonography consists of narrative reviews that can be quickly and easily dismissed. Once the primary studies have been selected, and full-text articles on these studies are available, the pertinent data to answer the review questions must be extracted. Just as for study selection, it is sensible to use more than one researcher to extract data independently in order to check consistency. It may even be worthwhile to blind reviewers to the origin of the article and its authors so that assessment is truly unbiased. However, this may be impossible where the reviewers have extensive a priori knowledge of the existing literature in the field. Indeed, it is strongly advisable that at least one member of the review team has expertise in the specific topic of the review.
To facilitate extraction and subsequent analysis, data should be noted in a study table designed with the research questions in mind (see section 3). For diagnostic tests, the most obvious questions relate to sensitivity and specificity. For example, how many patients had polyps seen at CT colonography? How many of these polyps were verified with the reference test? The data collected for systematic reviews of diagnostic tests differ from those for systematic reviews of treatment effects, and we will explore this topic in the following section. In any event, the extraction table must have fields for all the information that is needed. Apart from the results of the diagnostic test, there will be methodologic features that indicate study quality (see section 3), and the extraction table should have fields to accommodate this. An extraction table for a systematic review of CT colonography might look something like Table 1. It is highly desirable to pilot such a table for a few studies initially to see if it is easy to complete in practice. In particular, it is essential that the reviewers understand all of the issues and that the form is unambiguous.
6. ANALYSIS AND PRESENTATION OF RESULTS
By this stage, the study table will have been completed as far as possible. In general, there are two types of analysis. First, it is valuable to describe, often in simple terms, the findings of the review with respect to the results and methodologic features in the articles selected (which by inference describes the features of those articles that were rejected). Second, it may be desirable to perform a meta-analysis. Whether or not meta-analysis is possible will depend on the study characteristics and the nature of the data.
The performance of a diagnostic test can be summarized in a variety of ways, but sensitivity and specificity are perhaps most familiar to radiologists. With our example of CT colonography, in good studies authors will have performed the experimental test (ie, CT) in patients whose true disease status (ie, whether patients have polyps) has been established by means of an independent reference test (ie, colonoscopy). Also in good studies authors will have performed CT colonography in both patients with polyps and patients without polyps. There are thus four possible outcomes for the results of the new test: Patients with polyps that are correctly detected at CT colonography have a true-positive (TP) finding with respect to the new test, and patients without polyps who also have a negative result at CT colonography have a true-negative (TN) finding. Few tests are perfect, and so there are inevitably two further outcomes: Patients with polyps that are missed at CT colonography will have a false-negative (FN) finding, and patients with no polyps, but in whom CT findings suggest there is an abnormality, have a false-positive (FP) finding. These four test characteristics can be combined in the form of a 2 x 2 table (Table 2) and a variety of summary statistics can be calculated. Sensitivity can be understood simply as the extent to which the new test helps correctly identify a patient with disease. Similarly, specificity can be understood as the extent to which the new test rejects patients who do not have disease: sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP).
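The two formulas above are simple enough to check by hand, but a short Python sketch makes the roles of the four cells of the 2 x 2 table explicit. The counts used here are hypothetical: they are not the data of Table 2 and were chosen only so that the results are round numbers.

```python
def sensitivity(tp, fn):
    """Proportion of patients with disease who have a positive test."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of patients without disease who have a negative test."""
    return tn / (tn + fp)


# Hypothetical 2 x 2 counts (100 patients with polyps, 100 without).
tp, fn, tn, fp = 69, 31, 84, 16
print(sensitivity(tp, fn), specificity(tn, fp))
```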
There are also other summary statistics that can be calculated that are less familiar than sensitivity and specificity but perhaps more informative. For the result of any diagnostic test, we can compare the probability of getting that result if the patient truly has the disease with the corresponding probability if they are healthy. The ratio of these two probabilities is the likelihood ratio (LR). For our example of CT colonography, the result of the test is binary: the patient either has polyps or does not. Such tests have a positive LR (LRpos) and a negative LR (LRneg), which describe the discriminatory powers of a positive and negative test result, respectively: LRpos = sensitivity/(1 - specificity) and LRneg = (1 - sensitivity)/specificity.
A positive likelihood ratio greater than 10 and a negative likelihood ratio less than 0.1 provide "convincing" diagnostic evidence; a positive likelihood ratio greater than 5 and a negative likelihood ratio less than 0.2 provide "strong" diagnostic evidence (43). Again, by using the data in Table 2, LRpos = 4.3 (calculated as 0.69/[1 - 0.84]) and LRneg = 0.37 (calculated as [1 - 0.69]/0.84).
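The arithmetic can be reproduced with a few lines of Python. This is only a sketch: the function name `likelihood_ratios` is hypothetical, and the inputs are the sensitivity (0.69) and specificity (0.84) quoted in the text.

```python
def likelihood_ratios(sens, spec):
    """Return (LRpos, LRneg) for a binary test.

    LRpos is undefined when spec == 1, and LRneg when spec == 0;
    this sketch does not handle those edge cases.
    """
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return lr_pos, lr_neg


lr_pos, lr_neg = likelihood_ratios(0.69, 0.84)
print(round(lr_pos, 1), round(lr_neg, 2))  # 4.3 0.37
```

Judged against the thresholds quoted above, neither value reaches the level described as "strong" diagnostic evidence.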
The odds ratio is another possible summary statistic. While probability refers to the fraction of times you would expect to see a result (and so ranges from 0 to 1), odds are defined as the probability that an event will occur divided by the probability that it will not occur (and so ranges from 0 to infinity). The diagnostic odds ratio refers to the odds of a positive test result in patients with the disease compared with the odds of the same result in patients without the disease. It combines the likelihood ratios in a single statistic: diagnostic odds ratio = LRpos/LRneg.
With data from Table 2, the diagnostic odds ratio for CT colonography is 11.6 (4.3 divided by 0.37). The diagnostic odds ratio is difficult to apply in clinical practice but is convenient when combining studies for a systematic review, since the result tends to be relatively independent of diagnostic threshold, the importance of which is described below. In contrast, the likelihood ratio has considerable practical value since it can be used to quantify increased certainty associated with a positive diagnosis. For this we need to know the prevalence of the disease in the population being studied (ie, what percentage of subjects actually have the disease in question). The odds of having the disease before the test is performed (pretest odds) is therefore: prevalence/(1 - prevalence). The odds of having the disease after a positive test result (posttest odds) is determined as the pretest odds multiplied by the positive likelihood ratio. Likewise, the odds of having the disease after a negative test result is determined as the pretest odds multiplied by the negative likelihood ratio. Thus the likelihood ratio measures the change in certainty of the diagnosis.
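A brief worked example may help to show how a likelihood ratio changes diagnostic certainty. The Python sketch below converts prevalence to pretest odds, multiplies by the likelihood ratio, and converts the result back to a probability. The prevalence of 10% is purely hypothetical, chosen for illustration, and the function names are our own.

```python
def odds(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1 - p)


def probability(o):
    """Convert odds back to a probability."""
    return o / (1 + o)


def posttest_probability(prevalence, lr):
    """Probability of disease after a test result with likelihood ratio lr."""
    return probability(odds(prevalence) * lr)


dor = 4.3 / 0.37           # diagnostic odds ratio = LRpos / LRneg
prevalence = 0.10          # hypothetical prevalence
p_after_positive = posttest_probability(prevalence, 4.3)
p_after_negative = posttest_probability(prevalence, 0.37)
print(round(dor, 1))       # 11.6
print(round(p_after_positive, 2), round(p_after_negative, 2))  # 0.32 0.04
```

Under this assumed prevalence, a positive result raises the probability of disease from 10% to about 32%, and a negative result lowers it to about 4%.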
Many studies of diagnostic imaging are performed by using ordered categories to account for differing diagnostic confidence. For example, in a study of CT colonography (44), five categories were used to assess readers' diagnostic confidence regarding polyps: no polyp, possibly a polyp, equivocal, probably a polyp, and definitely a polyp. Sensitivity and specificity can thus be calculated at each of these diagnostic thresholds and the results at each threshold expressed as a single 2 x 2 table (ie, categories above and below an individual threshold are combined). Alternatively, receiver operating characteristic (ROC) curves, familiar to many radiologists, graphically depict the sensitivity and specificity of the test at each diagnostic threshold by plotting the true-positive rate against the false-positive rate: Good tests have curves that rise steeply and pass close to the top left-hand corner, where both sensitivity and specificity equal 1. ROC curves are especially appropriate when the test result is a measurement.
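To make the idea of collapsing ordered categories concrete, the following sketch computes one ROC point (false-positive rate, true-positive rate) at each possible threshold of a five-category confidence scale. The counts are invented for illustration and are not taken from the cited study; the function name `roc_points` is likewise hypothetical.

```python
categories = ["no polyp", "possibly", "equivocal", "probably", "definitely"]

# HYPOTHETICAL counts of patients given each rating (not data from any study).
with_polyp = [5, 5, 10, 30, 50]
without_polyp = [60, 20, 10, 7, 3]

def roc_points(diseased, non_diseased):
    """Return one (false-positive rate, true-positive rate) pair per threshold.

    At threshold k, ratings at or above category k count as a positive test,
    which collapses the ordered categories into a single 2 x 2 table.
    """
    n_diseased = sum(diseased)
    n_non_diseased = sum(non_diseased)
    points = []
    for k in range(1, len(diseased)):
        true_pos = sum(diseased[k:])
        false_pos = sum(non_diseased[k:])
        points.append((false_pos / n_non_diseased, true_pos / n_diseased))
    return points

points = roc_points(with_polyp, without_polyp)
for label, (fpr, tpr) in zip(categories[1:], points):
    print(f"positive if rated '{label}' or higher: "
          f"sensitivity = {tpr:.2f}, specificity = {1 - fpr:.2f}")
```

Plotting these pairs, with sensitivity on the vertical axis and 1 minus specificity on the horizontal axis, gives the empirical ROC curve.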
So, we have a variety of summary statistics that can describe the performance of a diagnostic test: sensitivity, specificity, likelihood ratios, diagnostic odds ratio, and ROC curve. Synthesizing these for a meta-analysis is a two-stage process. First, summary statistics are derived for each individual study. These are then pooled together to obtain a single overall estimate across all studies. For our example of CT colonography, our researcher is interested in an overall estimate of sensitivity and specificity for polyp detection. Table 2 shows a 2 x 2 table of data from a single study of CT colonography (42), and sensitivity and specificity derived from these data were detailed previously. Similar tables must be populated as far as possible for each of the individual primary studies selected for the systematic review. Once this has been done, a weighted average can be computed across them.
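The two-stage process can be sketched as follows. This is a deliberately simple illustration in which each study is weighted by its number of relevant patients; it is not necessarily the weighting used in any published meta-analysis, and the three 2 x 2 tables are hypothetical.

```python
# HYPOTHETICAL 2 x 2 tables, one per primary study (not data from the review).
studies = [
    {"tp": 18, "fn": 6, "tn": 70, "fp": 10},
    {"tp": 45, "fn": 5, "tn": 120, "fp": 30},
    {"tp": 9, "fn": 6, "tn": 40, "fp": 5},
]

def weighted_mean(values, weights):
    """Weighted average of per-study estimates."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Stage 1: summary statistics for each individual study.
sensitivities = [s["tp"] / (s["tp"] + s["fn"]) for s in studies]
specificities = [s["tn"] / (s["tn"] + s["fp"]) for s in studies]

# Stage 2: pool across studies, weighting by the number of patients
# with disease (for sensitivity) or without disease (for specificity).
n_diseased = [s["tp"] + s["fn"] for s in studies]
n_healthy = [s["tn"] + s["fp"] for s in studies]
pooled_sensitivity = weighted_mean(sensitivities, n_diseased)
pooled_specificity = weighted_mean(specificities, n_healthy)

print(f"Pooled sensitivity: {pooled_sensitivity:.3f}")
print(f"Pooled specificity: {pooled_specificity:.3f}")
```

Larger studies pull the pooled estimate toward their own result, which is the intended effect of weighting.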
While it is important to have some basic understanding of the summary statistics that are being pooled in a meta-analysis, we do not intend to describe here the mathematical process since it is well beyond the scope of this article. However, there are some important points that should be kept in mind. Perhaps most important of these is the concept of heterogeneity. Statistical heterogeneity describes the variability found in the results of the individual studies, and the degree determines, to some extent, the choice of meta-analytical method. Heterogeneity can be so profound that formal meta-analysis is illogical, indicating that the studies are too varied to be combined meaningfully.
It is important to understand the difference between statistical heterogeneity (ie, variation in the results of the primary studies) and methodologic heterogeneity (ie, diversity in the primary studies). Methodologic heterogeneity (eg, heterogeneity because of factors related to patient selection, study design, or the reference test) may or may not cause statistical heterogeneity. In any event, a meta-analysis of studies with marked statistical heterogeneity is likely to be worthless. Meta-analysis is most appropriate when the component studies have examined similar patients by using comparable methods and reference standards. An ideal meta-analysis would be performed by using completely homogeneous studies so that the results from each individual study are mathematically perfectly compatible with those of any of the others. Not surprisingly, this homogeneity is rarely achieved in practice, and a degree of heterogeneity is inevitable.
Heterogeneity of component studies can be determined by using a variant of the χ² test, which is used to assess whether individual sensitivities and specificities are comparable. A simple visual assessment can also be made by plotting the sensitivities and specificities from individual studies as an ROC plot (31,32). While some divergence will be inevitable as a result of chance variation, marked divergence may be due to differing patient populations or biases, experimental or otherwise.
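One common form of such a test is the Pearson chi-square test of homogeneity of proportions, sketched below for sensitivity. Which variant a given review uses is not stated here, so this choice is an assumption, as are the hypothetical counts and the function name.

```python
# HYPOTHETICAL counts per study: (true positives, patients with disease).
studies = [(18, 24), (45, 50), (9, 15)]

def chi_square_homogeneity(studies):
    """Pearson chi-square statistic for equality of proportions across studies.

    Each study's observed count is compared with the count expected if all
    studies shared the pooled proportion. Degrees of freedom = studies - 1.
    """
    total_pos = sum(tp for tp, _ in studies)
    total_n = sum(n for _, n in studies)
    pooled = total_pos / total_n
    statistic = sum(
        (tp - n * pooled) ** 2 / (n * pooled * (1 - pooled))
        for tp, n in studies
    )
    return statistic, len(studies) - 1

chi2, df = chi_square_homogeneity(studies)
print(f"chi-square = {chi2:.2f} on {df} degrees of freedom")
```

With 2 degrees of freedom the 5% critical value is 5.99, so a statistic of this size would suggest more variation between these invented studies than chance alone explains. The same calculation applies to specificity by substituting true negatives and patients without disease.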
However, it is important to realize that, unlike in studies of treatment effects, heterogeneity between studies of diagnostic tests may actually be caused by differences in the diagnostic thresholds used to define a positive result: the "threshold effect." For example, in some studies of CT colonography, only polyps 1 cm or larger have been considered, while in others all sizes of polyps have been investigated (2). Obviously we would expect detection to be poorer for small polyps, and a comparison between studies that analyzed all polyps and those that restricted analysis to the largest polyps is clearly inappropriate without accounting for this difference. Moreover, even if all studies used the same size cutoff points, there may still be differences in diagnostic thresholds because of disparity in the technique used to measure the polyp. For example, polyps may have been measured before or after polypectomy.
It is interesting to note that, where variation between studies is caused mainly by differences in diagnostic thresholds, the data points from individual studies tend to have curvatures that parallel the underlying ROC curve for the test in question (31). If observed heterogeneity is due to differing diagnostic thresholds, then summary estimates of sensitivity and specificity will underestimate the capabilities of the test. In this situation, the best approach to meta-analysis is to combine the ROC curves from each individual study in a procedure that creates a single summary ROC curve. Statistical methods for deriving the best-fitting summary ROC curve are necessarily complex and relatively immature (32) and certainly well beyond the scope of this article.
If we imagine that our would-be colonographer has had his data analyzed by a friendly statistician, he might be presented with a forest plot, a well-established format for pictorial representation of a meta-analysis. Figure 4 shows a forest plot for the sensitivity of CT colonography for per-patient detection of colorectal polyps measuring 5 mm or larger. There are 13 primary studies, each represented by a horizontal line with a central black square. The square indicates the point estimate of the study and its area indicates the relative contribution the study makes to the meta-analysis. The horizontal line represents the 95% confidence interval around the point estimate. Underneath all of these, the point estimate and confidence interval for the meta-analysis of all 13 component studies are presented as a diamond (point estimate, 0.81; 95% confidence interval: 0.78, 0.83) in this example (Fig 4). Clearly, such a presentation and analysis are possible (and desirable) for specificity and also at different cutoff points, for example, restricted only to polyps measuring 1 cm or larger.
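The numbers behind each row of a forest plot can be sketched as follows. The study counts are hypothetical, the Wilson score interval is one of several reasonable ways to obtain a 95% confidence interval for a proportion, and using each study's share of patients as its relative weight is a simplifying assumption (formal meta-analysis typically uses other weights).

```python
import math

# HYPOTHETICAL per-study data: (label, true positives, patients with disease).
studies = [("Study A", 18, 24), ("Study B", 45, 50), ("Study C", 9, 15)]

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (Wilson score method)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

total_patients = sum(n for _, _, n in studies)
rows = []
for name, tp, n in studies:
    low, high = wilson_interval(tp, n)
    weight = n / total_patients   # ASSUMED weighting: share of all patients
    rows.append((name, tp / n, low, high, weight))
    print(f"{name}: sensitivity {tp / n:.2f} "
          f"(95% CI: {low:.2f}, {high:.2f}), weight {weight:.0%}")
```

In a drawn forest plot, the point estimate positions the square, the interval sets the length of the horizontal line, and the weight sets the area of the square.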
| INTERPRETATION AND CONSEQUENCES OF SYSTEMATIC REVIEW AND META-ANALYSIS |
|---|
Sensible interpretation of meta-analytical point estimates revolves around how reliable they are likely to be in practice. In essence, are they believable? To an extent, the confidence intervals around the estimates indicate their precision. For example, a recent meta-analysis of MR colonography showed a sensitivity of 75% but with a wide confidence interval of 47% to 91% and component study sensitivities ranging from 8.6% to 100% (50). However, even if confidence intervals are tight, it is important to realize that they do not indicate study quality. It is here that the systematic review component is most helpful. For example, our reviewer has discovered that the reporting quality for studies of CT colonography is generally poor: a fully populated 2 x 2 contingency table for per-patient data could be extracted from the published article for only 50% of studies despite the fact that this is the central analysis and these were studies of quality sufficient to make the review (2). Our researcher wants to use CT for screening, but in most studies symptomatic patients were examined, which undoubtedly affects the generalizability of point estimates. Also, most primary studies were performed by researchers from specialist centers, which also limits generalizability. It is time for our researcher to write a formal report of his systematic review describing what information he has found and, perhaps more important, what he has not been able to find. From this will emerge recommendations for future research.
While systematic review and meta-analysis often go hand in hand, it is unfortunate that the meta-analytical component tends to "hog the limelight." This is understandable: Non-statisticians perceive the analysis as mysterious and difficult, and it is also conceptually attractive to boil the whole review down to a single "point estimate" of performance. However, while formal meta-analysis is probably not something the average radiologist should attempt, it is rather easy for the average statistician when presented with a completed data table. We would like to stress above all else that it is the systematic review component that is the most taxing but potentially the most useful. An assessment of the quality of published research findings helps indicate exactly where there is a need for improved methods and data reporting and where there is a shortage of good available evidence. This is most apparent during data extraction, since it is at this point that the inadequacy of existing research is most visible.
This brings us back to the statement made earlier that there is a widely held belief that since the primary research has "already been done," systematic review and meta-analysis can be performed quickly. Good systematic reviews are hardly ever completed quickly when performed thoroughly because we do not live in an ideal world: the quality of data reporting is variable despite guidelines for the presentation of studies of diagnostic accuracy (48,49). Good systematic reviews get right to the heart of specific areas of research: What is the quality of work available in the field? Is it adequate for us to gather the information we need to inform day-to-day clinical decision-making? If not, what is needed to put things right? The answers to these questions are at least as important as a mathematical point estimate. On the basis of his experiences, our would-be colonographer may feel obliged to construct a "minimum data set," specific to studies of CT colonography, that describes exactly what methods are required and how data should be presented so that pertinent information can be readily obtained in the future.
| APPLYING RESULTS OF SYSTEMATIC REVIEW IN PRACTICE: A "BOTTOM-UP" PERSPECTIVE |
|---|
For medium-sized and larger polyps, our colonography researcher has found an average per-patient sensitivity and specificity of 86% and 86%, respectively. These estimates can be used with the pretest probabilities of having a polyp to determine the posttest probability in light of the test result, by using Bayes' theorem. For example, the prevalence of clinically important polyps in patients with and in those without a positive family history can be estimated by means of a literature search (53). If our researcher assumes that the pretest probability of adenomas is 5% in those patients without a family history and 15% in those with a family history and then inputs these data along with point estimates from the systematic review into a spreadsheet designed to generate conditional probabilities from summary statistics of diagnostic tests (54), then a negative CT colonography result reduces these probabilities to approximately 0.8% and 2.7% (ie, probabilities of 0.008 and 0.027) for those without and those with a family history, respectively. These estimates can be given to referring clinicians and integrated with patient factors (preference for one test over another, or comorbidity, for example). Similar calculations could also be performed for barium enema examination and colonoscopy. It is important to remember that the goalposts are always moving, especially for imaging technologies, and the literature needs to be reviewed periodically until equipoise has been reached.
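The calculation the spreadsheet performs can be approximated directly from Bayes' theorem. The sketch below is not the cited spreadsheet; it is a minimal stand-in that uses only the rounded figures quoted above (sensitivity and specificity of 86%, pretest probabilities of 5% and 15%), and the function name is hypothetical.

```python
def probability_of_disease_after_negative(pretest, sensitivity, specificity):
    """Bayes' theorem: P(disease | negative test result)."""
    false_negative = (1.0 - sensitivity) * pretest   # diseased and test negative
    true_negative = specificity * (1.0 - pretest)    # not diseased and test negative
    return false_negative / (false_negative + true_negative)

sensitivity = specificity = 0.86   # pooled estimates quoted in the text

no_family_history = probability_of_disease_after_negative(0.05, sensitivity, specificity)
family_history = probability_of_disease_after_negative(0.15, sensitivity, specificity)

print(f"No family history: {no_family_history:.4f} ({no_family_history:.1%})")
print(f"Family history:    {family_history:.4f} ({family_history:.1%})")
```

With these rounded inputs the posttest probabilities come out at about 0.0085 and 0.028, that is, a little under 1% and a little under 3%. Small differences from figures produced with unrounded estimates are to be expected.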
| LIMITATIONS AND PROBLEMS OF SYSTEMATIC REVIEWS AND META-ANALYSIS |
|---|
Articles are most visible if they are published in indexed journals, but publication is sometimes dependent on whether the results are positive (ie, the new test or treatment "works") (59). Such trials are also more likely to be published in English, more likely to be published rapidly and repeatedly, and more likely to be cited (59). It is also recognized that the topics selected for systematic review are influenced by prior knowledge of the results of the component studies: if the findings from CT colonography studies were uniformly poor, there would be little inclination to perform a systematic review. It is interesting to note that, at the time of writing this report, no studies have been performed to examine publication bias in studies of imaging tests.
There is also evidence that studies of diagnostic tests are generally less methodologically rigorous than are studies of treatment effects in randomized controlled trials (60), which makes the combination of studies of diagnostic tests more problematic. For example, we have already noted the generally poor assessment of MR imaging (33), and, although they are improving, methods for studies of diagnostic tests still lag behind those employed in randomized trials (61). Furthermore, statistical methods for meta-analysis of measures of diagnostic accuracy are less developed than those for assessment of treatment effects. For example, it is unknown whether the simpler methods used to pool estimates of diagnostic performance are actually misleading in practice, and more research is required to determine the exact influence of threshold effects on heterogeneity (31,32).
We have tried to explain how to identify and select articles for a systematic review with as little bias as possible. Researchers will inevitably disagree about the minimum methodologic standard necessary for inclusion. For example, some might argue that a systematic review of CT colonography should be restricted to studies with multidetector row scanners while others might consider this less important than other factors. Because of this disagreement, it is vital that the review team includes at least one expert familiar with the technology being assessed so that decisions are informed. Successful systematic reviews require a multitude of different skills, and the composition of the review team should reflect this.
We should also be careful not to dismiss expert opinion. Experts may actually know something worthwhile and have something useful to say! Narrative reviews remain popular with readers, not least in radiology (3). Small trials are not always "poor" trials and can divulge valuable information, not least by providing the raw material for meta-analysis and modeling (15). Findings of such trials may be the only source of information regarding some interventions. It is also important not to dismiss radiologic research just because the methods are not as rigorous as those found elsewhere. There is an important distinction between the best evidence in theory and the best available evidence in practice.
| QUALITY ASSESSMENT OF SYSTEMATIC REVIEWS |
|---|
Just because a review is "systematic" does not mean it is automatically superior. The methodologic quality of an individual systematic review may be judged just as readily as that of the primary studies from which it is derived, by using evidence-based practice methodology. For example, how comprehensive and unbiased was the search for studies? Are the databases described, along with the timescale, the search terms, and the number of researchers? Were the inclusion and exclusion criteria stated explicitly, are they reasonable, and are they discussed? Are the demographic data of the patient populations and clinical settings described for the component studies? Are the definitions of a positive finding and a negative finding given for both the new test and the reference test? What was their temporal separation? Are technical failures reported? Does the review present 2 x 2 tables of extracted binary data or cutoff points for continuous data? If meta-analysis has been performed, was this a reasonable thing to do and was heterogeneity of the primary studies tested for (and, if so, how)? Are the methods of the meta-analysis described adequately? How many studies and patients were included in the meta-analysis? Are the reasons for exclusions stated? If there was no attempt at performing a meta-analysis, why not? This list could go on but makes the point that not all systematic reviews are equal (64–66). Systematic reviews of systematic reviews are already here (67)!
| THE COCHRANE COLLABORATION |
|---|
The development of the Cochrane Collaboration logo makes for interesting reading: The logo illustrates a forest plot of results of seven randomized controlled trials on the effects of administering steroids to women who were about to give birth prematurely. The first trial was reported in 1972 and the last in 1980, and the point estimate derived from them would have shown the clear benefit of giving steroids had these trials been reviewed systematically at that time. Unfortunately, a systematic review was not performed until a decade later, and thousands of babies died despite the availability of information that could have saved them.
The main work of the Cochrane Collaboration is performed by Collaborative Review Groups supported by the Cochrane Centres around the world. The groups prepare and maintain the Cochrane Reviews, which are held in a central repository, and the Cochrane Library, which may be purchased on a CD-ROM or subscribed to on the Internet (Appendix). There are also method groups, which advise on statistical methods, and extensive consumer groups. There are considerable logistic, methodologic, and ethical challenges facing the Cochrane Collaboration, but the alternative is to make decisions about health care without regard to the best available evidence. At the time of writing this report, only randomized controlled trials are reviewed, but it is likely that the reviews will extend to assessments of diagnostic accuracy in the near future.
| APPENDIX |
|---|