Radiology
DOI: 10.1148/radiol.2431051823
(Radiology 2007;243:13-27.)
© RSNA, 2007


Evidence-based Radiology Series

Evidence-based Practice in Radiology: Steps 3 and 4—Appraise and Apply Systematic Reviews and Meta-Analyses1

Steve Halligan, MD, FRCP, FRCR and Douglas G. Altman, DSc

1 From the Department of Specialist Radiology, University College Hospital, Podium Level 2, 235 Euston Rd, London NW1 2BU, England (S.H.); and Cancer Research UK–NHS Centre for Statistics in Medicine, Wolfson College Annexe, Oxford, England (D.G.A.). Received November 15, 2005; revision requested December 21; revision received February 23, 2006; final version accepted March 10. Address correspondence to S.H. (e-mail: s.halligan{at}ucl.ac.uk).


    ABSTRACT
 TOP
 ABSTRACT
 WHAT IS SYSTEMATIC REVIEW...
 META-ANALYSIS, RANDOMIZED...
 HOW ARE SYSTEMATIC REVIEWS...
 INTERPRETATION AND CONSEQUENCES...
 APPLYING RESULTS OF SYSTEMATIC...
 LIMITATIONS AND PROBLEMS OF...
 QUALITY ASSESSMENT OF SYSTEMATIC...
 THE COCHRANE COLLABORATION
 APPENDIX
 References
 
A systematic review is performed in an attempt to answer a specific research question by means of objective, unbiased evaluation of all pertinent available evidence. Component primary studies are selected on the basis of quality, and, if possible, their results are combined mathematically by using a process known as meta-analysis. While systematic review and meta-analysis are well-established methods to assess trials of therapeutic effects, they are increasingly common in studies of diagnostic tests. In this article, the authors describe the benefits of a systematic approach over the traditional narrative review, illustrate the process, and examine some problems that are specific to systematic review and meta-analysis of diagnostic tests. They also explain how systematic review can help guide methodologic development for future research.

Evidence-based medicine is "the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients. The practice of evidence-based medicine means integrating individual clinical expertise with the best available external evidence from systematic research" (1). Evidence-based medicine has become something of a mantra over the past decade for a variety of reasons. First, there has been an explosion of "evidence"—more medical research is being performed than ever before and findings are accumulating rapidly and becoming increasingly available to both clinicians and their patients, not least because of the Internet. Patients and their advocates are increasingly well informed, and, whereas they once accepted their physician's advice unhesitatingly, there is an increasing awareness of choice, uncertainty, and accountability. At the same time, clinicians want the best for their patients and, furthermore, their actions are under ever-closer scrutiny. On a broader scale, health policymakers are reluctant to fund procedures without clear evidence of effectiveness. There is a lot of information "out there"—some of it good, some of it not so good, and some of it downright bad. Doctors, patients, and policymakers all need to synthesize information in an objective manner so that it is comprehensible and useful. Systematic review is one way of achieving this. As part of this series on evidence-based practice and how it relates to radiology, we will explain the benefits of a systematic review and provide an overview of how to perform one, with particular reference to studies of diagnostic tests. We will illustrate our article with examples from our own collaboration, which is a systematic review of computed tomographic (CT) colonography (2).


    WHAT IS SYSTEMATIC REVIEW AND WHY DO WE NEED IT?
A very popular way to access information about good clinical practice is to search for a published traditional review. Such a "narrative review" is a descriptive overview, by an expert (or experts), of a selection of published studies (3). Narrative reviews remain hugely popular. For example, of the articles published in Radiology in 2004, eight of the 10 most accessed articles online (80%), including the most accessed article (4), were narrative reviews. The Radiology series "State of the Art" and "How I Do It" are typical examples of narrative reviews, in which experts are invited to write for the journal (like the article you are reading!). However, objective definition of an "expert" remains elusive, and experts, like the rest of us, have their own biases and inclinations: We are all familiar with the saying, "Ask 10 different experts and you will get 10 different answers." While experts have been perceived as the guardians of best practice, this view is increasingly being challenged by claims that experts are out of touch with everyday experience, do not recognize their own limitations, and often fail to acknowledge the expertise of others (5,6). Biased descriptions of evidence are common, and it is well recognized that studies with findings that are in line with the author's opinion are reported selectively (6,7). Experts may also be influenced by their medical specialty.

Imagine a radiologist who is deciding whether to set up a CT colonography service to screen for colorectal cancer. First, he wants to establish how good the conventional radiologic test is (the barium enema examination) and searches for a narrative review. However, he finds that barium enema examination is either good or bad depending on whether the expert writing the review is a radiologist (8) or a gastroenterologist (9). Frustrated, our radiologist decides to seek expert opinion firsthand: he attends a CT colonography course in Rome and considers whether he could use CT colonography to examine patients with ulcerative colitis. However, when he asks the expert panel at the course if this is reasonable, they cannot agree (10). The more controversial the subject, the more likely it is that expert opinion will diverge. Indeed, the courtroom is perhaps the best example of how experts may disagree despite considering the same set of circumstances.

To get unbiased advice, perhaps our radiologist should look at information from research studies? The result from a well-designed and performed randomized trial is generally regarded as the best evidence and is held in higher esteem than "the clinical experience of respected authorities" (11). Our would-be colonographer looks for research to answer a seemingly simple question: "Should an intravenous spasmolytic agent be administered during CT colonography?" Perplexed, he discovers that some researchers find it beneficial (12) while others do not (13).

Unfortunately, too many research studies are poorly designed and executed (14). Journals must fill their pages to be profitable, and poor quality does not necessarily preclude publication. Small, underpowered studies from single centers are far more common than are high-quality trials (15), but these single-center studies are frequently too weak to enable authors to detect the difference being investigated or to exclude it. The field of radiology is especially prone to this trend because the pace of technologic change generates constant pressure to rapidly assess new equipment. Studies that have been carefully planned and performed are likely of higher quality than are those that have been executed rapidly. Also, technologic assessments are easier to perform than are studies to determine the therapeutic effect of radiologic procedures. However, "better" technology frequently fails to translate into better health outcomes for patients, for whom the difference between a four–detector row and a 64–detector row CT scanner is likely to be immaterial (16,17).

Single-center studies are also frequently performed in centers of excellence with specialized radiologists acting as observers. Results of these studies are often not generalizable as a consequence. It is also recognized that reports of trials that have positive findings are more likely to be published (and therefore accessible) than those that do not (18,19), a phenomenon known as "publication bias." This bias distorts estimates of the real-world performance of the procedure in question. Furthermore, as with narrative reviews, the results that authors report and the way they interpret them may be influenced heavily by the specialty of the principal investigator. For example, our radiologist looks for research evidence that CT colonography can depict colonic polyps and finds that articles from single-center studies led by radiologists say it can (20), whereas those from studies led by gastroenterologists say it cannot (21). Surely multi-center studies must be better? In general they are, but our radiologist finds that results also seem to vary with the specialty of the principal investigator: CT colonography is given a "thumbs-up" from radiologists (22) but not from gastroenterologists (23). Again, expert opinion cannot differentiate good from bad with certainty (24).

How can our radiologist rationalize this conflicting information? Increasingly, he would be advised to consult a systematic review. Systematic reviews are articles that summarize other articles (25,26) and are therefore sometimes called "secondary research" or "secondary evidence," as discussed earlier in this series (27). They describe the body of work available on a topic by identifying all relevant articles (known as the "primary studies"), extracting relevant information from them (about methods and findings), and summarizing their results (Fig 1). The difference between a properly conducted systematic review and a narrative review is that there is a formal, careful search procedure for the former so that all relevant research is identified (Fig 2). Also, an explicit selection process based on objective quality criteria determines which studies are included in the review rather than inclusion being based on the whims and biases of the authors. This sifting procedure introduces concepts of both inclusion and exclusion criteria that act as quality filters to let good studies in and keep bad ones out. This procedure also provides an estimate of the relative proportions of high- and low-quality research available on a topic.


Figure 1: Flow chart depicts the step-by-step process for systematic review of a diagnostic imaging test.

 

Figure 2: Chart compares the characteristics of narrative and systematic reviews.

 
Study selection for a systematic review is thus more transparent than for a traditional narrative review. Identification of evidence and quality sifting are performed by the systematic reviewers on behalf of the reader, and information is available in one place, distilled and accessible. For example, the Cochrane Collaboration (28), which is discussed further later in this article, is an organization that generates such reviews. It is important to note that systematic reviews aim to provide an unbiased assessment of the quality of available research on a particular topic. If the findings of the review suggest that available data are of poor quality, then the findings can be used as the basis for suggestions as to how to improve the situation.


    META-ANALYSIS, RANDOMIZED CONTROLLED TRIALS, AND STUDIES OF DIAGNOSTIC TESTS
If a systematic review contains articles on enough primary studies of sufficient quality, and their data are suitable, then it may be possible to synthesize their results mathematically. In simplistic terms, the results of a number of small studies are combined to arrive at the result that might have been obtained from a single large study. This procedure is known as "meta-analysis." To date, most meta-analysis has been conducted to deal with treatment effects following a therapeutic intervention, typically as part of a randomized controlled drug trial. Patients in such trials are allocated by chance to either the new treatment or a control group (eg, an established drug or placebo). Famously, for example, the disastrous effects of administering class I antiarrhythmic agents after myocardial infarction were hinted at in one of the first systematic reviews of controlled trials, in the face of attractive theoretical reasons for administration (29,30).
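As a rough illustration of what combining results "mathematically" can involve, the sketch below pools per-study effect estimates by inverse-variance (fixed-effect) weighting, so that more precise studies count for more. This is one common textbook approach, not a method prescribed in this article; the function name and all numbers are hypothetical.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Return the inverse-variance pooled estimate and its standard error.

    Each study is weighted by 1 / SE^2, so larger (more precise) studies
    contribute more to the pooled result.
    """
    if not estimates or len(estimates) != len(standard_errors):
        raise ValueError("need one standard error per estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical effect estimates (for example, log odds ratios) from three
# small trials, with their standard errors.
effects = [0.10, 0.30, 0.20]
std_errors = [0.20, 0.20, 0.10]
pooled, pooled_se = pool_fixed_effect(effects, std_errors)
```

The pooled standard error is smaller than that of any single study, which is the sense in which several small studies can approximate one large one. The sketch assumes the studies estimate the same underlying effect; when they do not, other models are needed.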

While it is quite possible to have randomized controlled trials of technology implementation, such trials are very unusual in radiology. Radiologic research is usually concerned with the accuracy of a diagnostic test rather than the effect of its implementation on patients' clinical outcome. Clinicians interpreting the results of imaging need accurate information regarding performance characteristics. Is the test accurate enough in the appropriate clinical scenario? This information is exactly what our fictitious CT colonographer wants. The reasons for undertaking a systematic review of a diagnostic test (be it an imaging test, blood test, or pathologic investigation) are exactly the same as for a therapeutic intervention—that is, to collect and, if possible, synthesize information to arrive at a performance estimate that is as precise as possible given the available data (31). Moreover, the process (ie, defining the question, searching the literature, extracting data) is identical to that used for systematic reviews of randomized controlled trials. However, analysis is different, because in good studies of diagnostic tests authors generally report pairs of summary statistics rather than a single overall effect (31). It is also fair to say that statistical methods for combined analysis of studies of diagnostic accuracy are less well established and mature than those for randomized controlled trials (32).
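The "pairs of summary statistics" reported by studies of diagnostic tests are typically sensitivity and specificity. The following minimal sketch shows how the pair is derived from a study's 2 x 2 table of test result against reference standard. The function name and the counts are hypothetical and are not taken from any study cited here.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute (sensitivity, specificity) from 2 x 2 table counts.

    tp: diseased patients with a positive test (true positives)
    fn: diseased patients with a negative test (false negatives)
    tn: non-diseased patients with a negative test (true negatives)
    fp: non-diseased patients with a positive test (false positives)
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one non-diseased patient")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical study: 50 patients with polyps, 100 without.
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=90, fp=10)
```

Because the two values trade off against each other, they cannot simply be collapsed into one number, which is why analysis differs from that of a single treatment effect.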


    HOW ARE SYSTEMATIC REVIEWS PERFORMED?
General principles apply to all systematic reviews regardless of the type of primary study being assessed (eg, randomized controlled trial of a therapeutic agent or technical assessment of a diagnostic test), and we outline these below. However, some of the specifics will inevitably vary—notably, consideration of methodologic quality and methods of statistical analysis. From the outset, precise formulation of the topic of research, the questions that need answering, and the intended analyses will make the job easier for the researchers. With this in mind, it is worth pointing out that one of the authors (S.H.), hoping to perform a systematic review of CT colonography, was astonished to be asked by the other (D.G.A.) for a fully written study protocol. There is a widely held belief that systematic reviews are not "real research" because patients are not approached directly and the research has "already been done" by others. Misguided investigators believe that data for a systematic review can be assembled rapidly and a meta-analysis performed quickly, with the whole process taking just a few weeks; in reality, nothing could be further from the truth. Like the articles on which they are based, systematic reviews themselves also have variable quality (more on this in a later section). The best systematic reviews will have a clearly written research protocol, which should detail the research question, methods, inclusion and exclusion criteria, outcome measures, and the intended analysis. However, while a protocol instills much objectivity, systematic reviews necessarily require a degree of subjectivity also.

1. WHAT IS THE RESEARCH QUESTION?
Each systematic review should start with a clear and specific question. It is also self-evident that the correct answer to the question should not be blindingly obvious. Otherwise, why bother? The research question must be defined with precision since this will determine what primary studies are searched for and what data are extracted from them. For example, compare the following questions: "Is CT scanning any good?" "How sensitive and specific is CT colonography for the identification of patients with and those without colon polyps?" Trying to decide which primary studies to select for the former review question would be almost impossible, whereas the second question already largely indicates the type of study required (ie, studies of CT colonography performed in humans with and without colonic polyps). Primary studies are selected on the basis of their relevance to the research question through the setting of objective inclusion and exclusion criteria.

2. INCLUSION AND EXCLUSION CRITERIA
Whether a primary study is selected for a systematic review or not revolves around two issues: relevance to the research question and methodologic quality. Relevance to the research question is most important because all potential primary studies that deal with the topic (even in some tangential way) must be identified; otherwise, the review is not "systematic." Thus, inclusion criteria revolve around the research question being asked.

Once inclusion criteria for primary studies have been determined, the next step is to define exclusion criteria that are based on methodologic quality. These criteria will dictate which of the primary studies identified by the search are ultimately discarded because their scientific methods are too poor for their results to be credible. Methodologic quality can be further broken down into two components: the technical methods used to perform the diagnostic test and the potential for bias because of the study design. For example, studies of CT colonography in which authors do not employ some form of three-dimensional rendering for image visualization might be excluded because the technology is too outdated. On the other hand, while the technical methods may be entirely acceptable, the study design may be unacceptable; for example, observers may be aware of the results from other tests when interpreting CT colonography images.

As we have said, the quality of research articles in radiology is inevitably variable and has been deficient. For example, in an evaluation of 54 articles published 4 years after the clinical introduction of magnetic resonance (MR) imaging, authors found the methods used were generally poor; for instance, in only 22% of the 54 studies were the results of MR imaging verified with an independent comparator (33). Synthesis and meta-analysis of poor studies, for which results are not credible, will be imprecise and of little value. Arriving at the right exclusion criteria is therefore crucial. If the bar is set too high then very few studies will meet the standard required and potentially useful data will be discarded. Conversely, if the bar is set too low then the fundamental principle of excluding poor-quality evidence will not apply. Systematic review cannot make a good study out of several bad ones: "Rubbish in equals rubbish out."

By way of example, how should our aspiring colonographer go about defining his inclusion and exclusion criteria for a systematic review of CT colonography? Restricting the review to primary studies of human subjects with real polyps is a reasonable starting point, since this relates directly to his research question ("How sensitive and specific is CT colonography for identification of patients with and those without polyps?"). This criterion allows narrative reviews, animal studies, phantom experiments, and studies with artificial polyps to be excluded. In regard to methodologic quality, it is firmly established that an independent comparator is required to validate the results of CT colonography (usually colonoscopy). Primary studies without a comparator should be excluded because their results will not be credible. It is also well established that the results of the test being assessed should not be biased by knowledge of results from the comparator. Thus, in good primary studies CT colonography will have been performed first and the results established before those from subsequent colonoscopy are known. It might be possible to include studies in which colonoscopy was performed first but only if it is absolutely clear that the CT colonographic images were interpreted without knowledge of endoscopic findings (ie, "blind" interpretation).

Reviewers also need to be aware of "spectrum bias." For example, the protocol for the primary study might stipulate that only those patients known to have polyps were recruited in order to increase the prevalence of abnormality in the data. In these circumstances, CT image observers will have a very high a priori expectation of finding a polyp even if the exact endoscopic findings are unknown to them; such studies should be excluded because bias is overwhelming.

It is also helpful to search for established consensus regarding technical performance of the test under investigation (and its comparator), which will exclude studies in which procedures were sloppy or outdated. For example, if our colonographer consulted the consensus statement from the Fourth International Symposium on Virtual Colonoscopy (34), he would find that full bowel preparation and prone and supine scanning are considered mandatory. Studies that do not employ these methods should be excluded.

Other factors, though, are not so clear-cut, and it is these factors that usually determine exactly where the bar is set. For example, although bowel preparation is generally considered mandatory for CT colonography, many different preparations are used. Insistence that all of the primary studies employ exactly the same preparation will result in good studies being rejected. Similarly, while insisting that studies employ multi–detector row CT scanners might be reasonable, insisting that all use the same collimation, amperage, pitch, and reconstruction interval is not. While these factors might have some influence on the outcome of the individual study, they are unlikely to have as much influence as the methodologic flaws and biases already discussed.

Our view is that a broadly inclusive approach to primary studies should be adopted whenever there is uncertainty, because this approach provides more data for analysis than an exclusive stance. An inclusive approach also provides a broader overview of the methodologic quality of research in a given field (although it does mean more work!), which is especially valuable when an aim is to develop guidelines to improve future research methods. Some pilot searching may be necessary before inclusion and exclusion criteria are set definitively in order to get an idea of what data are available—it is no good to decide to perform a systematic review of randomized trials of CT colonography when none exist. Consequently, it is important to work hard at defining inclusion and exclusion criteria at the outset. Time spent at this stage will pay dividends later, not least because data extraction is arduous and tedious, and ill-defined research questions and inclusion and exclusion criteria will probably mean that extraction has to be repeated.

So, having thought hard about his research question and the type of primary study necessary to help answer it, our would-be colonographer might draw up a set of inclusion and exclusion criteria like those in Figure 3.


Figure 3: Potential inclusion and exclusion criteria for study selection for a systematic review of CT colonography.
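The criteria discussed in the text of this section can be thought of as an explicit checklist applied identically to every candidate article. The sketch below encodes a few of them; it does not reproduce the contents of Figure 3, the list is not exhaustive, and every field name and example record is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateStudy:
    """Screening answers for one article (hypothetical field names)."""
    study_id: str
    original_research: bool          # False for narrative reviews
    human_subjects: bool             # False for animal or phantom work
    real_polyps: bool                # False for artificial polyps
    independent_comparator: bool     # for example, colonoscopy
    blinded_interpretation: bool     # CT read without comparator results
    full_bowel_preparation: bool
    prone_and_supine_scanning: bool


# Each check pairs a field with the reason recorded when it fails.
CHECKS = [
    ("original_research", "not original research"),
    ("human_subjects", "not performed in human subjects"),
    ("real_polyps", "no real polyps"),
    ("independent_comparator", "no independent comparator"),
    ("blinded_interpretation", "interpretation not blinded to comparator"),
    ("full_bowel_preparation", "no full bowel preparation"),
    ("prone_and_supine_scanning", "prone and supine scanning not both used"),
]


def screen(study):
    """Return (included, reasons_for_exclusion) for one candidate study."""
    reasons = [reason for field, reason in CHECKS if not getattr(study, field)]
    return (not reasons), reasons


eligible = CandidateStudy("S1", True, True, True, True, True, True, True)
unblinded = CandidateStudy("S2", True, True, True, True, False, True, True)
```

Recording the reason for each exclusion, rather than only the decision, keeps the selection transparent and lets the reviewers report how many studies failed each criterion.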

 
3. WHAT DATA NEED TO BE COLLECTED?
To a large extent the research questions determine what data need to be collected from the primary studies that have satisfied the inclusion criteria for the review. For example, our colonographer has asked, "How sensitive and specific is CT colonography for the identification of patients with and without colon polyps?" It follows that he will be looking for data from individual studies that describe the sensitivity and specificity of CT colonography for polyps. While it may be fine to stop at that point, valuable information on other issues might be provided by the review. For example, we have already said that our colonographer is interested in whether an intravenous spasmolytic agent should be administered, so it would be sensible to collect data on this; it may be possible to perform a subset analysis to determine whether administration improves sensitivity and/or specificity. Indeed there are many interesting additional questions that could be investigated; collimation is obvious ("Does thinner collimation improve polyp detection?") but there are others: the type of bowel preparation used, observer experience, whether screening or symptomatic patients were studied, and so on. Because there are possible additional questions, it is important to define which data are of most interest and which can be left for another day: Extraction will be unmanageable if too many variables are needed from each primary study, but if there are too few variables the opportunity for additional information and analysis is lost. Comprehensive data collection must be balanced against volume of work, which is determined by the number of primary studies likely to be included (discussed in the next section). Such decisions should be made at the outset and included in the study protocol, not least because it is not possible to make unbiased judgments once all the data have been collected and it is known what the effect of such decisions will be.
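One way to keep extraction manageable is to fix the list of variables in advance as a structured record, with an explicit "not reported" value for each optional item. The sketch below does this for the spasmolytic question and shows a deliberately crude subset comparison that simply sums counts across studies; a real subset analysis would use a proper meta-analytic model. All names and data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedStudy:
    """Variables extracted from one included study (hypothetical fields)."""
    study_id: str
    true_positives: int                # patients with polyps, CT positive
    false_negatives: int               # patients with polyps, CT negative
    spasmolytic_given: Optional[bool]  # None when the article does not say


def subset_sensitivity(studies, spasmolytic_given):
    """Crude sensitivity for the subset of studies with the given status.

    Counts are summed across studies; studies that did not report
    spasmolytic use are left out of both subsets.
    """
    subset = [s for s in studies if s.spasmolytic_given is spasmolytic_given]
    tp = sum(s.true_positives for s in subset)
    fn = sum(s.false_negatives for s in subset)
    return tp / (tp + fn) if (tp + fn) else None


studies = [
    ExtractedStudy("A", 40, 10, True),
    ExtractedStudy("B", 30, 20, False),
    ExtractedStudy("C", 45, 5, True),
    ExtractedStudy("D", 25, 25, None),   # spasmolytic use not reported
]
with_agent = subset_sensitivity(studies, True)
without_agent = subset_sensitivity(studies, False)
```

Each extra field on the record multiplies the extraction work by the number of included studies, which is the trade-off described above.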

It is important that some of the data collected describe the methodologic quality of the primary studies, particularly when dealing with a new test relatively early in its evaluation and certainly if there is an intention to use the results of the review to inform and improve methods for future studies. The inclusion and exclusion criteria reflect the minimum acceptable standard for selection—the quality of included studies will inevitably vary above and beyond this and some will be better than others. This is especially so if a broadly inclusive approach to selection has been adopted (see section 2). Markers with which to determine the methodologic quality of selected studies are needed.

For example, our researcher has decided that he will exclude studies in which authors do not verify the results of CT colonography with colonoscopy. However, beyond this there are aspects related to colonoscopy that will act as markers for methodologic quality. For example, is the experience of the colonoscopists specified? Do the authors state the equipment used? Is the proportion of patients in whom there was incomplete colonoscopy reported? How were polyps detected at colonoscopy measured—that is, does the article merely state that "polyps were measured," or does it describe exactly how this was done? Such factors indicate quality of data reporting, and better-quality studies tend to have fuller descriptions of their methods. Also, it is recognized increasingly that colonoscopy is an imperfect reference test—polyps may be missed (35). In an attempt to combat this, some researchers have stipulated that colonic segments are re-examined by the colonoscopists if the CT report suggests that they have missed a polyp. This creates an "enhanced" reference standard that incorporates both CT and endoscopic assessments, against which images from CT and the initial colonoscopy examination can be compared independently (22). Excluding studies from the systematic review because they have not used this method is probably unreasonable (too many valuable studies would be discarded), but collecting data on whether it was employed may allow valuable subset analyses.

4. FINDING AND SELECTING PRIMARY STUDIES
So, our researcher is now armed with a research question, inclusion and exclusion criteria, and a set of primary and secondary questions that describe the data he needs to extract from the literature. A basic premise of good systematic review is that all available studies should be included if they make the grade—it is vital not to miss eligible studies. So, how does our researcher access all of the available literature?

Perhaps the best and easiest place to start is an electronic database, such as MEDLINE (36) or EMBASE (37). MEDLINE is compiled by the National Library of Medicine and is freely available through the National Institutes of Health Internet portal, PubMed (38). MEDLINE indexes several million articles, and approximately 400 000 more are added each year. In an earlier article in this series, Staunton (27) described how to perform a MEDLINE PubMed search; the bedrock is to use search terms that are most likely to capture the studies you are interested in. The terms used to index (and thus identify) articles vary despite articles being very similar. For example, CT colonography is also known as virtual colonoscopy, and some studies will be missed if this search term is omitted. Virtual endoscopy is another example. It is worth spending considerable time deciding on the search terms that will best capture all of the data before an electronic database is searched in earnest. CT colonography was first described in 1994 (39), so it is clear that the search period need not be earlier than this. However, depending on the research question, it may be necessary to search over several decades. Such decisions should be specified up front in the study protocol for the systematic review along with the search terms to be used for electronic databases.
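To make the point about synonyms concrete, the sketch below assembles the alternative names mentioned above into a single Boolean OR expression and keeps the earliest publication year alongside it, so that both can be written into the protocol. It produces a generic search string only; field tags and date-filter syntax differ between databases and are deliberately not shown. The function name is hypothetical.

```python
def build_query(synonyms):
    """Join alternative terms for the same test into one OR expression."""
    if not synonyms:
        raise ValueError("at least one search term is required")
    return " OR ".join(f'"{term}"' for term in synonyms)


# Alternative names for the test, as mentioned in the text.
search_terms = ["CT colonography", "virtual colonoscopy", "virtual endoscopy"]
query = build_query(search_terms)

# CT colonography was first described in 1994, so earlier years need not
# be searched; apply this limit using the chosen database's own date filter.
earliest_year = 1994
```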

MEDLINE, for example, indexes only approximately 30% of published medical articles (36), so other databases may be needed for coverage sufficient to identify all relevant research. Some databases are topic-specific, such as CancerLit (40), while others are method-specific, such as the Cochrane Collaboration (41), which indexes randomized controlled trials only. Thorough systematic reviews often supplement their electronic searches manually—for example, by hand-searching relevant journals. Like the search terms for the electronic search, the journals and the dates over which they are to be searched should be prespecified in the study protocol and broad enough to capture all of the information required.

A manual search for CT colonography articles should not be restricted to radiologic journals but should also encompass key surgical, gastroenterologic, and endoscopic journals, as well as major general journals (eg, Lancet, Journal of the American Medical Association, British Medical Journal, and New England Journal of Medicine). Decisions regarding the type of study should have been specified in the protocol. For example, are only peer-reviewed full publications acceptable or are abstracts also eligible? What about unpublished data? Abstracts and unpublished data pose considerable problems. Meeting proceedings and abstracts are numerous, and peer review is far less rigorous for these than for a full publication. Perhaps more important, little methodologic information is available directly from an abstract because of size constraints. Information may be preliminary and possibly unreliable. However, if the area of research is very novel with few full articles published, then abstracts can be a source of important information.

Every study encountered during the search will need to be judged against the inclusion and exclusion criteria and a decision made whether to reject or retain the study for formal data extraction. Depending on the research question and the search terms used, hundreds or even thousands of articles may be identified, which is a daunting prospect. Luckily, electronic databases facilitate selection because the abstract of each article is usually readily available. On reading the abstract alone, it is often immediately apparent that the article should be rejected. For example, were humans investigated or not? The PubMed Web site also specifies if articles are reviews rather than original research. If there is any doubt whether an article should be selected or not, then its details should be noted along with those of articles that are definite candidates. It is worthwhile using more than one researcher for this initial sift, not because this saves time, but because the chance of articles being missed is reduced if each searches the database independently.

All articles that remain viable after assessment of their abstracts will need to be retrieved in full for a more informed judgment against the inclusion and exclusion criteria. This retrieval will entail reading the full text of the article if it is available online or photocopying the whole article from the journal. Articles in journals that are not carried by your local library will need to be ordered. It is worth pointing out that some articles will not be written in English although the abstract often is. Excluding articles merely because you cannot understand them will inevitably result in bias against non-English speaking researchers, which is something that should be avoided. Articles should be translated so that a proper judgment can be made. As for the initial sift, it is advisable that at least two researchers assess each full article against the inclusion and exclusion criteria independently. There will inevitably be situations where inclusion is borderline or controversial. This can be resolved by face-to-face discussion, perhaps by using others to arbitrate. A full list of excluded articles should be kept, along with the reason(s) for exclusion, so that this information can be presented in the final systematic review.
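The two-reviewer process can be sketched in a few lines. The example below is only an illustration under stated assumptions: each reviewer records `None` for "include" or a short reason string for "exclude", both reviewers have judged the same set of articles, and the article identifiers, reasons, and the function name `reconcile` are hypothetical.

```python
def reconcile(reviewer_a, reviewer_b):
    """Return (included, exclusion_log, disputed) from two independent sifts.

    Each argument maps an article id to None (include) or a reason (exclude).
    Assumes both reviewers judged exactly the same article ids.
    """
    included, exclusion_log, disputed = [], {}, []
    for article_id in sorted(reviewer_a):
        reason_a = reviewer_a[article_id]
        reason_b = reviewer_b[article_id]
        if (reason_a is None) != (reason_b is None):
            # One reviewer would include, the other exclude: needs discussion.
            disputed.append(article_id)
        elif reason_a is None:
            included.append(article_id)
        else:
            # Keep every stated reason so it can be reported in the review.
            exclusion_log[article_id] = sorted({reason_a, reason_b})
    return included, exclusion_log, disputed


# Hypothetical decisions for four articles.
a = {"A1": None, "A2": "narrative review", "A3": None, "A4": "not human subjects"}
b = {"A1": None, "A2": "narrative review", "A3": "no reference test", "A4": "no reference test"}
included, exclusion_log, disputed = reconcile(a, b)
print(included, exclusion_log, disputed)
```

Articles in `disputed` are the ones to take to face-to-face discussion or arbitration; `exclusion_log` is the list of excluded articles with reasons.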

5. EXTRACTING THE DATA
Hopefully the procedures described above will have resulted in hundreds of potential articles being whittled down to those that definitely qualify for the systematic review. For example, our colonography researcher is relieved to find that approximately 75% of the indexed literature relating to CT colonography consists of narrative reviews that can be quickly and easily dismissed. Once the primary studies have been selected, and full-text articles on these studies are available, the pertinent data to answer the review questions must be extracted. Just as for study selection, it is sensible to use more than one researcher to extract data independently in order to check consistency. It may even be worthwhile to blind reviewers to the origin of the article and its authors so that assessment is truly unbiased. However, this may be impossible where the reviewers have extensive a priori knowledge of the existing literature in the field. Indeed, it is strongly advisable that at least one member of the review team has expertise in the specific topic of the review.

To facilitate extraction and subsequent analysis, data should be noted in a study table designed with the research questions in mind (see section 3). For diagnostic tests, the most obvious questions relate to sensitivity and specificity. For example, how many patients had polyps seen at CT colonography? How many of these polyps were verified with the reference test? The data collected for systematic reviews of diagnostic tests differ from those for systematic reviews of treatment effects, and we will explore this topic in the following section. In any event, the extraction table must have fields for all the information that is needed. Apart from the results of the diagnostic test, there will be methodologic features that indicate study quality (see section 3), and the extraction table should have fields to accommodate this. An extraction table for a systematic review of CT colonography might look something like Table 1. It is highly desirable to pilot such a table for a few studies initially to see if it is easy to complete in practice. In particular, it is essential that the reviewers understand all of the issues and that the form is unambiguous.


Table 1. Methodologic and Numeric Data Extracted from a Single Study Comparing CT Colonography with Colonoscopy

 
Once formal data extraction begins, it will rapidly become apparent that not all studies contain enough information to fully complete the table. In some cases it might be possible to obtain information indirectly by calculation. For example, the number of patients with true-negative findings at CT colonography can be calculated if figures for true-positive, false-positive, and false-negative findings are provided. However, the quality of study reporting is generally such that large amounts of pertinent information are frequently missing from the original article despite the fact that it has satisfied selection criteria for the review (the broader the approach to selection, the more often this problem will arise). In such cases it is highly desirable to try to contact one of the authors of the study to ask if the authors are able and willing to supply additional information. In our experience, authors are often very willing to help despite the extra burden, especially if it means that their study will be selected for the systematic review. Inevitably, however, this process takes time.
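One way to picture an extraction record, and the indirect calculation just described, is the sketch below. The field names are hypothetical and far fewer than a real extraction table would need; the derivation of true-negative findings additionally assumes that the total number of patients is reported. The counts are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionRecord:
    """A cut-down, hypothetical extraction record for one primary study."""
    study_id: str
    n_patients: Optional[int] = None
    tp: Optional[int] = None
    fp: Optional[int] = None
    fn: Optional[int] = None
    tn: Optional[int] = None
    spasmolytic_used: Optional[bool] = None  # example methodologic field

    def fill_true_negatives(self):
        """Derive TN when the total and the other three cells are known."""
        known = (self.n_patients, self.tp, self.fp, self.fn)
        if self.tn is None and all(v is not None for v in known):
            self.tn = self.n_patients - self.tp - self.fp - self.fn
        return self.tn

    def missing_fields(self):
        """List fields still empty, eg, to ask the study authors about."""
        return [name for name, value in vars(self).items() if value is None]


record = ExtractionRecord("hypothetical-01", n_patients=54, tp=20, fp=4, fn=9)
record.fill_true_negatives()
print(record.tn, record.missing_fields())
```

Whatever cannot be derived remains in `missing_fields()`, which gives a ready-made list of questions for the original authors.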

6. ANALYSIS AND PRESENTATION OF RESULTS
By this stage, the study table will have been completed as far as possible. In general, there are two types of analysis. First, it is valuable to describe, often in simple terms, the findings of the review with respect to the results and methodologic features in the articles selected (which by inference describes the features of those articles that were rejected). Second, it may be desirable to perform a meta-analysis. Whether or not meta-analysis is possible will depend on the study characteristics and the nature of the data.

The performance of a diagnostic test can be summarized in a variety of ways, but sensitivity and specificity are perhaps most familiar to radiologists. With our example of CT colonography, in good studies authors will have performed the experimental test (ie, CT) in patients whose true disease status (ie, whether patients have polyps) has been established by means of an independent reference test (ie, colonoscopy). Also in good studies authors will have performed CT colonography in both patients with polyps and patients without polyps. There are thus four possible outcomes for the results of the new test: Patients with polyps that are correctly detected at CT colonography have a true-positive (TP) finding with respect to the new test, and patients without polyps who also have a negative result at CT colonography have a true-negative (TN) finding. Few tests are perfect, and so there are inevitably two further outcomes: Patients with polyps that are missed at CT colonography will have a false-negative (FN) finding, and patients with no polyps, but in whom CT findings suggest there is an abnormality, have a false-positive (FP) finding. These four test characteristics can be combined in the form of a 2 x 2 table (Table 2) and a variety of summary statistics can be calculated. Sensitivity can be understood simply as the extent to which the new test helps correctly identify a patient with disease. Similarly, specificity can be understood as the extent to which the new test rejects patients who do not have disease: sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP).


Table 2. Per-Patient Data from a Single-Center Study Comparing CT Colonography with Colonoscopy

 
An example from a study of CT colonography (42) is given in Table 2. The numbers from this table give a sensitivity of 0.69 (calculated as 20/[20 + 9]) and specificity of 0.84 (calculated as 21/[21 + 4]).
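The arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function names are chosen for illustration, and the four counts are those quoted above for Table 2.

```python
def sensitivity(tp, fn):
    """Proportion of patients with disease who have a positive test."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of patients without disease who have a negative test."""
    return tn / (tn + fp)


# Counts quoted in the text for Table 2.
tp, fn, tn, fp = 20, 9, 21, 4
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```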

There are also other summary statistics that are less familiar than sensitivity and specificity but perhaps more informative. For the result of any diagnostic test, we can compare the probability of getting that result if the patient truly has the disease with the corresponding probability if the patient is healthy. The ratio of these two probabilities is the likelihood ratio (LR). For our example of CT colonography, the result of the test is binary: the test either shows polyps or does not. Such tests have a positive LR (LRpos) and a negative LR (LRneg), which describe the discriminatory powers of a positive and negative test result, respectively: LRpos = sensitivity/(1 – specificity) and LRneg = (1 – sensitivity)/specificity.

A positive likelihood ratio greater than 10 and a negative likelihood ratio less than 0.1 provide "convincing" diagnostic evidence; a positive likelihood ratio greater than 5 and a negative likelihood ratio less than 0.2 provide "strong" diagnostic evidence (43). Again, by using the data in Table 2, LRpos = 4.3 (calculated as 0.69/[1 – 0.84]) and LRneg = 0.37 (calculated as [1 – 0.69]/0.84).
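The likelihood ratios and the thresholds quoted above can be sketched as follows. The function names and the label used for values that reach neither threshold are assumptions for illustration; the counts are those quoted for Table 2.

```python
def likelihood_ratios(sens, spec):
    """Return (LRpos, LRneg) for a binary test."""
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return lr_pos, lr_neg


def label_positive(lr_pos):
    """Grade a positive likelihood ratio against the quoted thresholds."""
    if lr_pos > 10:
        return "convincing"
    if lr_pos > 5:
        return "strong"
    return "below the 'strong' threshold"


def label_negative(lr_neg):
    """Grade a negative likelihood ratio against the quoted thresholds."""
    if lr_neg < 0.1:
        return "convincing"
    if lr_neg < 0.2:
        return "strong"
    return "below the 'strong' threshold"


# Sensitivity and specificity from the counts quoted for Table 2.
sens, spec = 20 / (20 + 9), 21 / (21 + 4)
lr_pos, lr_neg = likelihood_ratios(sens, spec)
print(f"LRpos = {lr_pos:.1f} ({label_positive(lr_pos)})")
print(f"LRneg = {lr_neg:.2f} ({label_negative(lr_neg)})")
```

On these counts neither ratio reaches the "strong" threshold, which is consistent with the values of 4.3 and 0.37 given in the text.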

The odds ratio is another possible summary statistic. While probability refers to the fraction of times you would expect to see a result (and so ranges from 0 to 1), odds are defined as the probability that an event will occur divided by the probability that it will not occur (and so ranges from 0 to infinity). The diagnostic odds ratio refers to the odds of a positive test result in patients with the disease compared with the odds of the same result in patients without the disease. It combines the likelihood ratios in a single statistic: diagnostic odds ratio = LRpos/LRneg.

With data from Table 2, the diagnostic odds ratio for CT colonography is 11.6 (4.3 divided by 0.37). The diagnostic odds ratio is difficult to apply in clinical practice but is convenient when combining studies for a systematic review, since the result tends to be relatively independent of diagnostic threshold, the importance of which is described below. In contrast, the likelihood ratio has considerable practical value since it can be used to quantify increased certainty associated with a positive diagnosis. For this we need to know the prevalence of the disease in the population being studied (ie, what percentage of subjects actually have the disease in question). The odds of having the disease before the test is performed (pretest odds) are therefore: prevalence/(1 – prevalence). The odds of having the disease after a positive test result (posttest odds) are the pretest odds multiplied by the positive likelihood ratio. Likewise, the odds of having the disease after a negative test result are the pretest odds multiplied by the negative likelihood ratio. Thus the likelihood ratio measures the change in certainty of the diagnosis.
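The diagnostic odds ratio can be obtained either from the two likelihood ratios or directly from the cells of the 2 x 2 table as (TP x TN)/(FP x FN); the two routes are algebraically identical. The sketch below, with function names chosen for illustration, shows that the small difference from the quoted figure comes only from rounding the likelihood ratios first.

```python
def dor_from_counts(tp, fp, fn, tn):
    """Cross-product form; undefined if FP or FN is zero."""
    return (tp * tn) / (fp * fn)


def dor_from_lrs(lr_pos, lr_neg):
    """Ratio of the positive to the negative likelihood ratio."""
    return lr_pos / lr_neg


# Counts quoted in the text for Table 2.
exact = dor_from_counts(tp=20, fp=4, fn=9, tn=21)
# Likelihood ratios as rounded in the text.
rounded = dor_from_lrs(4.3, 0.37)
print(f"from counts: {exact:.2f}; from rounded LRs: {rounded:.2f}")
```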

Many studies of diagnostic imaging are performed by using ordered categories to account for differing diagnostic confidence. For example, in a study of CT colonography (44), five categories were used to assess readers' diagnostic confidence regarding polyps: no polyp, possibly a polyp, equivocal, probably a polyp, and definitely a polyp. Sensitivity and specificity can thus be calculated at each of these diagnostic thresholds and the results at each threshold expressed as a single 2 x 2 table (ie, categories above and below an individual threshold are combined). Alternatively, receiver operating characteristic (ROC) curves, familiar to many radiologists, graphically depict the sensitivity and specificity of the test at each diagnostic threshold by plotting the true-positive rate against the false-positive rate: Good tests have curves that rise steeply and pass close to the top left-hand corner, where both sensitivity and specificity equal 1. ROC curves are especially appropriate when the test result is a measurement.
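To make the idea of collapsing ordered categories at each threshold concrete, the sketch below computes one ROC point per threshold. The category names come from the text, but the counts are entirely hypothetical and the function name `roc_points` is an assumption for illustration.

```python
categories = ["no polyp", "possibly a polyp", "equivocal",
              "probably a polyp", "definitely a polyp"]
# Hypothetical numbers of patients assigned to each confidence category.
with_polyps = [3, 2, 4, 6, 15]
without_polyps = [14, 5, 3, 2, 1]


def roc_points(diseased, healthy):
    """Return (false-positive rate, true-positive rate) for each threshold.

    At threshold t, categories t and above count as a positive test,
    which collapses the ordered categories into a single 2 x 2 table.
    """
    n_diseased, n_healthy = sum(diseased), sum(healthy)
    points = []
    for t in range(1, len(diseased)):
        tpr = sum(diseased[t:]) / n_diseased
        fpr = sum(healthy[t:]) / n_healthy
        points.append((fpr, tpr))
    return points


points = roc_points(with_polyps, without_polyps)
for t, (fpr, tpr) in enumerate(points, start=1):
    print(f"positive if '{categories[t]}' or higher: "
          f"sensitivity = {tpr:.2f}, specificity = {1 - fpr:.2f}")
```

Moving the threshold toward "definitely a polyp" lowers sensitivity and raises specificity, which is the trade-off that the ROC curve traces.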

So, we have a variety of summary statistics that can describe the performance of a diagnostic test: sensitivity, specificity, likelihood ratios, diagnostic odds ratio, and ROC curve. Synthesizing these for a meta-analysis is a two-stage process. First, summary statistics are derived for each individual study. These are then pooled together to obtain a single overall estimate across all studies. For our example of CT colonography, our researcher is interested in an overall estimate of sensitivity and specificity for polyp detection. Table 2 shows a 2 x 2 table of data from a single study of CT colonography (42), and sensitivity and specificity derived from these data were detailed previously. Similar tables must be populated as far as possible for each of the individual primary studies selected for the systematic review. Once this has been done, a weighted average can be computed across them.
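The two-stage process can be illustrated with the simplest possible weighting, in which each study's sensitivity is weighted by its number of patients with disease. This is only one choice of weights, assumed here for illustration; formal meta-analysis typically uses more sophisticated models. The first study uses the counts quoted for Table 2; the other two are hypothetical.

```python
studies = [
    {"tp": 20, "fn": 9},   # counts quoted for Table 2
    {"tp": 45, "fn": 5},   # hypothetical
    {"tp": 30, "fn": 10},  # hypothetical
]


def pooled_sensitivity(studies):
    """Weighted average of per-study sensitivities (weights = diseased patients)."""
    # Stage 1: a summary statistic for each individual study.
    weights = [s["tp"] + s["fn"] for s in studies]
    sensitivities = [s["tp"] / w for s, w in zip(studies, weights)]
    # Stage 2: pool across studies.
    return sum(w * p for w, p in zip(weights, sensitivities)) / sum(weights)


pooled = pooled_sensitivity(studies)
print(f"pooled sensitivity = {pooled:.2f}")
```

With these weights the result equals the total number of true-positive findings divided by the total number of patients with disease.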

While it is important to have some basic understanding of the summary statistics that are being pooled in a meta-analysis, we do not intend to describe here the mathematical process since it is well beyond the scope of this article. However, there are some important points that should be kept in mind. Perhaps most important of these is the concept of heterogeneity. Statistical heterogeneity describes the variability found in the results of the individual studies, and the degree determines, to some extent, the choice of meta-analytical method. Heterogeneity can be so profound that formal meta-analysis is illogical, indicating that the studies are too varied to be combined meaningfully.

It is important to understand the difference between statistical heterogeneity (ie, variation in the results of the primary studies) and methodologic heterogeneity (ie, diversity in the primary studies). Methodologic heterogeneity (eg, heterogeneity because of factors related to patient selection, study design, or the reference test) may or may not cause statistical heterogeneity. In any event, a meta-analysis of studies with marked statistical heterogeneity is likely to be worthless. Meta-analysis is most appropriate when the component studies have examined similar patients by using comparable methods and reference standards. An ideal meta-analysis would be performed by using completely homogeneous studies so that the results from each individual study are mathematically perfectly compatible with those of any of the others. Not surprisingly, this homogeneity is rarely achieved in practice, and a degree of heterogeneity is inevitable.

Heterogeneity of component studies can be determined by using a variant of the χ² test, which is used to assess whether individual sensitivities and specificities are comparable. A simple visual assessment can also be made by plotting the sensitivities and specificities from individual studies as an ROC plot (31,32). While some divergence will be inevitable as a result of chance variation, marked divergence may be due to differing patient populations or biases, experimental or otherwise.
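As a hedged illustration, the sketch below computes the ordinary Pearson χ² statistic for homogeneity of sensitivities across studies. This plain form is an assumption made for the example and may not be the exact variant used in any given review. The first study uses the counts quoted for Table 2; the others are hypothetical.

```python
def chi_square_homogeneity(studies):
    """Pearson chi-square statistic and degrees of freedom for k proportions.

    `studies` is a list of (tp, fn) pairs; expected counts assume that every
    study shares the pooled sensitivity.
    """
    total_pos = sum(tp for tp, fn in studies)
    total = sum(tp + fn for tp, fn in studies)
    pooled = total_pos / total
    statistic = 0.0
    for tp, fn in studies:
        n = tp + fn
        expected_pos = n * pooled
        expected_neg = n * (1 - pooled)
        statistic += (tp - expected_pos) ** 2 / expected_pos
        statistic += (fn - expected_neg) ** 2 / expected_neg
    return statistic, len(studies) - 1


# (tp, fn): first pair from Table 2 as quoted, the rest hypothetical.
studies = [(20, 9), (45, 5), (30, 10)]
statistic, df = chi_square_homogeneity(studies)
print(f"chi-square = {statistic:.2f} on {df} degrees of freedom")
```

The statistic is compared with the tabulated χ² critical value for k − 1 degrees of freedom (5.99 at the 5% level when there are 2 degrees of freedom). The same calculation applies to specificities by passing (tn, fp) pairs.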

Unlike heterogeneity between studies of treatment effects, however, heterogeneity between studies of diagnostic tests may actually be caused by differences in the diagnostic thresholds used to define a positive result (the "threshold effect"). For example, in some studies of CT colonography, only polyps 1 cm or larger have been considered, while in others all sizes of polyps have been investigated (2). Obviously we would expect detection to be poorer for small polyps, and a comparison between studies that analyzed all polyps and those that restricted analysis to the largest polyps is clearly inappropriate without accounting for this difference. Moreover, even if all studies used the same size cutoff points, there may still be differences in diagnostic thresholds because of disparity in the technique used to measure the polyp. For example, polyps may have been measured before or after polypectomy.

It is interesting to note that, where variation between studies is caused mainly by differences in diagnostic thresholds, the data points from individual studies tend to have curvatures that parallel the underlying ROC curve for the test in question (31). If observed heterogeneity is due to differing diagnostic thresholds, then summary estimates of sensitivity and specificity will underestimate the capabilities of the test. In this situation, the best approach to meta-analysis is to combine the ROC curves from each individual study in a procedure that creates a single summary ROC curve. Statistical methods for deriving the best-fitting summary ROC curve are necessarily complex and relatively immature (32) and certainly well beyond the scope of this article.

If we imagine that our would-be colonographer has had his data analyzed by a friendly statistician, he might be presented with a forest plot, a well-established format for pictorial representation of a meta-analysis. Figure 4 shows a forest plot for the sensitivity of CT colonography for per-patient detection of colorectal polyps measuring 5 mm or larger. There are 13 primary studies, each represented by a horizontal line with a central black square. The square indicates the point estimate of the study and its area indicates the relative contribution the study makes to the meta-analysis. The horizontal line represents the 95% confidence intervals around the point estimate. Underneath all of these, the point estimate and confidence intervals for the meta-analysis of all 13 component studies are presented as a diamond (point estimate, 0.81; 95% confidence interval: 0.78, 0.83) in this example (Fig 4). Clearly, such a presentation and analysis is possible (and desirable) for specificity and also at different cutoff points—for example, restricted only to polyps measuring 1 cm or larger.
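The elements of one row of a forest plot (point estimate and 95% confidence interval) can be sketched as follows. The Wilson score interval is used here as an assumption; the text does not say how the intervals in Figure 4 were calculated. The text-based rendering, the function names, and the second study are purely illustrative; the first row uses the counts quoted for Table 2.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


def forest_row(label, successes, n, width=41):
    """Draw one study as text: '-' spans the interval, '#' marks the estimate."""
    lo, hi = wilson_interval(successes, n)
    p = successes / n
    cells = [" "] * width

    def to_col(x):
        # Map a proportion in [0, 1] to a column index.
        return round(x * (width - 1))

    for col in range(to_col(lo), to_col(hi) + 1):
        cells[col] = "-"
    cells[to_col(p)] = "#"
    return f"{label:<10}|{''.join(cells)}| {p:.2f} ({lo:.2f}, {hi:.2f})"


row = forest_row("Study 1", 20, 29)      # true positives / patients with polyps
print(row)
print(forest_row("Study 2", 45, 50))     # hypothetical study
```

Smaller studies produce longer horizontal lines, which is why their squares are drawn smaller in a real forest plot.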


Figure 4: Sample forest plot of per-patient diagnostic sensitivity of CT colonography for colorectal polyps 5 mm or larger. Data are derived from 13 individual studies. ■ = point estimate of each study (area indicates relative contribution of the study to meta-analysis); horizontal line = 95% confidence interval (95% CI); ◆ = overall point estimate for meta-analysis of all component studies. In this example, the overall point estimate is 0.81 (95% confidence interval: 0.78, 0.83).

 
The meta-analysis presented in Figure 4 is relatively simple; it is effectively a weighted average across all 13 component studies. As described previously, there may be heterogeneity due to variation in the diagnostic threshold used between studies and, as we have said, a much more sophisticated approach would be to fit a summary ROC curve across all studies to account for this variation. Such a summary ROC plot for seven studies of CT colonography is shown in Figure 5. As in Figure 4, the analysis in Figure 5 refers to per-patient detection of polyps measuring 5 mm or larger, but in this case the meta-analysis of paired sensitivity and specificity was conducted by using a hierarchical model that accounts for differing diagnostic thresholds (PROC NLMIXED within SAS; SAS Institute, Cary, NC) (45), and the summary ROC curve plotted by using Meta-DiSc (46). The model estimates the average threshold and diagnostic odds ratio, as well as their variability, and allows the summary ROC curve to have both symmetric and asymmetric shapes (45). It can be seen from the summary ROC plot that CT colonography appears to be a good test for per-patient detection of polyps 5 mm and larger: the curve rises steeply and passes near the upper left corner of the plot. It is also possible to calculate the average operating point, which is the point on the summary ROC curve that represents sensitivity and specificity at the average threshold together with 95% confidence intervals. In this example, the operating point has an average sensitivity of 86.4% (95% confidence interval: 74.8%, 93.2%; range, 78.9%–100%) and an average specificity of 86.1% (95% confidence interval: 75.5%, 92.5%; range, 55.0%–100%). Again, a variety of curves at different cutoff points are possible, assuming that the component studies are not so heterogeneous as to preclude meta-analysis at different points.

Summary statistics that are more consistent across studies, such as the diagnostic odds ratio and summary ROC curve, are difficult to apply to day-to-day clinical practice because some knowledge of the sensitivity and specificity of the test in the population for which the test is intended is still required. Also, the area under the ROC curve is not recommended as a summary measure for diagnostic tests, because tests with very different clinical characteristics can have identical areas under the ROC curve. It is most useful clinically to report sensitivity and specificity values.
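For readers who want a feel for how a summary ROC curve can be fitted, the sketch below uses the simple Moses-Littenberg linear regression approach. This is explicitly not the hierarchical model described above for Figure 5; it is a swapped-in, much simpler technique shown only for illustration. The study points are hypothetical, the function names are our own, and all rates must lie strictly between 0 and 1.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def fit_sroc(points):
    """Least-squares fit of D = a + b*S over (fpr, tpr) study points.

    D = logit(tpr) - logit(fpr) is the log diagnostic odds ratio and
    S = logit(tpr) + logit(fpr) is a proxy for the diagnostic threshold.
    """
    d = [logit(tpr) - logit(fpr) for fpr, tpr in points]
    s = [logit(tpr) + logit(fpr) for fpr, tpr in points]
    n = len(points)
    mean_s, mean_d = sum(s) / n, sum(d) / n
    sxx = sum((x - mean_s) ** 2 for x in s)
    sxy = sum((x - mean_s) * (y - mean_d) for x, y in zip(s, d))
    b = sxy / sxx
    a = mean_d - b * mean_s
    return a, b


def sroc_tpr(a, b, fpr):
    """Sensitivity on the fitted summary curve at a given false-positive rate."""
    logit_tpr = a / (1 - b) + (1 + b) / (1 - b) * logit(fpr)
    return 1 / (1 + math.exp(-logit_tpr))


# Hypothetical studies that share one diagnostic odds ratio (9) but
# operate at different thresholds.
points = [(0.10, 0.50), (0.25, 0.75), (0.50, 0.90)]
a, b = fit_sroc(points)
print(f"a = {a:.3f}, b = {b:.3f}, "
      f"sensitivity at FPR 0.25 = {sroc_tpr(a, b, 0.25):.2f}")
```

When the slope b is close to zero the curve is symmetric and exp(a) is the common diagnostic odds ratio; a non-zero slope gives an asymmetric curve.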


Figure 5: Summary ROC curve of sensitivity versus 1 – specificity. Data are derived from seven studies of CT colonography (2).

 
Our researcher has collected data on more than just sensitivity and specificity. For example, he has information on which studies used an intravenously administered spasmolytic agent and which used experienced observers. He is therefore in a position to perform subset analyses to compare sensitivity and specificity between these groups in order to see if there is any difference and in what direction this difference is. However, like subgroup analysis within single studies, interpretation must be cautious (47). Our researcher also has data relating to methodologic quality and completeness of data reporting. He is able to state with certainty how good these data are when judged against objective criteria (eg, Standards for Reporting of Diagnostic Accuracy, or STARD [48], and Quality Assessment of Studies of Diagnostic Accuracy included in Systematic Reviews, or QUADAS [49]) and now has a clear "feel" for the quality of research in the field. Perhaps it is now time to write a formal report?


    INTERPRETATION AND CONSEQUENCES OF SYSTEMATIC REVIEW AND META-ANALYSIS
So, after all his hard work, our aspirational colonographer is armed with some data that have a reasonable chance of being useful. He has derived these data from studies that are methodologically sound and homogeneous enough to be combined in a meta-analysis. For example, he now knows that the per-patient average sensitivity and specificity of CT colonography for medium and large colorectal polyps is around 86% (2); the data are derived from far more patients than exist in any individual study. How might this result be interpreted?

Sensible interpretation of meta-analytical point estimates revolves around how reliable they are likely to be in practice. In essence, are they believable? To an extent, the confidence intervals around the estimates indicate their precision. For example, a recent meta-analysis of MR colonography showed a sensitivity of 75% but with a wide confidence interval of 47% to 91% and component study sensitivities ranging from 8.6% to 100% (50). However, even if confidence intervals are tight, it is important to realize that they do not indicate study quality. It is here that the systematic review component is most helpful. For example, our reviewer has discovered that the reporting quality for studies of CT colonography is generally poor: a fully populated 2 x 2 contingency table for per-patient data could be extracted from the published article for only 50% of studies despite the fact that this is the central analysis and these were studies of quality sufficient to make the review (2). Our researcher wants to use CT for screening, but in most studies symptomatic patients were examined, which undoubtedly affects the generalizability of point estimates. Also, most primary studies were performed by researchers from specialist centers, which also limits generalizability. It is time for our researcher to write a formal report of his systematic review describing what information he has found and, perhaps more important, what he has not been able to find. From this will emerge recommendations for future research.

While systematic review and meta-analysis often go hand in hand, it is unfortunate that the meta-analytical component tends to "hog the limelight." This is understandable: Non-statisticians perceive the analysis as mysterious and difficult, and it is also conceptually attractive to boil the whole review down to a single "point estimate" of performance. However, while formal meta-analysis is probably not something the average radiologist should attempt, it is rather easy for the average statistician when presented with a completed data table. We would like to stress above all else that it is the systematic review component that is the most taxing but potentially the most useful. An assessment of the quality of published research findings helps indicate exactly where there is a need for improved methods and data reporting and where there is a shortage of good available evidence. This is most apparent during data extraction, since it is at this point that the inadequacy of existing research is most visible.

This brings us back to the statement made earlier that there is a widely held belief that since the primary research has "already been done," systematic review and meta-analysis can be performed quickly. Good systematic reviews are hardly ever completed quickly when performed thoroughly because we do not live in an ideal world—the quality of data reporting is variable despite guidelines for the presentation of studies of diagnostic accuracy (48,49). Good systematic reviews get right to the heart of specific areas of research: What is the quality of work available in the field? Is it adequate enough for us to gather the information we need to inform day-to-day clinical decision-making? If not, what is needed to put things right? The answers to these questions are at least as important as a mathematical point estimate. On the basis of his experiences, our would-be colonographer may feel obliged to construct a "minimum data set," specific to studies of CT colonography, that describes exactly what methods are required and how data should be presented so that pertinent information can be readily obtained in the future.


    APPLYING RESULTS OF SYSTEMATIC REVIEW IN PRACTICE: A "BOTTOM-UP" PERSPECTIVE
Conventionally, academic researchers produce systematic reviews, which are then distributed to "grass roots" practitioners—a "top-down" approach. However, it is increasingly being advocated that such practitioners should learn how to find the best current literature, appraise it themselves, and apply it to their own circumstances—a "bottom-up" approach, which has been described in this series previously (51). Systematic reviews encompass the first three principles of evidence-based practice: Ask, search, and appraise (52). The fourth principle—apply—is the next logical step; that is, the findings of the systematic review are combined with clinical judgment, patient factors, and local circumstances. Systematic reviews can be identified with the "Systematic Review" search engine found by using the Clinical Queries link on the PubMed Services sidebar, and the quality of these systematic reviews can be assessed by using evidence-based methods (see section below, Quality Assessment of Systematic Reviews).

For medium-sized and larger polyps, our colonography researcher has found an average per-patient sensitivity and specificity of 86% and 86%. These estimates can be used with the pretest probabilities of having a polyp to determine the posttest probability in light of the test result, by using Bayes' theorem. For example, the prevalence of clinically important polyps in patients with and in those without a positive family history can be estimated by means of a literature search (53). If our researcher assumes that the pretest probability of adenomas is 5% in those patients without a family history and 15% in those with a family history and then inputs these data along with point estimates from the systematic review into a spreadsheet designed to generate conditional probabilities from summary statistics of diagnostic tests (54), then a negative CT colonography result reduces these probabilities to approximately 0.8% and 2.7% for those without and those with a family history, respectively. These estimates can be given to referring clinicians and integrated with patient factors (preference for one test over another, or comorbidity, for example). Similar calculations could also be performed for barium enema examination and colonoscopy. It is important to remember that the goalposts are always moving, especially for imaging technologies, and the literature needs to be reviewed periodically until equipoise has been reached.
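The spreadsheet calculation can be approximated in a few lines. The sketch below is not the spreadsheet cited above; it simply converts the pretest probability to odds, multiplies by the negative likelihood ratio, and converts back. The function name is our own, and the sensitivity and specificity are the average operating-point values quoted earlier (86.4% and 86.1%).

```python
def posttest_probability(pretest_probability, likelihood_ratio):
    """Update a pretest probability with a likelihood ratio (odds form of Bayes)."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


sens, spec = 0.864, 0.861
lr_neg = (1 - sens) / spec  # likelihood ratio of a negative test result

results = {}
for label, pretest in [("no family history", 0.05), ("family history", 0.15)]:
    results[label] = posttest_probability(pretest, lr_neg)
    print(f"{label}: pretest {pretest:.0%}, "
          f"after a negative test {results[label]:.3f}")
```

Expressed as probabilities, the results are about 0.008 and 0.027, that is, roughly 0.8% and 2.7%. Substituting the positive likelihood ratio gives the corresponding probability after a positive test result.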


    LIMITATIONS AND PROBLEMS OF SYSTEMATIC REVIEWS AND META-ANALYSIS
Meta-analysis has had its supporters and detractors from the outset. While ardent advocates suggest that there is now no role for the traditional narrative review (55), others suggest that combining the results of different studies, performed with different patients, in different hospitals with different equipment, in different countries, and at different times, is likely to be worthless (56). In particular, the quality of component trials is critical to the results. As we have said already, "Rubbish in, rubbish out." Famously, results of one meta-analysis showed incorrectly that administering magnesium after myocardial infarction was beneficial, a finding that was subsequently blamed on the poor quality of the component trials, their heterogeneity, and publication bias (57). Systematic reviews on the same topic have been known to reach different conclusions (58), which is something that the whole process is supposed to eliminate!

Articles are most visible if they are published in indexed journals, but publication is sometimes dependent on whether the results are positive (ie, the new test or treatment "works") (59). Such trials are also more likely to be published in English, more likely to be published rapidly and repeatedly, and more likely to be cited (59). It is also recognized that the topics selected for systematic review are influenced by prior knowledge of the results of the component studies—if the findings from CT colonography studies were uniformly poor, there would be little inclination to perform a systematic review. It is interesting to note that, at the time of writing this report, no studies have been performed to examine publication bias in studies of imaging tests.

There is also evidence that studies of diagnostic tests are generally less methodologically rigorous than are studies of treatment effects in randomized controlled trials (60), which makes the combination of studies of diagnostic tests more problematic. For example, we have already noted the generally poor assessment of MR imaging (33), and although improving, methods for studies of diagnostic tests still lag behind those employed in randomized trials (61). Furthermore, statistical methods for meta-analysis of measures of diagnostic accuracy are less developed than those for assessment of treatment effects. For example, it is unknown whether the simpler methods used to pool estimates of diagnostic performance are actually misleading in practice, and more research is required to determine the exact influence of threshold effects on heterogeneity (31,32).

We have tried to explain how to identify and select articles for a systematic review with as little bias as possible. Researchers will inevitably disagree about the minimum methodologic standard necessary for inclusion. For example, some might argue that a systematic review of CT colonography should be restricted to studies with multi–detector row scanners while others might consider this less important than other factors. Because of this disagreement, it is vital that the review team includes at least one expert familiar with the technology being assessed so that decisions are informed. Successful systematic reviews require a multitude of different skills, and the composition of the review team should reflect this.

We should also be careful not to dismiss expert opinion. Experts may actually know something worthwhile and have something useful to say! Narrative reviews remain popular with readers, not least in radiology (3). Small trials are not always "poor" trials and can divulge valuable information, not least by providing the raw material for meta-analysis and modeling (15). Findings of such trials may be the only source of information regarding some interventions. It is also important not to dismiss radiologic research just because the methods are not as rigorous as those found elsewhere. There is an important distinction between the best evidence in theory and the best available evidence in practice.


    QUALITY ASSESSMENT OF SYSTEMATIC REVIEWS
 
Like that of the articles on which they are based, the quality of systematic reviews is variable. We have already noted that findings of systematic reviews addressing the same topic may be in disagreement (58). Researchers may also arrive at essentially the same mathematical result after meta-analysis but draw different conclusions or emphases. For example, three systematic reviews of CT colonography (2,62,63) all resulted in similar point estimates, but in only one of these reviews did the authors describe difficulty extracting pertinent data from the available literature and feel compelled to propose a minimum data set for reporting of future studies (2).

Just because a review is "systematic" does not mean it is automatically superior. The methodologic quality of an individual systematic review can be judged just as readily as that of the primary studies from which it is derived, by using evidence-based practice methodology. For example, how comprehensive and unbiased was the search for studies? Are the databases described, along with the timescale, the search terms, and the number of researchers? Were the inclusion and exclusion criteria stated explicitly, are they reasonable, and are they discussed? Are the demographic data of the patient populations and clinical settings described for the component studies? Are the definitions of a positive finding and a negative finding given for both the new test and the reference test? What was their temporal separation? Are technical failures reported? Does the review present 2 × 2 tables of extracted binary data or cutoff points for continuous data? If meta-analysis has been performed, was this a reasonable thing to do and was heterogeneity of the primary studies tested for (and, if so, how)? Are the methods of the meta-analysis described adequately? How many studies and patients were included in the meta-analysis? Are the reasons for exclusions stated? If there was no attempt at performing a meta-analysis, why not? This list could go on but makes the point that not all systematic reviews are equal (64–66). Systematic reviews of systematic reviews are already here (67)!
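To make the question about heterogeneity testing concrete, the sketch below shows one common approach, Cochran's Q with the derived I² statistic, applied to logit-transformed sensitivities. It is offered only as an illustration and is not necessarily the method used in any review cited here. The function names and the study counts are hypothetical, and the 0.5 continuity correction is an assumption of this example.

```python
import math


def logit_sensitivity(tp, fn):
    """Logit of sensitivity and its approximate variance.

    A 0.5 continuity correction is added to each cell (an assumption of
    this sketch) so that studies with a zero cell can still be used.
    """
    tp_c, fn_c = tp + 0.5, fn + 0.5
    return math.log(tp_c / fn_c), 1.0 / tp_c + 1.0 / fn_c


def cochran_q(estimates, variances):
    """Cochran's Q, its degrees of freedom, and I-squared (as a proportion).

    Q is conventionally compared with a chi-square distribution on df
    degrees of freedom; I-squared is the share of variation attributed
    to heterogeneity rather than chance.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# Hypothetical (true-positive, false-negative) counts from four studies.
studies = [(45, 5), (30, 10), (80, 20), (18, 12)]
pairs = [logit_sensitivity(tp, fn) for tp, fn in studies]
estimates = [est for est, _ in pairs]
variances = [var for _, var in pairs]

q, df, i2 = cochran_q(estimates, variances)
print(f"Q = {q:.2f} on {df} df, I^2 = {i2:.0%}")
```

A large Q relative to its degrees of freedom, or a high I², suggests that the primary studies differ by more than chance alone. Threshold effects in diagnostic studies are a further source of variation that this simple check does not address.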


    THE COCHRANE COLLABORATION
 
No article on systematic reviews would be complete without mentioning the Cochrane Collaboration (28). In 1972, the British epidemiologist Archie Cochrane drew attention to the fact that unbiased and reliable information about the effects of health care interventions was limited and people could not make informed decisions (68). His book stimulated discussion that resulted in the assembly of controlled trials of perinatal medicine into a single register, which by 1985 contained over 3500 trials and facilitated 600 systematic reviews. It rapidly became clear that all areas of health care could benefit from this approach, and so the UK National Health Service funded a "Cochrane Centre" in Oxford in 1992 to facilitate preparation of systematic reviews of randomized trials of health care. The New York Academy of Sciences spread the idea around the world and was followed by the Cochrane Collaboration, "an international organization that aims to help people make well-informed decisions about healthcare by preparing, maintaining, and promoting the accessibility of systematic reviews of the effects of healthcare interventions" (28).

The development of the Cochrane Collaboration logo makes for interesting reading: The logo illustrates a forest plot of results of seven randomized controlled trials on the effects of administering steroids to women who were about to give birth prematurely. The first trial was reported in 1972 and the last in 1980, and the pooled point estimate shows the clear benefit of giving steroids that would have been apparent had these trials been reviewed systematically at that time. Unfortunately, a systematic review was not performed until a decade later, and thousands of babies died despite the availability of information that could have saved them.

The main work of the Cochrane Collaboration is performed by Collaborative Review Groups supported by the Cochrane Centres around the world. The groups prepare and maintain the Cochrane Reviews, which are held in a central repository, the Cochrane Library, which may be purchased on CD-ROM or subscribed to on the Internet (Appendix). There are also method groups, which advise on statistical methods, and extensive consumer groups. There are considerable logistic, methodologic, and ethical challenges facing the Cochrane Collaboration, but the alternative is to make decisions about health care without regard to the best available evidence. At the time of writing this report, only randomized controlled trials are reviewed, but it is likely that the reviews will extend into assessments of diagnostic accuracy in the near future.


    APPENDIX
 
Information about the Cochrane Collaboration can be found at http://www.cochrane.org. The abstracts of the Cochrane Reviews can be searched online free of charge. Some countries (eg, those in South America and low-income countries) have free full access and others (eg, England) have free access for health professionals. The Web site also contains summaries for consumers and "plain English" discussions of the results of many reviews.

For further information, readers are directed toward the book Systematic Reviews in Health Care (25).


    FOOTNOTES
 

Abbreviations: ROC = receiver operating characteristic

Authors stated no financial relationship to disclose.


    References
 

  1. Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS. Evidence-based medicine: what it is and what it isn't. BMJ 1996;312:71–72.
  2. Halligan S, Altman DG, Taylor SA, et al. CT colonography in the detection of colorectal polyps and cancer: systematic review, meta-analysis, and proposed minimum dataset for study level reporting. Radiology 2005;237:893–904.
  3. Williams CJ. The pitfalls of narrative reviews in clinical medicine. Ann Oncol 1998;9:601–605.
  4. Schoepf UJ, Costello P. CT angiography for diagnosis of pulmonary embolism: state of the art. Radiology 2004;230:329–337.
  5. Sackett DL. The sins of expertness and a proposal for redemption [editorial]. BMJ 2000;320:1283.
  6. Who are the experts, where is the expertise? Drug Ther Bull 2002;40:55–56.
  7. Gotzsche PC. Reference bias in reports of drug trials. Br Med J (Clin Res Ed) 1987;295:654–656.
  8. Glick S. Double-contrast barium enema for colorectal cancer screening: a review of the issues and a comparison with other screening alternatives. AJR Am J Roentgenol 2000;174:1529–1537.
  9. Fletcher RH. The end of barium enemas? N Engl J Med 2000;342:1823–1824.
  10. Discussion session. Presented at the Second ESGAR CT Colonography Workshop. Rome, Italy, September 17th, 2004.
  11. Eccles M, Freemantle N, Mason J. Using systematic reviews in clinical guideline development. In: Egger M, Davey Smith G, Altman DG, eds. Systematic reviews in health care. 2nd ed. London, England: BMJ Publishing Group, 2001; 402.
  12. Taylor SA, Halligan S, Goh V, et al. Optimizing colonic distention for multi–detector row CT colonography: effect of hyoscine butylbromide and rectal balloon catheter. Radiology 2003;229:99–108.
  13. Bruzzi JF, Moss AC, Brennan DD, MacMathuna P, Fenlon HM. Efficacy of IV Buscopan as a muscle relaxant in CT colonography. Eur Radiol 2003;13:2264–2270.
  14. Altman DG. The scandal of poor medical research. BMJ 1994;308:283–284.
  15. Lilford R, Stevens AJ. Underpowered studies. Br J Surg 2002;89:129–131.
  16. Hillman BJ. Outcomes research and cost-effectiveness analysis for diagnostic imaging. Radiology 1994;193:307–310.
  17. Dixon AK. Evidence-based diagnostic radiology. Lancet 1997;350:509–512.
  18. Song FJ, Fry-Smith A, Davenport C, Bayliss S, Adi Y, Wilson JS. Identification and assessment of ongoing trials in health technology assessment reviews. Health Technol Assess 2004;8:1–87.
  19. Sterling TD. Publication decisions and their possible effects on inferences drawn from tests of significance--or vice versa. J Am Stat Assoc 1959;54:30–34.
  20. Fenlon HM, Nunes DP, Schroy PC 3rd, Barish MA, Clarke PD, Ferrucci JT. A comparison of virtual and conventional colonoscopy for the detection of colorectal polyps. N Engl J Med 1999;341:1496–1503.
  21. Rex DK, Vining D, Kopecky KK. An initial experience with screening for colon polyps using spiral CT with and without CT colography (virtual colonoscopy). Gastrointest Endosc 1999;50:309–313.
  22. Pickhardt PJ, Choi JR, Hwang I, et al. Computed tomographic virtual colonoscopy to screen for colorectal neoplasia in asymptomatic adults. N Engl J Med 2003;349:2191–2200.
  23. Rockey DC, Paulson E, Niedzwiecki D, et al. Prospective comparison of colon imaging tests: a determination of the relative sensitivity of air contrast barium enema, computed tomographic colonography, and colonoscopy. Lancet 2005;365:305–311.
  24. Halligan S, Atkin W. Unbiased studies are needed before CT colonography can be dismissed. Lancet 2005;365:275–276.
  25. Egger M, Davey Smith G, Altman DG, eds. Systematic reviews in health care. 2nd ed. London, England: BMJ Publishing Group, 2001.
  26. Greenhalgh T. Papers that summarise other papers (systematic reviews and meta-analyses). BMJ 1997;315:672–675.
  27. Staunton M. Evidence-based radiology: steps 1 and 2--asking answerable questions and searching for evidence. Radiology 2007;242:23–31.
  28. Chalmers I. The Cochrane Collaboration: preparing, maintaining, and disseminating systematic reviews of the effects of health care. Ann N Y Acad Sci 1993;703:156–165.
  29. Furberg CD. Effect of antiarrhythmic drugs on mortality after myocardial infarction. Am J Cardiol 1983;52:32C–36C.
  30. Chalmers I. Foreword. In: Egger M, Davey Smith G, Altman DG, eds. Systematic reviews in health care. 2nd ed. London, England: BMJ Publishing Group, 2001; 13.
  31. Deeks JJ. Systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323:157–162.
  32. Deeks JJ. Systematic reviews of evaluations of diagnostic and screening tests. In: Egger M, Davey Smith G, Altman DG, eds. Systematic reviews in health care. 2nd ed. London, England: BMJ Publishing Group, 2001; 248–282.
  33. Cooper LS, Chalmers TC, McCally M, Berrier J, Sacks HS. The poor quality of early evaluations of magnetic resonance imaging. JAMA 1988;259:3277–3280.
  34. Barish MA. Consensus statement. In: Proceedings of the Fourth International Symposium on Virtual Colonoscopy. Boston, Mass: Trustees of Boston University, 2003; 137–143.
  35. Rex DK, Cutler CS, Lemmel GT, et al. Colonoscopic miss rates of adenomas determined by back-to-back colonoscopies. Gastroenterology 1997;112:24–28.
  36. MEDLINE Web site. U.S. National Library of Medicine. http://medline.cos.com/. Accessed November 3, 2005.
  37. EMBASE Web site. http://www.embase.com. Accessed November 3, 2005.
  38. NCBI PubMed Web site. U.S. National Library of Medicine. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi. Accessed November 3, 2005.
  39. Vining D, Gelfand DW, Bechtold RE, Scharling ES, Grishaw EK, Shifrin RY. Technical feasibility of colon imaging with helical CT and virtual reality [abstr]. AJR Am J Roentgenol 1994;162(suppl):S104.
  40. Cancer literature in PubMed (CancerLit). National Cancer Institute Web site. http://www.cancer.gov/search/cancer_literature/. Accessed November 3, 2005.
  41. Cochrane Collaboration Web site. http://www.cochrane.org. Accessed November 3, 2005.
  42. Taylor SA, Halligan S, Saunders BP, et al. Use of multidetector-row CT colonography for detection of colorectal neoplasia in patients referred via the department of health "2-week-wait" initiative. Clin Radiol 2003;58:855–861.
  43. Jaeschke R, Guyatt GH, Sackett DL. Users' guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? The Evidence-Based Medicine Working Group. JAMA 1994;271:703–707.
  44. Johnson CD, Toledano AY, Herman BA, et al. Computerized tomographic colonography: performance evaluation in a retrospective multicentre setting. Gastroenterology 2003;125:688–695.
  45. Macaskill P. Empirical Bayes estimates generated in a hierarchical summary ROC analysis agreed closely with those of a full Bayesian analysis. J Clin Epidemiol 2004;57:925–932.
  46. Zamora J, Muriel A, Abraira V. Meta-DiSc for Windows: a software package for the meta-analysis of diagnostic tests. XI Cochrane Colloquium, Barcelona, 2003. http://www.hrc.es/investigacion/metadisc.html. Accessed February 6, 2007.
  47. Davey Smith G, Egger M, Phillips AN. Meta-analysis: Beyond the grand mean? BMJ 1997;315:1610–1614.
  48. Bossuyt PM, Reitsma JB, Bruns DE, et al. Toward complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Radiology 2003;226:24–28.
  49. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews [editorial]. BMC Med Res Methodol 2003;3:25.
  50. Purkayastha S, Tekkis P, Athanasiou T, et al. Magnetic resonance colonography versus colonoscopy as a diagnostic modality for colorectal cancer: a meta-analysis. Clin Radiol 2005;60:980–989.
  51. Malone DE. Evidence-based practice in radiology: an introduction to the series. Radiology 2007;242:12–14.
  52. Introduction. In: Sackett DL, Straus SE, Richardson WS, Rosenberg W, Haynes RB, eds. Evidence-based medicine, how to practice and teach EBM. 2nd ed. Edinburgh, Scotland: Churchill Livingstone, 2000; 3–4.
  53. Elwood JM, Ali G, Schlup MM, et al. Flexible sigmoidoscopy or colonoscopy for colorectal screening: a randomized trial of performance and acceptability. Cancer Detect Prev 1995;19:337–347.
  54. Maceneaney PM, Malone DE. The meaning of diagnostic test results: a spreadsheet for swift data analysis. Clin Radiol 2000;55:227–235.
  55. Chalmers TC, Frank CS, Reitman D. Minimizing the three stages of publication bias. JAMA 1990;263:1392–1395.
  56. Eysenck HJ. Problems with meta-analysis. In: Chalmers I, Altman DG, eds. Systematic reviews. London, England: BMJ Publishing Group, 1995; 64–74.
  57. Egger M, Davey-Smith G. Misleading meta-analysis: lessons from "an effective, safe, simple" intervention that wasn't. BMJ 1995;310:752–754.
  58. Jadad AR, Cook DJ, Browman GP. A guide to interpreting discordant systematic reviews. CMAJ 1997;156:1411–1416.
  59. Egger M, Dickersin K, Davey-Smith G. Problems and limitations in conducting systematic reviews. In: Egger M, Davey Smith G, Altman DG, eds. Systematic reviews in health care. 2nd ed. London, England: BMJ Publishing Group, 2001; 43–68.
  60. Irwig L, Tosteson AN, Gatsonis CA, et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 1994;120:667–676.
  61. Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research: getting better but still not good. JAMA 1995;274:645–651.
  62. Sosna J, Morrin MM, Kruskal JB, Lavin PT, Rosen MP, Raptopoulos V. CT colonography of colorectal polyps: a meta-analysis. AJR Am J Roentgenol 2003;181:1593–1598.
  63. Mulhall BP, Veerappan GR, Jackson JL. Meta-analysis: computed tomographic colonography. Ann Intern Med 2005;142:635–650.
  64. Oxman AD, Cook D, Guyatt GH. Users' guides to the medical literature. VI. How to use an overview. Evidence-Based Medicine Working Group. JAMA 1994;272:1367–1371.
  65. McGovern DP. Systematic reviews. In: Levi M, ed. Key topics in evidence-based medicine. Oxford, England: Scion, 2001; 17–19.
  66. Oxman A, Guyatt G, Cook D, Montori V. Summarizing the evidence. In: Guyatt G, Rennie D, eds. Users' guides to the medical literature: a manual for evidence-based practice. Chicago, Ill: American Medical Association Press, 2002; 155–173.
  67. Mallett S, Summerton N, Deeks J, Halligan S, Altman DG. Systematic reviews of diagnostic tests in cancer: assessment of methodology and reporting quality. Presented at the XI Cochrane Colloquium: Evidence, Health Care and Culture, Barcelona, Spain, October 26–31, 2003.
  68. Cochrane A. Effectiveness and efficiency: random reflections on health services. London, England: Nuffield Provincial Hospital Trust, 1972.




