Statistical Concepts Series
1 From the Departments of Biostatistics and Epidemiology and Radiology/Wb4, the Cleveland Clinic Foundation, 9500 Euclid Ave, Cleveland, OH 44195. Received May 8, 2001; revision requested June 4; revision received June 21; accepted July 9. Address correspondence to the author (e-mail: nobuchow@bio.ri.ccf.org).
| ABSTRACT |
|---|
© RSNA, 2003
Index terms: Statistical analysis
| INTRODUCTION |
|---|
Bias can occur in selecting the patients, images, and/or readers (ie, radiologists) for a study, in choosing and applying the reference-standard procedure, in performing and interpreting the tests, and in analyzing the results. Many of the biases encountered in radiologic studies have been given names, but there are many unnamed biases that we can identify and avoid by using common sense.
It is best to recognize the potential sources of bias while in the process of designing a study. Then, solutions to the bias problem, or at least ways to minimize the effect of the bias, can be implemented in the study. Note that having a large sample size may reduce the variability (ie, random error) (Table) of our estimates, but it is never a solution to bias (ie, systematic error).
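The distinction between random and systematic error can be made concrete with a small calculation. The sketch below is illustrative only: the true sensitivity (0.80) and the value that a flawed design converges to (0.90) are hypothetical numbers, and `standard_error` is a helper name introduced here. It shows the standard error shrinking as the sample grows while the bias stays fixed.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent cases."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical values, chosen only for illustration.
true_sensitivity = 0.80      # value in the target population
biased_expectation = 0.90    # value a flawed study design converges to

bias = biased_expectation - true_sensitivity
for n in (50, 500, 5000):
    se = standard_error(biased_expectation, n)
    # The random error falls with n; the systematic error does not change.
    print(f"n={n:5d}  random error (SE)={se:.3f}  systematic error={bias:+.3f}")
```

With more cases the estimate becomes more precise, but it is precise around the wrong value.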
| BIAS IN SELECTING THE PATIENT SAMPLE |
|---|
Selection bias occurs when external factors influence the composition of the sample to the extent that the sample does not represent the population (eg, in terms of patient types, the frequency of the patient types, or both). Spectrum bias (4) is a type of selection bias; it exists when the sample is missing important subgroups. A classic example of spectrum bias is that encountered in screening mammography studies to compare the accuracy of full-field digital mammography with that of conventional mammography. A very large sample size is required to perform a comparison of these two modalities because the prevalence of breast cancer in screening populations is very low. One strategy to reduce the sample size is to consider women who have positive conventional mammography results. These women return for biopsy, and at that time, full-field digital mammography can be performed. However, there is a serious problem with this strategy: The patients with negative conventional mammography results (true- and false-negative cases) have been selected out. The consequence is that the sensitivity of conventional mammography will be greatly overestimated (making full-field digital mammography seem inferior) and the specificity of conventional mammography will be greatly underestimated (making full-field digital mammography seem superior).
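The effect of selecting out the negative conventional mammography results can be seen in a small worked example. The counts below are hypothetical (10,000 screened women, 50 cancers) and were chosen only to make the arithmetic easy to follow; they do not come from any study.

```python
def sensitivity(tp, fn):
    """Proportion of diseased patients with a positive test result."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of disease-free patients with a negative test result."""
    return tn / (tn + fp)

# Hypothetical conventional mammography results in a full screening population.
full = {"tp": 40, "fn": 10, "fp": 995, "tn": 8955}

# Sample restricted to women with positive conventional mammography results:
# every true-negative and false-negative case has been selected out.
restricted = {"tp": full["tp"], "fn": 0, "fp": full["fp"], "tn": 0}

for label, c in (("full population", full), ("positive results only", restricted)):
    se = sensitivity(c["tp"], c["fn"])
    sp = specificity(c["tn"], c["fp"])
    print(f"{label:22s} sensitivity={se:.2f}  specificity={sp:.2f}")
```

In the restricted sample, conventional mammography has no false-negative and no true-negative results by construction, so its apparent sensitivity is 100% and its apparent specificity is 0%, whatever its real accuracy.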
Once a new diagnostic test is shown to be capable of yielding different results for "the sickest of the sick" and "the wellest of the well," it is time to challenge the test. We challenge a test by performing it in study patients for whom making a diagnosis is difficult (4). From the results of these studies, we can determine if the test will be reliable in a clinical population that includes both patients for whom it is easy and patients for whom it is difficult to make a diagnosis. However, because of spectrum bias, we still cannot measure, without bias, other variables such as the test's sensitivity and specificity.
Suppose now that we have a well-established test that we know from previous studies is reliable even for difficult-to-diagnose cases. We want to measure, for example, the test's sensitivity and specificity for a particular population of patients. Ideally, we would select our study patients by taking a random sample from the population of patients who present to their primary physician with a certain set of signs and symptoms. In fact, a random sample is the basis of the interpretation of P values calculated in statistical analyses. We then perform the well-established test in these patients and measure the test's sensitivity and specificity. These measurements will be generalizable (Table) to similar patients who present to their primary physicians with the same signs and symptoms.
Sometimes this ideal study design is not workable. Alternatively, for this well-established test, suppose we select our study patients from a population of individuals who are referred to the radiology department for the test. Such a sample is called a referred or convenience sample. These patients have been selected to undergo the test. Other patients from the population may not have been referred for the test, or they may have been referred at a different rate. It is usually impossible to determine the factors that influenced the evaluating physicians' referral patterns. Thus, the measurements taken from a referred sample are generalizable only to the referring physicians in the study since other physicians will select different patients.
If we must use a referred sample, for example, to minimize costs, then we should at least carefully collect and record important patient characteristics (important in the sense that the measurements taken in the study might vary according to these characteristics) and the relative frequency of these characteristics. We should report the measurements obtained in patients with various characteristics (eg, report the test's sensitivity and specificity for patients with and those without symptoms). This will allow others to compare the characteristics of their patient population with the characteristics of the study sample to determine how generalizable the study results are to their radiology practice.
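Reporting accuracy by patient characteristic is simple to do once the characteristics have been recorded. The following is a minimal sketch, assuming each study record is a tuple of subgroup label, true disease status, and test result; the function name `accuracy_by_subgroup` and the counts are hypothetical.

```python
from collections import defaultdict

def accuracy_by_subgroup(records):
    """records: iterable of (subgroup, has_disease, test_positive) tuples."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for subgroup, has_disease, test_positive in records:
        if has_disease:
            key = "tp" if test_positive else "fn"
        else:
            key = "fp" if test_positive else "tn"
        counts[subgroup][key] += 1

    summary = {}
    for subgroup, c in counts.items():
        n_diseased = c["tp"] + c["fn"]
        n_disease_free = c["tn"] + c["fp"]
        summary[subgroup] = {
            "n": n_diseased + n_disease_free,  # relative frequency of the subgroup
            "sensitivity": c["tp"] / n_diseased if n_diseased else None,
            "specificity": c["tn"] / n_disease_free if n_disease_free else None,
        }
    return summary

# Hypothetical referred sample: 50 symptomatic and 70 asymptomatic patients.
records = (
    [("symptomatic", True, True)] * 18 + [("symptomatic", True, False)] * 2
    + [("symptomatic", False, False)] * 24 + [("symptomatic", False, True)] * 6
    + [("asymptomatic", True, True)] * 6 + [("asymptomatic", True, False)] * 4
    + [("asymptomatic", False, False)] * 54 + [("asymptomatic", False, True)] * 6
)

for subgroup, stats in accuracy_by_subgroup(records).items():
    print(subgroup, stats)
```

A reader whose practice includes mostly asymptomatic patients can then rely on the asymptomatic row rather than on a single pooled estimate.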
| BIAS IN SELECTING THE READER SAMPLE |
|---|
It can be challenging to obtain a truly representative sample of readers for studies. The problem is illustrated in the mammography study performed by Beam et al (6). They identified all of the American College of Radiology-accredited mammography centers in the United States. There were 4,611 such centers in the United States at the time of the study. Then they randomly sampled 125 of the 4,611 centers and mailed letters to these centers to assess their willingness to participate. Only 50 centers (40%) agreed to take part in the study. One hundred eight radiologists from these 50 centers actually interpreted images for the study. There was a clear potential for bias because the highly motivated centers and readers may have been more likely to volunteer, and these centers and readers may not have been representative of the population. It is unclear how to overcome this type of bias.
| BIAS IN CHOOSING AND APPLYING THE REFERENCE-STANDARD TEST |
|---|
Imperfect standard bias occurs when the reference-standard procedure yields results that are not nearly 100% accurate. An example would be a study of the accuracy of head CT for the diagnosis of multiple sclerosis. If MR imaging were used as the reference-standard test, then the measures of the accuracy of CT would be biased (ie, probably too low in value) (7) because the accuracy of MR imaging in the diagnosis of multiple sclerosis is not near 100%.
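To see why an imperfect reference standard lowers the measured accuracy, consider the following sketch. It assumes that the test and the reference standard make errors independently of one another given the true disease status; that assumption, the function name `apparent_accuracy`, and all of the numeric values are illustrative and are not taken from the multiple sclerosis example.

```python
def apparent_accuracy(se_test, sp_test, se_ref, sp_ref, prevalence):
    """Sensitivity and specificity of a test as measured against an imperfect
    reference standard, assuming the two procedures err independently given
    the patient's true disease status."""
    p = prevalence
    # Joint probabilities of (test result, reference result).
    both_positive = p * se_test * se_ref + (1 - p) * (1 - sp_test) * (1 - sp_ref)
    both_negative = p * (1 - se_test) * (1 - se_ref) + (1 - p) * sp_test * sp_ref
    # Marginal probability of a positive reference result.
    ref_positive = p * se_ref + (1 - p) * (1 - sp_ref)
    return both_positive / ref_positive, both_negative / (1 - ref_positive)

# Hypothetical values: a test with true sensitivity 0.80 and specificity 0.90.
perfect = apparent_accuracy(0.80, 0.90, 1.00, 1.00, prevalence=0.30)
imperfect = apparent_accuracy(0.80, 0.90, 0.90, 0.90, prevalence=0.30)
print("perfect reference:   se=%.3f sp=%.3f" % perfect)
print("imperfect reference: se=%.3f sp=%.3f" % imperfect)
```

Under these assumptions both measured values fall below the true ones, because some of the test's correct results are scored as errors whenever the reference standard itself is wrong.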
Some might argue that there is no such thing as a "gold" standard. Even pathologic analysis results are not 100% accurate because, like radiology, pathology is an interpretative discipline. For all studies it is important to have operational standards (Table) that take into account the condition being studied, the objectives of the study, and the potential effects of any bias. Some common sense is needed as well (8).
There are various solutions to imperfect standard bias. First, we can choose a better reference-standard procedure, if one exists. For the multiple sclerosis study, we could follow up the patients for several months or years to establish a clinical diagnosis and use the follow-up findings as the reference standard for comparison with the results of CT. Sometimes, however, there is no reference-standard procedure. For example, suppose we want to estimate the accuracy of a new test for identifying the location in the brain that is responsible for epileptic seizures. There is no reference-standard test in this case. However, as an alternative to measuring the test's accuracy, we could frame the problem in terms of the clinical outcome (7): We could compare the test results with the patients' seizure status after nerve stimulation to various locations and report the strength of this relationship. Such analysis can yield useful clinical information, even when the test's accuracy cannot be adequately evaluated.
Another solution is to use an expert panel to establish a working diagnosis. Thornbury et al (9) formed an expert panel to determine the diagnoses for patients who underwent MR imaging and CT for acute low back pain. The panel was given the patients' medical histories, physical examination results, laboratory findings, treatment results, and follow-up information to decide whether a herniated disk was present. The determinations of the expert panel regarding the patients' true diagnoses were used as the reference standards with which the MR imaging and CT results were compared. Note that the expert panel was not given the results of MR imaging or CT. This was planned to avoid incorporation bias, which occurs when the results of the diagnostic test(s) under evaluation are incorporated, in full or in part, into the evidence used to establish the definitive diagnosis (4).
A fourth solution to imperfect standard bias is to apply one of several statistical corrections (10). To apply these corrections, one must make some assumptions about the imperfect reference standard (eg, that its sensitivity and specificity are known) and/or the relationship between the results of the test being assessed and the results of the reference-standard test (eg, that the test in question and the reference-standard test make errors independently of one another). Research into new statistical methods for addressing imperfect standard bias is ongoing.
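As an illustration of how such a correction can work, the sketch below makes both of the assumptions mentioned above at once: the sensitivity and specificity of the reference standard are known, and the two procedures make errors independently given the true disease status. It is one possible algebraic correction under those assumptions, not necessarily one of the methods in reference 10, and the function name and input values are hypothetical.

```python
def corrected_accuracy(p_pos_pos, p_pos_neg, p_neg_pos, p_neg_neg, se_ref, sp_ref):
    """Recover a test's sensitivity and specificity from the observed joint
    proportions of (test result, reference result).

    p_pos_neg means test positive, reference negative, and so on.
    Assumes se_ref and sp_ref are known and that the test and the reference
    standard err independently given true disease status."""
    d = se_ref + sp_ref - 1  # must be > 0 for an informative reference standard
    ref_positive = p_pos_pos + p_neg_pos
    prevalence = (ref_positive + sp_ref - 1) / d
    # Solve the 2 x 2 linear system for P(test+, diseased) and P(test+, disease free).
    test_pos_diseased = (p_pos_pos * sp_ref - p_pos_neg * (1 - sp_ref)) / d
    test_pos_disease_free = (p_pos_neg * se_ref - p_pos_pos * (1 - se_ref)) / d
    se_test = test_pos_diseased / prevalence
    sp_test = 1 - test_pos_disease_free / (1 - prevalence)
    return se_test, sp_test, prevalence

# Hypothetical observed proportions (they sum to 1).
se_test, sp_test, prevalence = corrected_accuracy(
    0.223, 0.087, 0.117, 0.573, se_ref=0.90, sp_ref=0.90
)
print(f"corrected sensitivity={se_test:.3f} specificity={sp_test:.3f} "
      f"prevalence={prevalence:.3f}")
```

The correction is only as good as its assumptions: if the assumed accuracy of the reference standard is wrong, or if the two procedures tend to err in the same patients, the corrected values will themselves be biased.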
In some studies, a reference-standard procedure exists, but it cannot be performed in all of the study patients, usually owing to ethical reasons. An example is a study to assess the accuracy of lung cancer screening with CT: If a patient has negative CT results, then we cannot perform biopsy or surgery to determine his or her true disease status. Verification bias occurs when patients with positive or negative test results are preferentially referred for the reference-standard procedure and then the sensitivity and specificity are based only on those patients who underwent the reference-standard test (11). This bias is counterintuitive in that investigators usually believe that including only the patients for whom there was rigorous verification of the presence or absence of disease will make their study design ideal (12). The opposite is true, however: Studies in which the most stringent verification of disease status is required and the cases with less definitive confirmation are discarded often yield the most biased estimates of accuracy (11,13).
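A small worked example shows how restricting the analysis to verified patients distorts the estimates. The counts are hypothetical: 1,000 patients undergo the test, all patients with positive results are verified, and only 10% of patients with negative results are verified.

```python
def sens_spec(tp, fn, fp, tn):
    """Sensitivity and specificity from the four cells of a 2 x 2 table."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical true status of everyone who underwent the test.
full = dict(tp=80, fn=20, fp=90, tn=810)

# Verified patients only: all 170 test-positive patients, but just 83 of the
# 830 test-negative patients (2 with disease, 81 without).
verified = dict(tp=80, fn=2, fp=90, tn=81)

full_se, full_sp = sens_spec(**full)
ver_se, ver_sp = sens_spec(**verified)
print(f"all patients:  sensitivity={full_se:.3f} specificity={full_sp:.3f}")
print(f"verified only: sensitivity={ver_se:.3f} specificity={ver_sp:.3f}")
```

Because most of the negative results never reach the reference-standard procedure, the verified-only analysis in this example inflates the sensitivity and deflates the specificity.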
One solution to verification bias is to design the study so that the diagnostic test results will not be used to determine which patients will undergo disease status verification. Rather, the study patients can be selected to undergo the reference-standard procedure on the basis of their signs, symptoms, and other test results, not the results of the test(s) evaluated in the study. This is not always possible because the test(s) under evaluation may be the usual clinical test(s) used to make diagnoses and manage the treatment of these patients.
Another solution is to use different reference-standard procedure(s) for different patients. For example, in evaluating the accuracy of CT for lung cancer screening, some patients may undergo biopsy and surgery and others can be followed up clinically and radiologically for a specified period (eg, 2 years) to detect wrongly diagnosed cases (Table). We cannot simply assume that patients with negative test results are disease free; this assumption can lead to a serious overestimation of test specificity (11).
A third solution to verification bias is to apply a statistical correction to the estimates of accuracy. A number of correction methods exist (14). Most of these methods are based on the assumption that the decision to verify a patient's diagnosis (that is, to refer the patient for further diagnostic work-up, including the reference-standard test used in the study) is a conscious one and thus is based on visible factors, such as the test result and the patient's signs and symptoms. To apply any of the correction methods, it is essential that we record the results of all patients who undergo the test being assessed, not just those of patients who undergo the evaluated test and the reference-standard procedure.
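The sketch below illustrates one correction of this kind, under the simplifying assumption that the decision to verify depends only on the test result. This approach is commonly associated with Begg and Greenes; it is shown here as an example and is not necessarily among the methods in reference 14. The function name and counts are hypothetical. Note that the calculation needs the test results of all patients, verified or not.

```python
def corrected_accuracy(n_test_pos, n_test_neg, verified_pos, verified_neg):
    """Verification-bias correction assuming verification depends only on the
    test result.

    n_test_pos, n_test_neg: ALL patients with positive / negative test results.
    verified_pos, verified_neg: (n_diseased, n_disease_free) among the verified
    patients with positive / negative test results."""
    d_pos, h_pos = verified_pos
    d_neg, h_neg = verified_neg
    # Disease rate within each test result, estimated from verified patients.
    p_dis_given_pos = d_pos / (d_pos + h_pos)
    p_dis_given_neg = d_neg / (d_neg + h_neg)
    # Project those rates onto everyone who underwent the test.
    est_dis_pos = n_test_pos * p_dis_given_pos
    est_dis_neg = n_test_neg * p_dis_given_neg
    est_free_pos = n_test_pos - est_dis_pos
    est_free_neg = n_test_neg - est_dis_neg
    sensitivity = est_dis_pos / (est_dis_pos + est_dis_neg)
    specificity = est_free_neg / (est_free_pos + est_free_neg)
    return sensitivity, specificity

# Hypothetical study: 170 positive and 830 negative test results in total;
# all positives verified (80 diseased, 90 not), 83 negatives verified (2, 81).
se, sp = corrected_accuracy(170, 830, verified_pos=(80, 90), verified_neg=(2, 81))
print(f"corrected sensitivity={se:.3f} specificity={sp:.3f}")
```

If verification also depends on signs and symptoms, the same projection would have to be done within strata defined by those factors, which is why they must be recorded as well.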
| BIAS IN PERFORMING AND INTERPRETING TESTS |
|---|
Review bias (4) occurs when a diagnostic test, or the reference-standard test, is performed or interpreted without proper blinding (Table). Consider as an example a study to compare the capability of CT and ultrasonography (US) to depict tumors. When performing US, the technician and radiologist should not be aware of the CT findings because the technician might search with more scrutiny in locations where a tumor was found at CT and the radiologist may have a tendency to "overread" a suspicious area when he or she knows that the CT reader interpreted it to be a tumor. The simplest way to avoid this type of bias is to "blind" both the technician and the reader to the results of the other tests.
In retrospective studies in which the tests have already been performed and interpreted, it is critical that we scrutinize the usual clinical practice in search of review bias. For example, suppose we are reviewing the test findings of all patients who underwent CT and pulmonary angiography for detection of pulmonary emboli. We may find that angiography was almost always performed after CT, and we may suspect that the angiogram was obtained and interpreted with knowledge of the CT findings. For such a study, it may be possible to reinterpret the angiogram while blinded to the CT results. However, one cannot perform the angiographic examination again while blinded to the CT results. In these situations we must be aware that the potential for bias exists and interpret the study findings with the appropriate level of caution.
When two tests (for example, tests A and B) are performed in the same patient and the images are interpreted by the same reader, the images read last (for example, the test B images) will tend to be interpreted more accurately than the images read first (that is, the test A images) if the reader retains any information (15). This situation is called reading-order bias, and it can (a) negate a real difference (ie, if test A is really superior to test B), (b) inflate the true difference (ie, if test B is really superior to test A), or (c) create a difference when no true difference exists.
The simplest way to reduce or eliminate reading-order bias is to vary the order in which the test findings are interpreted (15). For example, suppose 50 patients underwent both test A and test B. The reader could first interpret the results of test A for half of the patients; let us call them group 1. Next, the reader would interpret the results of test B for the second half of the patients; let us call them group 2. After a sufficient time lag, the reader would interpret the test B results for group 1 and then the test A results for group 2. This way, the effect of reading-order bias would be cancelled out, because although the test A results would be read first for half of the patients, the test B results also would be read first for half of the patients.
Note that patients would have to be randomized (Table) to the two groups and the images obtained in the two groups would need to be presented to the readers in random order. The rationale for this protocol is that readers sometimes remember the first (and even second and last) case in a reading session, so by randomizing patients we reduce the effect of any retained information.
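The reading protocol described above can be sketched in a few lines of code. This is a minimal illustration, assuming patients are identified by number; the function name `counterbalanced_schedule` and the seed value are arbitrary choices made here, and the time lag between sessions is not modeled.

```python
import random

def counterbalanced_schedule(patient_ids, seed=None):
    """Randomize patients to two groups and build two reading sessions so that
    each test is read first for half of the patients."""
    rng = random.Random(seed)

    def shuffled(items):
        items = list(items)
        rng.shuffle(items)
        return items

    ids = shuffled(patient_ids)              # random assignment to groups
    half = len(ids) // 2
    group1, group2 = ids[:half], ids[half:]

    # Session 1: test A for group 1, then test B for group 2.
    session1 = ([(pid, "A") for pid in shuffled(group1)]
                + [(pid, "B") for pid in shuffled(group2)])
    # Session 2 (after the time lag): test B for group 1, then test A for group 2.
    session2 = ([(pid, "B") for pid in shuffled(group1)]
                + [(pid, "A") for pid in shuffled(group2)])
    return {"group1": group1, "group2": group2,
            "session1": session1, "session2": session2}

plan = counterbalanced_schedule(range(1, 51), seed=2003)
print(len(plan["group1"]), len(plan["group2"]))
print(plan["session1"][:3])
```

Within each block of a session the cases are reshuffled, so a case that happens to be first in one session is unlikely to be first in the other.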
An additional way to reduce the effect of retained information is to allow a sufficient time lag between the first and subsequent readings of images in the same case. No standard time is appropriate for all studies. Rather, the duration of the time lag should depend on the complexity of the readings and the volume of the study cases and similar clinical cases that the reader is expected to interpret. For example, if the study cases are those from screening examinations and the reader in his or her typical clinical practice interprets the results of many screening examinations, then a short time lag (ie, a few days) is probably sufficient. In contrast, if the study cases are difficult and complex to interpret and thus a great deal of time is required to determine the diagnosis, and/or if the reader does not typically interpret the types of cases included in the study, then a long time lag (ie, several months) is needed to minimize the retained information.
One last bias that I will discuss in this section occurs when tests are interpreted in an artificial environment. Intuitively, in an experimental setting, we might expect readers to interpret cases with more care because they know that their performance is being measured. Egglin and Feinstein (16) addressed another issue that affects reader performance. They performed a study to assess the effect that disease prevalence has on test interpretation. They assembled a test set of pulmonary arteriograms with a depicted pulmonary embolism prevalence of 33% and embedded this set into two larger groups of arteriograms such that group A had an overall prevalence rate of 60% and group B an overall prevalence rate of 20%. After blinded randomized reviews by six readers, they concluded that readers' accuracies differ depending on the context and often improve when the disease prevalence is higher. Egglin and Feinstein (16) defined context bias as the bias in accuracy measurements that occurs when the disease prevalence in the sample differs greatly from the prevalence in the clinical population. They suggested that investigators use a sample with a disease prevalence similar to that in the clinically relevant population.
| BIAS IN ANALYZING TEST RESULTS |
|---|
Another common problem occurs when some study forms are missing or parts of the forms are incomplete or filled out incorrectly. Response bias occurs when we include just the complete data in our analysis and ignore the missing data. The problem is that there is often a pattern to the missing data. For example, patients who are found to be disease free tend to be followed up with less scrutiny compared with patients who have disease, so data on measures such as patient satisfaction come mostly from patients with disease. However, the results might be different for disease-free patients.
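The patient satisfaction example can be worked through with hypothetical numbers. In the sketch below, 200 patients with disease have a mean satisfaction score of 60 and 800 disease-free patients have a mean score of 80, and completed forms are available for 90% of the first group but only 30% of the second. All of these values, and the helper name `mean_score`, are assumptions made for illustration.

```python
def mean_score(groups):
    """groups: list of (n_patients, mean_score) pairs; returns the pooled mean."""
    total = sum(n for n, _ in groups)
    return sum(n * score for n, score in groups) / total

# Hypothetical study population: (number of patients, mean satisfaction score).
everyone = [(200, 60.0), (800, 80.0)]      # with disease, disease free

# Patients with complete forms: 90% of diseased, 30% of disease-free patients.
respondents = [(180, 60.0), (240, 80.0)]

true_mean = mean_score(everyone)
complete_case_mean = mean_score(respondents)
print(f"all patients:        {true_mean:.1f}")
print(f"complete forms only: {complete_case_mean:.1f}")
```

The complete-case mean is pulled toward the score of the patients with disease simply because they are overrepresented among the completed forms.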
Although there are statistical methods to account for data that are missing not at random (19), it is best to minimize the frequency of missing data by properly training the staff who complete the forms and including mechanisms to collect the incomplete data (eg, multiple telephone and mail messages to nonresponders, cross checks in other databases for information on medical utilization and major outcomes).
| CONCLUSION |
|---|
| REFERENCES |
|---|