Statistical Concepts Series
1 From the Department of Radiology, Riley Hospital for Children, Indianapolis, Ind (K.E.A.); Department of Radiology, Boston University, 88 E Newton St, Atrium 2, Boston, MA 02118 (R.T.); and Departments of Radiology and Biostatistics, Indiana University School of Medicine, Indianapolis (J.Y.). Received October 16, 2002; revision requested February 18, 2003; final revision received March 13; accepted April 1. Address correspondence to R.T. (e-mail: tello@alum.mit.edu).
| ABSTRACT |
|---|
χ² test, Fisher exact test, and McNemar test. When the data are continuous, different nonparametric tests are used to compare two independent samples or paired samples: the Mann-Whitney U test (equivalent to the Wilcoxon rank sum test) for independent samples, and the Wilcoxon signed rank test and the sign test for paired samples. These nonparametric tests are considered alternatives to the parametric t tests, especially in circumstances in which the assumptions of t tests are not valid. For radiologists to properly weigh the evidence in the literature, they must have a basic understanding of the purpose, assumptions, and limitations of each of these statistical tests. © RSNA, 2003
Index terms: Statistical analysis
| INTRODUCTION |
|---|
The purpose of this article is to discuss different nonparametric or distribution-free tests and their applications with continuous and categorical data. For the analysis of continuous data, many radiologists are familiar with the t test, a parametric test that is used to compare two means. However, misuse of the t test is common in the medical literature (2). To perform t tests properly, we need to make sure the data meet the following two critical conditions: (a) The data are continuous, and (b) the populations are distributed normally. In this article, we introduce the application of nonparametric statistical methods when these two assumptions are not met. These methods require less stringent assumptions of the population distributions than those for the t tests. When two populations are independent, the Mann-Whitney U test can be used to compare the two population distributions (3). An additional advantage of the Mann-Whitney U test is that it can be used to compare ordinal data, as well as continuous data. When the observations are in pairs from the same subject, we can use either the Wilcoxon signed rank test or the sign test to replace the paired t test.
For categorical data, the χ² test is often used. The χ² test for goodness of fit is used to study whether two or more mutually independent populations are similar (or homogeneous) with respect to some characteristic (4-12). Another application of the χ² test is a test of independence. Such a test is used to determine whether two or more characteristics are associated (or independent). In our discussion, we will also introduce some extensions of the χ² test, such as the Fisher exact test (13,14) for small samples and the McNemar test for paired data (15).
| CATEGORICAL DATA |
|---|
χ² test (16). If two groups of subjects are sampled from two independent populations and a binary outcome is used for classification (eg, positive or negative imaging result), then we use the χ² test of homogeneity. Sometimes radiologists are interested in analyzing the association between two criteria of classification. This results in the test of independence by using a similar 2 x 2 contingency table and χ² statistic. When sample sizes are small, we prefer to use the Fisher exact test. If we have paired measurements from the same subject, we use the McNemar test to compare the proportions of the same outcome between these two measurements in the 2 x 2 contingency table.
χ² Test
The χ² test allows comparison of the observed frequency with its corresponding expected frequency, which is calculated according to the null hypothesis in each cell of the 2 x 2 contingency table (Eq [A1], Appendix A). If the expected frequencies are close to the observed frequencies, the model according to the null hypothesis fits the data well; thus, the null hypothesis should not be rejected. We start with the analysis of a 2 x 2 contingency table by considering the following two examples. The same χ² formula is used in both examples, but they are different in the sense that the data are sampled in different ways.
Example 1: test of homogeneity between two groups. One hundred patients and 100 healthy control subjects are enrolled in a magnetic resonance (MR) imaging study. The MR imaging result can be classified as either "positive" or "negative" (Table 1). The radiologist is interested in finding out if the proportion of positive findings in the patient group is the same as that in the control group. In other words, the null hypothesis is that the proportion of positive findings is the same in the two groups. The alternative hypothesis is that they are different. We call this a test of homogeneity. In this first example, the two groups (patients and control subjects) are in the rows, and the two outcomes of positive and negative test results are in the columns. In the statistical analysis, only one variable, the imaging result (classified as positive or negative), is considered.
χ² statistic is calculated and yields a P value of .001 (17). Typically, we reject the null hypothesis if the P value is less than .05 (the significance level). In this example, we conclude that the two groups are not homogeneous, since the proportions of positive imaging results are different.
Example 2: test of independence between two variables in one group. A radiologist studies gadolinium-based contrast material enhancement of renal masses at MR imaging in 65 patients (18). Table 2 shows that there are 17 patients with enhancing renal masses, with 14 malignant masses and three benign masses at pathologic examination. Among the 48 patients with nonenhancing renal masses, three masses are malignant and 45 are benign at pathologic examination. In this example, the presence or absence of contrast enhancement is indicated in the rows, and the malignant and benign pathologic findings are in the columns. In this second example, only the total number of 65 patients is fixed; the presence or absence of contrast enhancement is compared with the pathologic result (malignant or benign). The question of interest is whether these two variables are associated. In other words, the null hypothesis is that contrast enhancement of a renal mass is not associated with the presence of a malignant tumor, and the alternative hypothesis is that enhancement and malignancy are associated. In this example, the χ² statistic yields a P value less than .001. We reject the null hypothesis and conclude that the presence of contrast enhancement at MR imaging is associated with renal malignancy.
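As a rough check of Example 2, the following Python sketch computes the expected frequency of each cell from the row and column totals and then sums (observed - expected)² / expected over the four cells. The function and variable names are illustrative only, and the P value uses the closed form for the upper tail of a χ² distribution with 1 degree of freedom, which applies to a 2 x 2 table but not to larger tables.

```python
import math

def expected_counts(table):
    """Expected cell counts under the null hypothesis: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

# Example 2: rows = enhancing / nonenhancing, columns = malignant / benign.
renal_masses = [[14, 3], [3, 45]]
chi2 = pearson_chi_square(renal_masses)

# Upper-tail probability for 1 degree of freedom (valid for a 2 x 2 table only).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi-square = {chi2:.2f}, P = {p_value:.2g}")
```

With these counts the P value falls well below .001, in line with the conclusion stated in the example.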
χ² test is that the χ² statistic is discrete, since the observed frequencies in the 2 x 2 contingency table are counts. However, the χ² distribution itself is continuous. In 1934, Yates (12) proposed a procedure to correct for this possible bias. Although there is controversy about whether to apply this correction, it is sometimes used when the sample size is small. In the first example discussed earlier, the χ² statistic was 10.17, and the P value was .001. The Yates corrected χ² statistic is 9.27 with a P value of .002. The correction yields a smaller χ² statistic and, consequently, a larger P value. This indicates that the Yates corrected χ² test is less powerful in rejecting the null hypothesis. Some applications of Yates correction in medicine are discussed in the statistical textbook by Altman (19).
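The effect of the continuity correction can be sketched with the shortcut formula for a 2 x 2 table, in which N/2 is subtracted from |ad - bc| before squaring. Table 1 is not reproduced in this text, so the counts below are hypothetical: they were chosen only because they give statistics that match the 10.17 and 9.27 quoted above, and they should not be read as the study data.

```python
import math

def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square statistic and P value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates continuity correction: shrink |ad - bc| by N/2, not below zero.
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# HYPOTHETICAL counts (assumption, not from Table 1):
# 100 patients with 50 positive results, 100 control subjects with 28 positive.
plain = chi_square_2x2(50, 50, 28, 72)
corrected = chi_square_2x2(50, 50, 28, 72, yates=True)
print(f"uncorrected: chi-square = {plain[0]:.2f}, P = {plain[1]:.3f}")
print(f"Yates:       chi-square = {corrected[0]:.2f}, P = {corrected[1]:.3f}")
```

The corrected statistic is always the smaller of the two, which is why the corrected test rejects the null hypothesis less readily.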
Fisher Exact Test
When sample sizes are small, the χ² test yields poor results, and the Fisher exact test is preferred. A general rule of thumb is to use it when either the sample size is less than 30 or the expected number of observations in any one cell of a 2 x 2 contingency table is fewer than five (20). The test is called an "exact" test because it allows calculation of the exact probability (rather than an approximation) of obtaining the observed results or results that are more extreme. Although radiologists may be more familiar with the traditional χ² test, there is no reason not to use the Fisher exact test in its place, given the ease of use and availability of computer software today.
In example 1, the P value resulting from use of the χ² test was .001, whereas the P value for the same data tested by using the Fisher exact test was .002. Both tests lead to the same conclusion of lack of homogeneity between the patient and control groups. Intuitively, the P value derived by using the Fisher exact test is the probability, under the null hypothesis, of obtaining positive results that are at least as discrepant between the two groups as those observed. Most statistical software packages provide computation of the Fisher exact test (Appendix B).
Example 3: Fisher exact test. A radiologist enrolls 20 patients and 20 healthy subjects in a computed tomographic (CT) study. The CT result is classified as either "positive" or "negative." Table 3 shows that 10 patients and four healthy subjects have positive findings at CT. The null hypothesis is that the two populations are homogeneous in the proportion of positive findings seen at CT.
χ² test incorrectly, the P value is .05, which suggests the opposite conclusion: that the proportions of positive CT results are different in these two groups.
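The counts in Example 3 (10 of 20 patients and 4 of 20 healthy subjects with positive CT findings) are enough to sketch the calculation. The code below enumerates every 2 x 2 table with the same row and column totals and adds up the probabilities of the tables that are no more likely than the observed one. That two-sided convention is an assumption of this sketch; software packages may define the two-sided P value slightly differently.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(k):
        # Number of ways to obtain k in the top-left cell with all margins fixed.
        return math.comb(col1, k) * math.comb(n - col1, row1 - k)

    lowest = max(0, row1 + col1 - n)
    highest = min(row1, col1)
    observed = weight(a)
    # Integer weights are compared directly, so ties are handled exactly.
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / math.comb(n, row1)

# Example 3: rows = patients / healthy subjects, columns = positive / negative.
p_fisher = fisher_exact_two_sided(10, 10, 4, 16)

# Uncorrected chi-square test on the same table, for comparison.
chi2 = 40 * (10 * 16 - 10 * 4) ** 2 / (20 * 20 * 14 * 26)
p_chi2 = math.erfc(math.sqrt(chi2 / 2))
print(f"Fisher exact P = {p_fisher:.3f}, chi-square P = {p_chi2:.3f}")
```

Under these assumptions the exact P value stays above .05 while the χ² approximation falls just below it, which is the disagreement the example describes.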
McNemar Test for Paired Data
A test for assessment of paired count data is the McNemar test (15). This test is used to compare two paired measurements from the same subject. When the sample size is large, the McNemar test follows the same χ² distribution but uses a slightly different formula. Radiology research often involves the comparison of two paired imaging results from the same subject. In a 2 x 2 table, the results of one imaging test are labeled "positive" and "negative" in rows, and the results of another imaging test are labeled similarly in columns. An interesting property of this table is that there are two concordant cells in which the paired results are the same (both positive or both negative) and two discordant cells in which the paired results are different for the same subject (positive-negative or negative-positive).
We are interested in analyzing whether these two imaging tests show equivalent results. The McNemar test uses only the information in the discordant cells and ignores the concordant cell data. In particular, the null hypothesis is that the proportions of positive results are the same for these two imaging tests, versus the alternative hypothesis that they are not the same. Intuitively, the null hypothesis is retained if the discordant pairs are distributed evenly in the two discordant cells. The following example illustrates the problem in more detail.
Example 4: McNemar test for paired data. There are 200 patients enrolled in a study to compare CT and conventional angiography of coronary bypass grafts for the diagnosis of graft patency (Table 4). Seventy-one patients have positive results with both conventional angiography and CT angiography, 86 have negative results with both, 30 have positive CT results but negative conventional angiographic results, and 13 have negative CT results but positive conventional angiographic results. The McNemar test compares the proportions of the discordant pairs (13 of 200 vs 30 of 200). The P value of the McNemar statistic is .02, which suggests that the proportion of positive results is significantly different for the two modalities. Therefore, we conclude that the ability of these two modalities to demonstrate graft patency is different.
χ² test, as discussed in example 1 (21). This is a common mistake in the medical literature. Applied to the data of example 4 in the manner of example 1, the proportions compared are 101 of 200 versus 84 of 200 (all positive CT angiography results vs all positive conventional angiography results). The problem is the assumption that CT angiography and conventional angiography results are independent, and thus, the paired relationship between these two imaging tests is ignored (2,21). The χ² test has less power to reject the null hypothesis than does the McNemar test in this situation and results in a P value of .09. We would incorrectly conclude that there is no significant difference in the ability of these two modalities to demonstrate graft patency.
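The contrast between the paired and unpaired analyses of Example 4 can be sketched as follows. The large-sample McNemar statistic uses only the two discordant counts, (b - c)² / (b + c), optionally with a continuity correction. Which variant produced the P value reported in the example is not stated in this text, so both are shown, and the sketch is not meant to reproduce that value exactly. The function names are illustrative.

```python
import math

def p_value_1df(stat):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def mcnemar(b, c, continuity=False):
    """Large-sample McNemar test from the two discordant cell counts."""
    diff = abs(b - c)
    if continuity:
        diff = max(0, diff - 1)
    stat = diff ** 2 / (b + c)
    return stat, p_value_1df(stat)

def unpaired_chi_square(positives_1, positives_2, n_each):
    """Chi-square test that (wrongly, for paired data) treats the results as independent."""
    a, b = positives_1, n_each - positives_1
    c, d = positives_2, n_each - positives_2
    n = 2 * n_each
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, p_value_1df(stat)

# Example 4 counts: 71 both positive, 30 CT-only positive, 13 angiography-only positive.
both_positive, ct_only, angio_only = 71, 30, 13

paired = mcnemar(ct_only, angio_only)
paired_corrected = mcnemar(ct_only, angio_only, continuity=True)
unpaired = unpaired_chi_square(both_positive + ct_only, both_positive + angio_only, 200)

print(f"McNemar:            P = {paired[1]:.3f}")
print(f"McNemar, corrected: P = {paired_corrected[1]:.3f}")
print(f"unpaired chi-square: P = {unpaired[1]:.3f}")
```

Both McNemar variants give P values below .05, whereas the unpaired χ² test on 101 of 200 versus 84 of 200 does not, which illustrates the loss of power when the pairing is ignored.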
| HYPOTHESIS TESTING BY USING MEDIANS |
|---|
Nonparametric tests were developed to deal with situations where the population distributions are either not normal or unknown, especially when the sample size is small (<30 samples). These tests are relatively easy to understand and simple to apply and require minimal assumptions about the population distributions. However, this does not mean that they are always preferred to parametric tests. When the assumptions are met, parametric tests have higher testing power than their nonparametric counterparts; that is, it is more likely that a false null hypothesis will be rejected.
Three commonly encountered nonparametric tests are the Mann-Whitney U test (equivalent to the Wilcoxon rank sum test), the Wilcoxon signed rank test, and the sign test.
Comparison of Two Independent Samples: Mann-Whitney U Test
The Mann-Whitney U test is used to compare the difference between two population distributions and assumes the two samples are independent (22). It does not require normal population distributions, and the measurement scale can be ordinal.
The Mann-Whitney U test is used to test the null hypothesis that there is no location difference between two population distributions versus the alternative hypothesis that the location of one population distribution differs from the other. With the null hypothesis, the same location implies the same median for the two populations. For simplicity, we can restate the null hypothesis: The medians of the two populations are the same. Three alternative hypotheses are available: (a) The population medians are not equal, (b) the population median of the first group is larger than that of the second, or (c) the population median of the second group is larger than that of the first. If we put the two random samples together and rank them, then, according to the null hypothesis, which holds that there is no difference between the two population medians, the total rank of one sample would be close to the total rank of the other. On the other hand, if all the ranks of one sample are smaller than the ranks of the other, then we know almost surely that the location of one population is shifted relative to that of the other.
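In symbols, the ranking idea can be written as below. This is the standard textbook form of the U statistic rather than a formula given in this article, and the notation is assumed here: R_1 is the sum of the pooled ranks belonging to the first sample, and n_1 and n_2 are the two sample sizes.

```latex
U_1 = R_1 - \frac{n_1 (n_1 + 1)}{2},
\qquad
U_2 = n_1 n_2 - U_1,
\qquad
\mathrm{E}[U_1 \mid H_0] = \frac{n_1 n_2}{2}
```

A value of U_1 far from n_1 n_2 / 2 in either direction indicates that one sample tends to occupy the lower or the higher ranks, which is evidence against the null hypothesis.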
We give two examples of the application of the Mann-Whitney U test, one involving continuous data and the other involving ordinal data.
Example 5: Mann-Whitney U test for continuous data. The uptake of fluorine 18 choline (hereafter, "fluorocholine") by the kidney can be considered approximately normally distributed (23). Let us say that some results of hypothetical research suggest that fluorocholine uptake above 5.5 (percentage dose per organ) is more common in men than in women. If we are only interested in the patients whose uptake is over 5.5, the distribution is no longer normal but becomes skewed. The Figure shows the uptake over 5.5 in 10 men and seven women sampled from populations imaged with fluorocholine for tumor surveillance. We are interested in finding out if there are any differences in these populations on the basis of patient sex.
Example 6: Mann-Whitney U test for ordinal data. A radiologist wishes to know which of two different MR imaging sequences provides better image quality. Twenty-four patients undergo MR imaging with a T2-weighted fast spin-echo sequence, and 22 other patients are imaged with the same sequence but with the addition of fat saturation (Table 6). The image quality is measured by using a standardized scoring system, ranging from 1 to 100, where 100 is the best image quality. The null hypothesis is that the median scores are the same for the two populations. In the group imaged with the first MR sequence, the images of eight subjects are scored under 25, those of 14 subjects are scored between 25 and 75, and those of two subjects are scored above 75. In the group imaged with the fat-saturated MR sequence, there are three, 12, and seven subjects in these three score categories, respectively.
χ² statistic to compare the two groups. The P value corresponding to the χ² statistic is .08, and we would conclude that the two groups have similar image quality. The problem is that the three image quality score categories are only treated as nominal variables, and their ordinal relationship is not accounted for in the χ² test. An alternative test that allows us to use this information is the Mann-Whitney U test. The Mann-Whitney U test yields a P value of .03. We reject the null hypothesis and conclude that the median image scores are different.
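Both analyses of Example 6 can be sketched from the category counts alone. The code below assumes that every subject within a score category is tied, assigns the category's mid-rank to each of them, and uses the normal approximation with a tie correction and no continuity correction. Those choices are assumptions of this sketch and may differ from the software used for the reported values. The χ² P value uses the closed form for 2 degrees of freedom, which fits a 2 x 3 table only.

```python
import math

def chi_square_stat(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )

def mann_whitney_from_counts(counts_a, counts_b):
    """U for group A and a two-sided normal-approximation P value.

    Categories must be ordered from lowest to highest score.
    """
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    rank_sum_a, tie_term, ranks_used = 0.0, 0, 0
    for in_a, in_b in zip(counts_a, counts_b):
        tied = in_a + in_b
        if tied == 0:
            continue
        mid_rank = ranks_used + (tied + 1) / 2  # average rank of the tied block
        rank_sum_a += in_a * mid_rank
        tie_term += tied ** 3 - tied
        ranks_used += tied
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - n_a * n_b / 2) / math.sqrt(variance)
    return u_a, math.erfc(abs(z) / math.sqrt(2))

# Example 6: counts scored under 25, between 25 and 75, above 75.
without_fat_sat = [8, 14, 2]
with_fat_sat = [3, 12, 7]

p_chi2 = math.exp(-chi_square_stat([without_fat_sat, with_fat_sat]) / 2)  # 2 df
u_stat, p_mwu = mann_whitney_from_counts(without_fat_sat, with_fat_sat)
print(f"chi-square P = {p_chi2:.2f}, Mann-Whitney U = {u_stat:.0f}, P = {p_mwu:.2f}")
```

Under these assumptions the two P values come out close to the .08 and .03 quoted in the example, and the difference between them reflects the ordering information that the χ² test discards.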
Comparison of Paired Samples: Wilcoxon Signed Rank Test
The Wilcoxon signed rank test is an alternative to the paired t test. Each paired sample is dependent, and the data are continuous. The assumption needed to use the Wilcoxon signed rank test is less stringent than the assumptions needed for the paired t test. It requires only that the paired population be distributed symmetrically about its median (24).
The Wilcoxon signed rank test is used to test the null hypothesis that the median of the paired population differences is zero versus the alternative hypothesis that the median is not zero. Because the distribution of the differences is assumed to be symmetric, its median and mean coincide, so the test is equivalent to a test of the mean for the purpose of hypothesis testing, as long as the sample size is large enough (at least 10 rankings).
We rank the absolute values of the paired differences from the sample. With the null hypothesis, we would expect the total rank of the pairs whose differences are negative to be comparable to the total rank of the pairs whose differences are positive. The following example shows the application of the Wilcoxon signed rank test.
Example 7: paired data. A sample of 20 patients is used to compare ring enhancement between T1-weighted spin-echo MR images and fat-saturated T1-weighted spin-echo MR images obtained after contrast material administration (Table 7). We notice that the image quality score on fat-saturated T1-weighted spin-echo MR images in case 7 is 98, which is much higher than the others. As a result, the difference in values between the two sequences is also much higher than that for the other paired differences. It would be unwise to use a paired t test in this case, since the t test is sensitive to extreme values in a sample and tends to incorrectly retain a false null hypothesis as a consequence. The nonparametric tests are more robust to data extremes, and thus, the Wilcoxon signed rank test is preferred in this case. The null hypothesis states that the median of the paired MR sequence differences is zero. The Wilcoxon signed rank test provides a P value of .02, so we reject the null hypothesis. We conclude that the fat-saturated MR sequence showed ring enhancement better than did the MR sequence without fat saturation. If we had incorrectly used the paired t test, the P value would be .07, and we would have arrived at the opposite conclusion.
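The ranking step of the Wilcoxon signed rank test can be sketched as follows. Table 7 is not reproduced in this text, so the eight pairs of scores below are hypothetical and serve only to show how a single extreme difference is handled. The exact two-sided P value is obtained by enumerating all possible sign patterns, which is practical only for small samples, and the function names are illustrative.

```python
from itertools import product

def signed_ranks(first, second):
    """Nonzero paired differences and the mid-ranks of their absolute values."""
    diffs = [s - f for f, s in zip(first, second) if s != f]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend the block while the absolute differences are tied.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return diffs, ranks

def wilcoxon_signed_rank(first, second):
    """Rank sums of positive and negative differences and an exact two-sided P value."""
    diffs, ranks = signed_ranks(first, second)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    total, observed = w_plus + w_minus, min(w_plus, w_minus)
    # Under the null hypothesis each difference is equally likely to be + or -.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return w_plus, w_minus, hits / 2 ** len(ranks)

# HYPOTHETICAL scores (assumption, not from Table 7); the seventh pair is extreme.
without_fat_sat = [40, 52, 47, 61, 55, 38, 45, 50]
with_fat_sat = [46, 55, 45, 70, 60, 42, 98, 57]

w_plus, w_minus, p_exact = wilcoxon_signed_rank(without_fat_sat, with_fat_sat)
print(f"W+ = {w_plus}, W- = {w_minus}, exact two-sided P = {p_exact:.4f}")
```

The extreme difference of 53 contributes only the largest rank (8) to the statistic, not its raw size, which is why the test is less affected by outliers than the paired t test.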
| SUMMARY |
|---|
| APPENDIX A |
|---|
The χ² formula is based on the following equation:
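The equation itself (Eq [A1]) did not survive in this copy of the article. A standard form of the Pearson χ² statistic that is consistent with the description in the text, where the observed frequency in each cell is compared with its expected frequency, is given below. The notation is assumed here and is not necessarily the authors': O_ij and E_ij are the observed and expected counts in row i and column j, R_i and C_j are the row and column totals, and N is the total sample size.

```latex
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
```

For a 2 x 2 contingency table the statistic is referred to a χ² distribution with 1 degree of freedom.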
| APPENDIX B |
|---|
χ² and McNemar statistics include SPSS, SAS, StatXact 5, and EpiInfo (EpiInfo allows calculation of the Fisher exact test and may be downloaded at no cost from the Centers for Disease Control and Prevention Web site at www.cdc.gov). Other statistical Web sites include fonsg3.let.uva.nl/Service/Statistics.html, department.obg.cuhk.edu.hk/ResearchSupport/WhatsNew.asp, and www.graphpad.com/quickcalcs/Contingency1.cfm (all Web sites accessed January 30, 2003).
| ACKNOWLEDGMENTS |
|---|
| REFERENCES |
|---|