Statistical Concepts Series
1 From the Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University, Central Radiology Viewing Area, Room 117, 600 N Wolfe St, Baltimore, MD 21287. Received February 21, 2003; revision requested April 10; revision received July 18; accepted July 21. Address correspondence to the author (e-mail: jeng@jhmi.edu).
| ABSTRACT |
|---|
© RSNA, 2004
Index terms: Radiology and radiologists, research; Receiver operating characteristic (ROC) curve; Statistical analysis
| INTRODUCTION |
|---|
| CONSEQUENCES OF SAMPLE SIZE CALCULATIONS |
|---|
An inadequate sample size also has ethical implications. If a study is not designed to include enough individuals to adequately test the research hypothesis, then the study unethically exposes individuals to the risks and discomfort of the research even though there is no potential for scientific gain. Although the connection between research ethics and adequate sample size has been recognized for at least 25 years (3), the performance of clinical trials with inadequate sample sizes remains widespread (4).
Practical Consequences of Mathematic Properties
A more intuitive understanding of the determinants of sample size can be obtained through closer inspection of the formulas for sample size. We saw in the previous article (1) that when the outcome variable of a comparative study is a continuous value for which means are compared, the appropriate sample size (5) is given by
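$$N = \frac{4\sigma^{2}\,(z_{\text{crit}} + z_{\text{pwr}})^{2}}{D^{2}} \tag{1}$$

where $N$ is the total sample size (both comparison groups combined), $D$ is the smallest meaningful difference between the two means, $\sigma$ is the SD of each group, and $z_{\text{crit}}$ and $z_{\text{pwr}}$ are constants determined by the specified significance criterion (eg, .05) and desired statistical power (eg, .8), respectively. Since $z_{\text{crit}}$ and $z_{\text{pwr}}$ are independent of the properties of the data, sample size depends only on the ratio between the smallest meaningful difference and the SD (Fig 2).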
Another property of the comparison of means is that for a given SD, only the arithmetic difference between the comparison groups affects the sample size. For example, the sample size would be the same for detecting a systolic blood pressure difference of 10 mm Hg whether it is to be measured in normotensive individuals (eg, 110 vs 120 mm Hg) or hypertensive individuals (eg, 170 vs 180 mm Hg).
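The blood pressure example can be checked numerically. The following is a minimal sketch, assuming the usual normal-approximation formula for comparing two means, N = 4σ²(zcrit + zpwr)²/D², a two-sided significance criterion, and a hypothetical SD of 20 mm Hg (the SD is not given in the text). The function name `total_sample_size_means` is illustrative only.

```python
import math
from statistics import NormalDist


def total_sample_size_means(diff, sd, alpha=0.05, power=0.80):
    """Total sample size (both groups combined) for comparing two means.

    Uses N = 4 * sd^2 * (z_crit + z_pwr)^2 / diff^2 with a two-sided
    significance criterion, rounded up to a whole subject.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_pwr = z.inv_cdf(power)           # about 0.84 for power = .8
    return math.ceil(4 * sd**2 * (z_crit + z_pwr) ** 2 / diff**2)


# Hypothetical SD of 20 mm Hg (an assumption, not a value from the article).
normotensive = total_sample_size_means(diff=120 - 110, sd=20)
hypertensive = total_sample_size_means(diff=180 - 170, sd=20)
print(normotensive, hypertensive)  # prints "126 126"
```

Only the difference and the SD enter the calculation, so the two scenarios give the same answer; scaling both by the same factor (eg, a difference of 1 with an SD of 2) also leaves the result unchanged.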
When proportions are being compared (a common task in clinical imaging research), the sample size depends on both the smallest meaningful difference between the proportions and the size of the proportions themselves. In contrast to the comparison of means, the difference alone does not determine the sample size. The sample size increases dramatically as the meaningful difference between proportions is made smaller (Fig 3). The sample size also increases as the mean of the two proportions being compared approaches 0.5.
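Both effects can be illustrated with a short calculation. This sketch assumes a widely used normal-approximation formula for comparing two proportions (without continuity correction), which may differ in detail from the formula used to draw Figure 3. The function name `total_sample_size_proportions` is hypothetical.

```python
import math
from statistics import NormalDist


def total_sample_size_proportions(p1, p2, alpha=0.05, power=0.80):
    """Total sample size (both groups) for comparing two proportions.

    Per-group size: (z_crit*sqrt(2*p_bar*q_bar) + z_pwr*sqrt(p1*q1 + p2*q2))^2
    divided by (p1 - p2)^2, with a two-sided significance criterion.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    z_pwr = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_crit * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_pwr * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    per_group = numerator / (p1 - p2) ** 2
    return 2 * math.ceil(per_group)


# Same difference (0.10), different locations on the 0-1 scale.
for p1, p2 in [(0.05, 0.15), (0.25, 0.35), (0.45, 0.55)]:
    print(p1, p2, total_sample_size_proportions(p1, p2))

# Same mean proportion (0.5), shrinking difference.
for p1, p2 in [(0.40, 0.60), (0.45, 0.55), (0.475, 0.525)]:
    print(p1, p2, total_sample_size_proportions(p1, p2))
```

The first loop shows the required size growing as the proportions move toward 0.5; the second shows it growing as the difference shrinks.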
However, it can be shown that the retrospective power, which is essentially an observed quantity, is inversely related to the observed P value (6). The retrospective power tends to be large in any study with a small (statistically significant) observed P value. Conversely, the retrospective power tends to be small in any study with a large (statistically nonsignificant) observed P value. Therefore, the observed retrospective power cannot provide any information in addition to the observed P value (7,8). The important point is that the smallest meaningful difference is not the same as the observed difference: The former must be set before the study is conducted and is not determined after the study is completed.
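The one-to-one relationship between retrospective power and the P value can be made concrete. The sketch below assumes a two-sided z test and treats the observed difference as if it were the true difference, which is exactly the practice being criticized; the test choice and the function name `observed_power` are assumptions for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Retrospective power of a two-sided z test, computed from the P value alone."""
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)  # |z| implied by the P value
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability of rejecting if the true standardized effect equaled z_obs.
    return _STD_NORMAL.cdf(z_obs - z_crit) + _STD_NORMAL.cdf(-z_obs - z_crit)


for p in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(f"P = {p:<5}  observed power = {observed_power(p):.2f}")
```

Because the function takes only the P value (and the significance criterion) as input, the retrospective power carries no information beyond the P value; under these assumptions, a result exactly at P = .05 always corresponds to a retrospective power of about .5.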
Even though calculating the retrospective power is problematic, it remains important to consider the issue of adequate sample size when one is faced with a study whose results indicate there is no difference between comparison groups. Fortunately, several statistical approaches are available to guide the reader in terms of whether or not to "believe" a study that yields negative results (9). These approaches involve calculating CIs or performing χ² tests.
| USE OF SIMULATION TO DETERMINE SAMPLE SIZE FOR COMPLEX STUDY DESIGNS |
|---|
… χ² test is a special case), correlation coefficient analysis, and simple survival analysis (10). Approximations exist for some other statistical models, most notably logistic regression, but the accuracy of these approximations may be difficult to establish in all situations. Thus, the list of all statistical tests for which exact sample size calculation methods exist is much smaller than the list of all statistical tests. When no formula exists, as often happens for moderately complex statistical designs, the investigator may try to perform a sample size analysis for a simplified version of the study design and hope that the sample size can be extrapolated to the actual (more complex) study design being planned.
For situations without corresponding formulas, it is becoming more common to estimate sample size by using the technique of simulation (11). The simulation approach is powerful because it can be applied to almost any statistical model, regardless of the model's complexity. In simulation, a mathematic model is used to generate a synthetic data set simulating one that might be collected in the study being planned. The mathematic model contains the dependent and independent variables being measured, along with estimates of each variable's SD. The synthetic data set contains the same number of subjects as the planned sample size.
The planned statistical analysis is performed with this synthetic data set, and a P value is determined. As usual, the null hypothesis is rejected if the P value is less than a certain criterion value (eg, P < .05). This process is repeated a large number of times (perhaps hundreds or thousands of times) by using the mathematic model to generate a different synthetic data set for each iteration. The statistical power is equal to the percentage of these data sets in which the null hypothesis is rejected. In effect, simulation employs a mathematic model to virtually repeat the study an arbitrarily large number of times, allowing the statistical power to be determined essentially by direct measurement.
Since a real data set would contain random statistical error, random statistical error must be modeled in the synthetic data sets. To accomplish this in simulation, a random-number generator is used to add random error ("noise") to each synthetic data set. Because of their heavy reliance on random-number generators, simulation methods are also known as Monte Carlo methods, after the city in which random numbers also play an important role.
Let us consider a simple example. Suppose we are planning a clinical study to compare the contrast-to-noise ratio (CNR) between two magnetic resonance imaging pulse sequences used to examine each subject in a group of subjects. We would like to know the statistical power of the study to detect a smallest meaningful CNR difference of 2. We would like to plan a study with a power of .8 for detecting this smallest meaningful difference. We have resources to comfortably recruit and evaluate approximately 12 subjects. Suppose that from our previous experience with the pulse sequences, we estimate the SD of the CNR difference to be 4.
The statistical model for this study is
$$y_i = 2 + \varepsilon_i \tag{2}$$

where $y_i$ is the observed CNR difference in subject $i$, 2 is the smallest meaningful CNR difference, and $\varepsilon_i$ is the random error associated with the observation of subject $i$. To run the simulation, we use a normally distributed random-number generator for $\varepsilon_i$ that generates a different normally distributed random number for each of the 12 observations. The mean of the numbers generated by the random-number generator is 0 and the SD is 4, which we estimated on the basis of previous experience. With these 12 random numbers, we can generate a synthetic data set of 12 observations by using Equation (2). The simulated data set is then subjected to a t test, and the resulting P value is recorded. The entire simulation process is then repeated, say, 1,000 times. The P value is recorded after each iteration. After completing the iterations, the P values are examined to determine what proportion of the iterations resulted in the detection of a statistically significant difference (indicated by P < .05); this proportion is equal to the power. The simulation for this example was performed with Stata version 7.0 (Stata, College Station, Tex), and the results are shown in the first line of the Table. In this example, the null hypothesis is rejected in 343 of the 1,000 iterations. Therefore, the statistical power of the t test, given the conditions of this example, is .34 (Table).
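The procedure just described can be sketched in a few lines. The article's simulation was run in Stata; the version below is an illustrative re-creation in Python with NumPy and SciPy, and it assumes a one-sample, two-sided t test of the CNR differences against 0. The function name `simulate_power` and the seed are arbitrary choices.

```python
import numpy as np
from scipy import stats


def simulate_power(n_subjects=12, true_diff=2.0, sd=4.0,
                   alpha=0.05, n_iter=1000, seed=0):
    """Estimate power by simulating the study n_iter times."""
    rng = np.random.default_rng(seed)
    # Each row is one synthetic study: y_i = true_diff + error_i,
    # with normally distributed errors (mean 0, SD = sd).
    errors = rng.normal(loc=0.0, scale=sd, size=(n_iter, n_subjects))
    y = true_diff + errors
    # One-sample t test of each synthetic study against a mean of 0.
    p_values = stats.ttest_1samp(y, popmean=0.0, axis=1).pvalue
    # Power = fraction of synthetic studies that reject the null hypothesis.
    return float(np.mean(p_values < alpha))


print(f"Estimated power with 12 subjects: {simulate_power():.2f}")
```

The estimate varies slightly with the seed and the number of iterations, but it should fall close to the value of .34 reported in the text.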
Returning to the example, we note that the estimated power of our study is lower than desired. The only way to improve the power, given our assumptions, is to increase the number of observations. (For the moment, we only have resources to study 12 subjects.) So, we decide to make four measurements of CNR difference per subject. This strategy will increase the number of observations by a factor of four and will result in an increase in power. However, it is important to realize that this data collection strategy is not the same as increasing the number of subjects by a factor of four, because the four observations within each subject are not independent of one another. Within each subject, the observations are likely to be more similar to each other than to the observations in the other subjects. In statistical terms, this lack of independence is called correlation.
Because of correlation, an additional observation in the same subject does not provide as much additional information as an additional observation in a different subject. The more similar the observations within each subject are, the less additional information will be provided by the repeated observation. If the observations within each subject are identical (100% correlated), then the study would have the same results (and sample size) as it would without the repeated observations, so there would be no benefit from repeating the measurement for the same subjects. Conversely, if the repeated observations within each subject were completely uncorrelated (0% correlation), then the results (and sample size) would be identical to those of a study with the same total number of observations but with enough additional subjects that only one observation per subject is used.
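The two limiting cases can be quantified with the design effect, a standard result from survey sampling that the article does not itself present. The sketch assumes the same number of observations in every subject and the same correlation (here called `rho`) between any two observations within a subject; the function names are hypothetical.

```python
def design_effect(obs_per_subject, rho):
    """Variance inflation caused by within-subject correlation rho."""
    return 1 + (obs_per_subject - 1) * rho


def effective_sample_size(n_subjects, obs_per_subject, rho):
    """Number of independent observations carrying the same information."""
    total_observations = n_subjects * obs_per_subject
    return total_observations / design_effect(obs_per_subject, rho)


# 12 subjects with 4 observations each, at three levels of correlation.
for rho in (0.0, 0.5, 1.0):
    print(rho, effective_sample_size(12, 4, rho))
```

With no correlation the 48 observations count in full, with perfect correlation they are worth only 12, and a moderate correlation of 0.5 gives an effective sample size of 19.2.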
Simulation can easily account for the correlation of the four observations within each subject. The statistical model used is a slight variation of Equation (2):
$$y_{ij} = 2 + \varepsilon_{ij} \tag{3}$$

where $y_{ij}$ is the $j$th observed CNR difference in subject $i$, and $\varepsilon_{ij}$ is the random error associated with each of the 48 observations. As in Equation (2), $\varepsilon_{ij}$ is generated by a normally distributed random-number generator having a mean of 0 and an SD of 4. In Equation (3), however, $\varepsilon_{ij}$ is calculated in such a way that each error term $\varepsilon_{ij}$ is correlated with the other error terms within the same subject. Correlation of the error terms is the mathematic mechanism for generating correlation in the observations. The amount of correlation is indicated by the correlation coefficient $\rho$. In this example, we assume a moderate amount of correlation ($\rho$ = 0.5) between observations made within each subject. The results of the simulation are shown in the Table. With an ordinary t test, there appears to be enough power in the proposed study design (Table). But an ordinary t test is inappropriate in this case because it treats each of the 48 observations as independent, ignoring the correlation between the four observations within each subject. An appropriate method that accounts for correlation is linear regression with an adjustment for clustering. When this type of linear regression is applied instead of the t test, the simulation reveals that the power is actually .5 (Table), which is lower than the desired power of .8. Results of further simulations indicate that increasing the number of subjects from 12 to 22 would result in adequate power (Table).
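A simulation of the correlated design might look like the sketch below. It is not the article's Stata code. Two assumptions are made for illustration: correlated errors are built by mixing a shared subject-level random term with an observation-level term, and a t test on per-subject means stands in for the cluster-adjusted linear regression described in the text. The function name `simulate_clustered_power` is hypothetical.

```python
import numpy as np
from scipy import stats


def simulate_clustered_power(n_subjects=12, obs_per_subject=4, true_diff=2.0,
                             sd=4.0, rho=0.5, alpha=0.05, n_iter=1000, seed=0):
    """Return (naive_power, cluster_aware_power) estimated by simulation."""
    rng = np.random.default_rng(seed)
    # A shared subject-level term plus an observation-level term gives errors
    # with total SD = sd and within-subject correlation = rho.
    subject_part = rng.normal(size=(n_iter, n_subjects, 1))
    obs_part = rng.normal(size=(n_iter, n_subjects, obs_per_subject))
    errors = sd * (np.sqrt(rho) * subject_part + np.sqrt(1 - rho) * obs_part)
    y = true_diff + errors

    # Naive analysis: treat all observations as independent.
    flat = y.reshape(n_iter, -1)
    p_naive = stats.ttest_1samp(flat, popmean=0.0, axis=1).pvalue

    # Cluster-aware stand-in: one mean per subject, then a t test on the means.
    subject_means = y.mean(axis=2)
    p_cluster = stats.ttest_1samp(subject_means, popmean=0.0, axis=1).pvalue

    return float(np.mean(p_naive < alpha)), float(np.mean(p_cluster < alpha))


for n in (12, 22):
    naive, cluster_aware = simulate_clustered_power(n_subjects=n)
    print(f"{n} subjects: naive = {naive:.2f}, cluster-aware = {cluster_aware:.2f}")
```

Under these assumptions the naive analysis reports a misleadingly high rejection rate, whereas the cluster-aware analysis should give a power near .5 with 12 subjects and near .8 with 22 subjects, in line with the values quoted in the text.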
A discussion of statistical tests that adjust for correlation within subjects is beyond the scope of this article. However, without a simple formula for sample size, and even without extensive knowledge of the statistical test to be used, simulation still enabled the accurate determination of power in the preceding example; this demonstrates the utility and generalizability of simulation. In addition, the simulation made it possible to examine the effects of using a potentially inappropriate statistical analysis.
One of the barriers to performing simulation is the requirement of iterative computation, which in turn requires fast computers. This barrier is becoming much less important as the speed of commonly available computers continues to increase. Even when the barrier of computational speed is overcome, simulation is successful only if the assumed statistical model accurately describes the study design being evaluated. Therefore, appropriate attention must be paid to establishing the model's validity. Fortunately, it is often easier to develop a mathematic model for a statistical situation (from which it is a straightforward process to determine power and sample size with simulation) than to search for a specific method or formula, if one even exists, for calculating sample size. In the preceding example, the introduction of correlation substantially increased the complexity of the analysis from a statistical point of view but caused only a minor change in the mathematic model and the subsequent simulation.
| SAMPLE SIZE CALCULATIONS FOR READER STUDIES |
|---|
One approach to sample size analysis in complex ROC studies involves an approximation performed by using the F distribution (12,13). Sample size tables created by using this method have been published (14); this method can also be used to calculate sample sizes for situations not addressed by such tables (Appendix). The method may be used to examine the trade-off between sample size, smallest meaningful difference, and number of readers (Fig 4). For most clinical investigations, it is likely to be difficult to include more than 10 readers or 100 cases. Given these constraints, we see that any ROC study will require at least four readers, even with a large meaningful difference of 0.15 in Az. At the other extreme, the smallest meaningful difference in Az that can be detected with 10 readers and 100 cases is 0.07. These two generalizations are based on many assumptions (Fig 4). More cases or readers are required if the interobserver variability or intraobserver variability is higher than assumed. Fewer cases or readers are required if the average Az (ie, accuracy of the readers) is higher than assumed.
| CONCLUSION |
|---|
At first glance, simulation may appear artificial and therefore suspicious because it relies on an equation and many assumptions about the terms in the equation, particularly the terms related to the variability of the components of the model. It should be noted, however, that similar (although perhaps less complex) mathematic models are the foundation of most statistical analyses, even simple ones like comparing means with a t test. Furthermore, estimates of variance are also required in sample size and power analysis for simple analyses like the t test. The reason for the large number of assumptions in simulations has more to do with the complexity of the data set being simulated than the method of simulation itself.
In addition to the factors usually mentioned as affecting sample size, correlation among observations within groups due to nonindependent sampling can also increase sample size and decrease statistical power. Therefore, when planning the sample size, one should take care to account for potential correlation in the data set.
| APPENDIX |
|---|
Note that Equations (A1) and (A3) have been algebraically rearranged from their published form to isolate the dependent variables for more convenient calculation. All symbols are defined in Table A1. To calculate sample size with these equations, first assign values to the variables in the first section of Table A1, then sequentially substitute the values into Equations (A1) through (A4), using the suggested values of the variables in the second section of Table A1 and values from Tables A2 and A3 where indicated.
… of the two techniques is 0.75. The smallest meaningful difference between the Az values for the two techniques is set to 0.15. Each reader interprets each case once (K = 1), and the study involves an equal number of positive and negative cases (R = 1). The difference in Az (wb) between the most accurate and least accurate observers is estimated to be 0.05. The values for ww, r1, r2, r3, and rb given in Table A1 are those suggested by published reports (14). The calculated sample size (N) is 76.