Special Report
1 From the Depts of Radiology (P.J.F., C.V.J., R.G., D.R., S.M.S.T., S.M.), Obstetrics and Gynecology (C.A., G.D.P., D.P.W.), and Biostatistics (W.J.H., K.H.Z., D.E.S.), University of Rochester Medical Center, 601 Elmwood Ave, Rochester, NY 14642. Received May 7, 1997; revision requested May 27; final revision received Sep 2, 1998; accepted Feb 12, 1999. Supported in part by a grant from Innovations in Patient Care Program, Univ of Rochester Medical Center. Address reprint requests to P.J.F.
| Abstract |
|---|
MATERIALS AND METHODS: CT scans in 98 patients with ovarian carcinoma and 49 women who were disease free were retrospectively reviewed by four experienced blinded radiologists to compare single-observer reading, single-observer reading with an anatomic checklist, paired-observer reading (simultaneous double reading), and replicated reading (combination of two independent readings). Confidence level scoring was used to identify three possible disease forms in each patient: extranodal tumor, lymphadenopathy, and ascites. Patient conditions were then categorized as abnormal or normal.
RESULTS: There were no significant improvements in sensitivity or specificity for classification of patient conditions as abnormal or normal when comparing single-observer interpretation with single-observer interpretation with a checklist or paired-observer interpretation. Although there was no significant improvement in the mean sensitivity (93% vs 94%) by using the replicated reading method, there was a statistically significant improvement in mean specificity (85% vs 79%) for the replicated readings compared with single-observer interpretations (P < .05).
CONCLUSION: Diagnostic aids such as checklists and paired simultaneous readings did not lead to an improved mean observer performance for experienced readers. However, an increase in the mean specificity occurred with replicated readings.
Index terms: Diagnostic radiology, observer performance, 852.12112; Images, interpretation, 852.12112; Ovary, CT, 852.12112; Ovary, neoplasms, 852.32
| Introduction |
|---|
Ovarian carcinoma is estimated to be the fifth leading cancer cause of death in women in the United States (7). While CT has some inherent limitations in the detection of ovarian carcinoma and its metastases, it is often used preoperatively and/or in postoperative follow-up in these patients (8,9). Findings of a previous study (5) of CT in patients with ovarian carcinoma showed that in paired comparisons of three independent reviews of the images in 100 cases, there were up to 26 cases with discrepancy between two reviewers for the presence of a mass. We therefore chose to use CT in the setting of ovarian carcinoma as a model to investigate the effects of four methods of image interpretation on observer performance.
This project was designed to assess the effects of a variety of methods of image interpretation on interobserver agreement, mean sensitivity, and mean specificity. Methods to potentially improve the efficacy of standard single-observer image interpretation included an observer checklist, paired simultaneous reading by two observers, and combining two independent readings in the form of a replicated reading. Comparison of these various methods was performed by using a group of experienced radiologists (R.G., D.R., S.M.S.T., S.M.) in an effort to identify the most efficacious approach to CT interpretations in patients with ovarian carcinoma.
| MATERIALS AND METHODS |
|---|
CT studies from January 1990 to February 1995 in 98 patients with a history of active ovarian cancer and in 19 patients with disease in remission were selected from our gynecologic oncology clinic and surgery records. All patients who presented to the clinic and those who underwent surgery performed by our gynecologic oncologists (including C.A., G.D.P., D.P.W.) were identified from a computer printout of patients examined in this time frame. Thirty women without a history of gynecologic malignancy who had undergone body CT examinations during this same period were selected from the general adult patient population. Therefore, the study comprised 147 patients (98 with abnormal and 49 with normal conditions) who had undergone abdominal and/or pelvic CT.
The 98 patients with abnormal CT findings were the 17 patients with ovarian cancer at initial presentation and the 81 patients with disease recurrence recognizable at CT examination. Normal CT examinations were those in the 19 patients with ovarian cancer in remission and the 30 women with no history of gynecologic malignancy who underwent abdominal or pelvic CT examinations at our hospital for various indications.
The mean age of patients with ovarian cancer was 60 years (SD, 12 years; age range, 30–82 years); the mean age of patients free of disease was 62 years (SD, 14 years; age range, 30–86 years).
The patients with ovarian cancer and those free of ovarian cancer after treatment were identified from consecutive CT examinations that fulfilled specific diagnostic criteria for confirmation of active disease or remission at the time of the CT examination. The inclusion criteria in 58 patients included timely (within less than 1 month) surgical and/or percutaneous biopsy confirmation of CT findings. The inclusion criteria for the remaining 59 patients included findings of serial clinical examinations and serum tumor marker analysis and/or follow-up CT examinations over intervals of at least 6 months that corroborated the findings of the specific CT examinations included in this study.
A total of 261 patients with a history of ovarian cancer were excluded because of lack of complete CT examinations (some patients underwent no CT or only limited CT for percutaneous biopsy guidance), the presence of concomitant disease (eg, other malignancy, abscess, or lymphocele) at CT, insufficient follow-up to document the true status of the disease, and/or surgically documented disease with no recognizable disease at CT.
By using the aforementioned diagnostic information, the confirmation of CT findings for each of the 147 patients was established by means of a consensus review of all data by investigators (P.J.F., C.V.J.) who were not involved in the subsequent blinded comparative CT examination interpretations. By using all available clinical and radiologic data, the CT scans in each of the 147 patients were then reviewed to create the standard interpretations to which the blinded interpretations were compared.
Three forms of possible disease were defined: extranodal tumor mass, nodal disease, and ascites. When an abnormality was present, it was localized to the upper or lower part of the abdomen or to the upper or lower part of the pelvis so that four sites of abnormality were possible for each form of disease. The total numbers of normal and abnormal areas for each of the three forms of disease are presented in Table 1.
All 147 CT examinations (CT HiSpeed Advantage, HiLight Advantage, model 9800, and model 8800 scanners; GE Medical Systems, Milwaukee, Wis) were performed at our hospital between January 1990 and February 1995. More than half the patients (n = 78) received contrast material intravenously, orally, and rectally, with most of the remaining patients receiving contrast material intravenously and orally (n = 46). The remaining examinations (n = 23) were performed with at least one of these three routes of contrast material administration. Most (n = 117) examinations were performed with 10-mm contiguous images, with the remaining examinations performed with a variety of section thicknesses (5 or 7 mm) and scanning intervals (7 or 10 mm).
CT Scan Review and Replicated Reading Method
Images from the 147 CT examinations were intermixed randomly, and the examinations were subdivided into three groups as follows: group 1, 49 examinations; group 2, 49 examinations; and group 3, 49 examinations. Images from these 147 examinations were then reviewed without clinical information in a blinded retrospective manner by four attending cross-sectional imaging radiologists (including R.G., D.R., S.M.S.T., S.M.) in three sessions at three different times. Subsequently, in a fourth reading session, each of four observers reviewed images from 30 of the CT examinations (10 from each of the three original groups of examinations) to allow evaluation of calculations of intraobserver agreement. Each reading session was separated by at least 2 months. This CT scan review method is summarized in Figure 1.
Three direct methods of interpretation were employed in the first three reading sessions, and the four radiologists actively participated in each of these three image reviews by using confidence level scoring. The three direct methods were (a) the standard single-observer approach, (b) the single-observer interpretation with a predesigned CT checklist (Fig 2), and (c) simultaneous consensus interpretation by two observers (paired reading). By using this review format, images from each of the 147 examinations were reviewed three times by each observer (alone, with a checklist, and as pairs of readers) in the first three reading sessions, with each session separated by at least 2 months.
The four individual readers were randomly designated as reader 1, 2, 3, or 4. The pairs of observers for the double-reading sessions were randomly assigned and were constant throughout the study. The four readers varied in duration of specialized body imaging experience from 2 to 10 years when this study was initiated. The random reader assignments coincidentally resulted in pairing the two radiologists least experienced in body imaging (2 and 5 years experience) and the two radiologists most experienced (9 and 10 years).
Prior to the reading sessions, the observers received an overview of the use of a data collection form and confidence level scoring method: 0 indicated definitely absent, 1 indicated probably absent, 2 indicated possibly present, 3 indicated probably present, and 4 indicated definitely present. The observers assessed each patient CT study for the presence of three categories of possible disease by using confidence level scoring for each disease category, regardless of the number of sites involved. These three categories of disease included extranodal tumor mass, nodal disease, and ascites (5). At subsequent questioning, there was also agreement among all the readers that in a "forced choice" situation (normal vs abnormal) a confidence level score of 0 or 1 generally corresponded to a diagnosis of normal and scores of 2–4 corresponded to a diagnosis of abnormal. The division between abdomen and pelvis was defined as the iliac crest, and the division between the upper and lower parts of the abdomen or pelvis was the midway point in each area. If a mass crossed a boundary, it was scored in each area. Actual observer time per examination interpretation was also recorded.
Only those observers who used the CT checklist (Fig 2) in a given session received instruction regarding its use. Observers could use the checklist as a simple reminder to review all listed anatomic areas, or they could formally complete the form and then transfer information to the data collection form. Other detailed information given to all observers included instructions to omit chest cavity findings from their analysis, a review of CT nodal size criteria for this study, and instructions that cysts contiguous with the liver margin and all mesenteric nodules should be considered extranodal tumor masses and that centrally located simple liver cysts, along with hepatic hemangiomas and renal or adrenal lesions, should be considered incidental findings. In addition, a review of a pseudoarbitration method (12) was provided to resolve potential disputes during the paired-observer simultaneous reading sessions.
Statistical Methods
Modeling the confidence level scores. To assess the magnitude and statistical significance of the effects of various factors on the ordinal scores assigned by the readers for each of the three forms of disease, we fit a linear model to the scores (SAS Institute, "SAS/STAT Software: Changes and Enhancements through Release 6.11," SAS Institute, Cary, NC, 1996) (Appendix 1). We thus treated the scores as if continuous just for this purpose; regrettably, software modeling of ordinal scores in a complex design (with pairing) is not to our knowledge currently available. Results from ordinal regression modeling of subsets of the data agreed well with results from continuous modeling.
The results of these analyses helped in guiding the agreement and observer performance analyses. Moreover, other results of tests of significance about various measures of observer performance were in agreement with results of tests associated with this modeling.
Intraobserver and interobserver agreement. Intraobserver agreement information for the single-observer review method for each of the three forms of disease was obtained by comparing earlier single-reader observations (from reading sessions 1–3) with individual reinterpretations of examinations of 30 patients in reading session 4, the final session. By using the various combinations of paired comparisons, the mean interobserver agreement was calculated for each of the three direct interpretative methods (single observer, single observer with anatomic checklist, and paired observers). Intraobserver agreement and interobserver agreement were initially calibrated by using the weighted κ statistic (13), in which we assigned weight w to a disagreement for which the two readings were w confidence levels apart. Also, a simple κ analysis for the three direct interpretative methods was performed for dichotomized patient condition categorizations of normal versus abnormal.
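As a rough illustration of the weighting scheme described above (weight w for two readings that are w confidence levels apart), the following is a minimal Python sketch of a linearly weighted κ. The function name `weighted_kappa` and its arguments are hypothetical and are not taken from the study's software; the formula used is the standard disagreement-weighted form, κ = 1 - (weighted observed disagreement)/(weighted chance-expected disagreement).

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, levels=5):
    """Linearly weighted kappa for two equal-length lists of ordinal scores.

    Scores are assumed to be integers 0..levels-1 (0-4 for the confidence
    scale). A disagreement of w levels receives weight w (an assumption
    matching the description in the text, not the study's actual code).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    joint = Counter(zip(ratings_a, ratings_b))   # observed cell counts
    marg_a = Counter(ratings_a)                  # marginal counts, reading A
    marg_b = Counter(ratings_b)                  # marginal counts, reading B

    # Weighted observed disagreement.
    observed = sum(abs(i - j) * count / n for (i, j), count in joint.items())
    # Weighted disagreement expected by chance from the marginals.
    expected = sum(
        abs(i - j) * (marg_a[i] / n) * (marg_b[j] / n)
        for i in range(levels)
        for j in range(levels)
    )
    if expected == 0:
        # Both readings used one identical score throughout; kappa is undefined.
        return float("nan")
    return 1.0 - observed / expected
```

With dichotomized scores (0 = normal, 1 = abnormal) and `levels=2`, the same function reduces to the simple unweighted κ, because the only possible disagreement has weight 1.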
Sensitivity and specificity. Analysis of sensitivity and specificity was carried out to estimate observer performance in the context of the more commonly used measures of a method's utility. In this analysis, sensitivity and specificity were defined with respect to the true status of the patient's condition (diseased vs normal).
To achieve a dichotomy of yes or no responses (ie, disease present or absent) for this analysis, the observers' confidence level scores for a patient's three possible forms of disease were redefined as absent (score of 0 or 1) or present (score of 2–4). Our readers were in agreement that this was the most appropriate format for dichotomizing their confidence level responses.
The localization data for each of the three forms of disease in each of the four body areas were then tabulated by categorizing the observer responses for detection of disease as true-positive, true-negative, false-positive, or false-negative findings in a comparison with our previously defined standards of reference (see Materials and Methods, Patient Population). Subsequent data analysis for the 98 patients with abnormal and the 49 patients with normal CT findings (in contrast with normal and abnormal areas on CT scans) was accomplished by using a 0 or 1 confidence level score for all of the three forms of disease as a diagnosis of normal and scores of 2–4 for any of the three forms of disease as a diagnosis of abnormal.
The mean sensitivities and specificities for a given reading method were obtained by averaging the respective results of the individual observers within that reading method.
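The dichotomization and patient-level classification rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the data layout (one tuple of three confidence scores per patient, in the order extranodal tumor, nodal disease, ascites) are assumptions made for the example.

```python
def patient_is_abnormal(scores, threshold=2):
    """Call a patient abnormal if any of the three disease scores is 2-4.

    scores: iterable of confidence levels (0-4), one per form of disease.
    """
    return max(scores) >= threshold


def sensitivity_specificity(readings, truth):
    """Return (sensitivity, specificity) for one observer and one method.

    readings: list of score tuples, one per patient (hypothetical layout).
    truth:    list of bools, True if the reference standard is abnormal.
    """
    tp = fn = tn = fp = 0
    for scores, diseased in zip(readings, truth):
        called_abnormal = patient_is_abnormal(scores)
        if diseased and called_abnormal:
            tp += 1
        elif diseased:
            fn += 1
        elif called_abnormal:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def mean_over_observers(values):
    """Average a per-observer measure within one reading method."""
    return sum(values) / len(values)
```

For example, `mean_over_observers` applied to the four single-observer sensitivities would give the mean sensitivity for that reading method.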
When comparing two or more proportions (eg, sensitivities across observers or across interpretative methods), McNemar χ² tests were used. When comparing mean proportions (that is, proportions averaged across observers), mean proportions and differences in mean proportions were evaluated relative to an estimated standard error (Appendix 2). All tests accounted for inherent dependence because several methods were evaluated in the same patients. Effects were considered statistically significant if P values (two-sided) were less than .05 (with no adjustments for multiple comparisons).
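For the two-proportion case, a McNemar χ² comparison of paired classifications in the same patients can be sketched as follows. The text does not state whether a continuity correction was applied, so this sketch assumes the uncorrected statistic with 1 degree of freedom; the function name and inputs are hypothetical.

```python
import math


def mcnemar_chi2(correct_a, correct_b):
    """Uncorrected McNemar chi-square test for two paired binary outcomes.

    correct_a, correct_b: equal-length lists of bools, one entry per patient,
    True if that method (or observer) classified the patient correctly.
    Returns (chi2, two_sided_p). No continuity correction (an assumption).
    """
    # Discordant pairs: only patients on whom the two methods disagree matter.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, no evidence of a difference
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Restricting the lists to the 98 patients with disease compares sensitivities; restricting them to the 49 patients with normal conditions compares specificities.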
Receiver operating characteristic analysis. In addition, receiver operating characteristic (ROC) analyses were performed on the data from the three direct interpretative methods to analyze the confidence level scores without dichotomizing the observer responses. Separate analyses were carried out for each observer and each method, but only pooled (across observers) data curves are presented, as conclusions from the others were similar to those from the traditional sensitivity and specificity analyses. A ROC analysis could not be performed for the replicated reading methods, as the data for this interpretative form were obtained by dichotomizing the original confidence level scores.
ROC curves were constructed by using ROCFIT (Metz CE, Shen JH, Wang PL, Kronman HB, Fortran program ROCFIT, Department of Radiology, University of Chicago, Ill, 1994) and CORROC2 (Metz CE, Shen JH, Kronman HB, Wang PL, Fortran program CORROC2, Department of Radiology, University of Chicago, Ill, 1994) software (14,15) for fitting binormal models to rating data. ROC curves were constructed for each reader or reader pair and for each of the first three direct interpretative methods. When comparing two interpretative methods, CORROC2 software was used, with recognition that data from the same 147 patients were being studied. To construct a summary curve, by summarizing over readers, we used the ad hoc method of simply pooling all readings together as if they were single readings in different patients rather than multiple readings in the same patients. No significance testing was done with these summary curves.
ROC analyses were performed by using the overall classification of patient conditions as abnormal or normal (rather than abnormalities in separate forms of disease). The overall confidence level score (rating) of a patient condition along the confidence level spectrum from abnormal to normal was defined as the highest of the scores for the three forms of disease.
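To make the rating definition concrete, the sketch below takes the highest of the three disease scores as the overall patient rating and derives empirical ROC operating points and an empirical area under the curve. Note that this is a swapped-in, nonparametric illustration: the study fitted binormal models with ROCFIT and CORROC2, which this sketch does not reproduce. All function names are hypothetical.

```python
def overall_rating(scores):
    """Overall patient rating: the highest of the three disease scores."""
    return max(scores)


def empirical_roc(ratings_diseased, ratings_normal, max_level=4):
    """Empirical (FPR, TPR) points, thresholding the 0-4 rating at 4, 3, 2, 1."""
    points = []
    for threshold in range(max_level, 0, -1):
        tpr = sum(r >= threshold for r in ratings_diseased) / len(ratings_diseased)
        fpr = sum(r >= threshold for r in ratings_normal) / len(ratings_normal)
        points.append((fpr, tpr))
    return points


def empirical_auc(ratings_diseased, ratings_normal):
    """Empirical AUC: chance that a diseased patient outranks a normal one.

    Ties count one half (Mann-Whitney form); not a binormal estimate.
    """
    wins = 0.0
    for d in ratings_diseased:
        for n in ratings_normal:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(ratings_diseased) * len(ratings_normal))
```

The point at threshold 2 corresponds to the dichotomy used in the sensitivity and specificity analysis (scores of 2–4 read as abnormal).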
To compare curves, three summary characteristics were considered: area under the curve, true-positive rate at a false-positive rate of 0.1, and false-positive rate at a true-positive rate of 0.9. When comparing characteristics of two curves, significance testing was done by comparing the difference between characteristics with an estimated standard error. When comparing characteristics of three (or four) curves simultaneously, we treated the sum of squared deviations from the mean characteristic divided by a pooled variance estimate as a χ² statistic with 2 (or 3) degrees of freedom. Similar significance testing methods were used for comparing true-positive rates and false-positive rates across readers or across methods. Effects were considered statistically significant if P values were less than .05.
Analysis of the time to read images. Each observer's mean time to read images from an examination for each of the four methods of interpretation was compared to those of the other observers. The mean times to read images from examinations for all observers combined within each of the various reading methods were also compared. The replicated reading session times are the sum of the mean times to read for each single observer that were artificially paired to create the various replicated readings. Some of the time data are not available because the data collection form was simplified during the first reading session; therefore, 60% (294 of 490) of the image interpretation times from reading session 1 were disregarded. Furthermore, the time to read was not recorded by the observers on some occasions.
| RESULTS |
|---|
When comparing confidence level scores among the three direct interpretive methods for each reader, no statistically significant differences were found. Other comparisons of the methods are considered below.
Intraobserver and Interobserver Agreement
Table 2 summarizes the two κ analyses of intraobserver and interobserver agreement for recognition of each of the three forms of disease and for categorization of patient conditions as normal or abnormal for the three direct interpretative methods. There was very good intraobserver agreement for categorization of patient conditions with respect to extranodal tumor masses (mean weighted κ statistic of 0.79) and ascites (mean weighted κ statistic of 0.88). There was less intraobserver agreement among observers for recognition of nodal disease; the mean weighted κ statistic of 0.49 reflects a moderate or fair level of mean intraobserver agreement. Generally, the intraobserver agreement was greater than interobserver agreement for the three possible forms of disease.
κ analysis for intraobserver and interobserver agreement for the dichotomized confidence level scores (disease present or absent) yielded a mean intraobserver κ of 0.69. This was similar to the mean interobserver agreements for each of the three direct interpretative methods, which ranged from 0.64 to 0.75.
Sensitivity and Specificity
The means and ranges across observers for sensitivity and specificity for the various interpretative methods are shown in Table 3. There were statistically significant differences among observers for sensitivity (range, 87%–97%, P < .01) and specificity (range, 65%–92%, P < .01), both for single reading and reading with the checklist, but there were no statistically significant differences among the first three methods for any particular observer. The largest differences in sensitivity and specificity between single readers for classification of patient conditions as abnormal or normal were for observers 1 and 3: Sensitivities for these observers were 97% and 87%, respectively, and specificities were 67% and 92%, respectively.
We estimate (by using information in Table 3 [details omitted]) that the power to detect a 5% improvement in sensitivity in any of the alternative reading methods compared with the single observer method was 90% or more; the power to detect a 3% improvement was around 50%. We estimate that the power to detect a 17% improvement in specificity in any of the alternative reading methods compared with the single-observer method was 90% or more; the power to detect a 10% improvement was around 50%.
ROC Analysis
Figure 3 presents the ROC analyses of abnormal versus normal patient condition assessment for each observer by using each of the three direct interpretative methods; ROC curves of reading method 1 (single observer) were compared with those of reading method 2 (checklist) and with reading method 3 (paired reading). There were no statistically significant differences between these curves or in the areas under the curves for any reader when comparing method 1 with method 2 or method 1 with method 3.
| DISCUSSION |
|---|
Other investigators (1,3,4) have shown a wide spectrum of observer sensitivities and specificities for other radiologic examinations. To our knowledge, much of the previous literature on this topic has been in the area of mammography. In one review (3) of 10 radiologists' independent mammographic interpretations in 150 patients (for a mixture of screening and diagnostic studies), there was a median sensitivity for a diagnosis of "suggestive of cancer" of 70% with a range of 37%–85% and a median specificity of 93% with a range of 85%–99%. Furthermore, in 33% of patients recommended for additional work-up in this study (3), at least one of the 10 radiologists differed with regard to the side of the possible abnormality. In a different project (4) in which the variability in breast cancer detection was assessed for a group of 108 radiologists reviewing images from 79 screening mammographic examinations, a range of sensitivity of 47%–100% and a range of specificity of 36%–99% were found. In our study, the results were less variable than the aforementioned mammographic assessments; the individual reader sensitivities ranged from 87% to 97% and the individual reader specificities ranged from 67% to 92% for classification of patient conditions as abnormal or normal.
Reproducibility of interpretation results is an important feature of diagnostic efficacy. We used the weighted κ statistic for analysis of intraobserver and interobserver agreements or reliabilities for our confidence level scoring system. The critical issue in the comparison of these weighted κ values relates to determining which methods of interpretation will yield the highest level of interobserver agreement. However, neither using the checklist when interpreting studies nor the paired readings yielded an increase in mean interobserver agreement over that for the standard independent interpretations for our experienced readers. Given the lack of improvement in mean sensitivity and specificity for the three direct interpretative methods in our study, a more detailed discussion of the differences in interobserver agreement among those methods becomes moot. We did not assess interobserver agreement for the replicated reading method because by design this method forces interpretations to be in greater agreement.
Several previous investigators (10–12,19–31) have addressed methods ranging from simple to more complex to improve individual observer performances. These methods have included use of various checklists; simultaneous pairing of observers; sequential pairing of observers, with the second independent observer either aware or unaware of the first interpretation; replicated readings; group consensus; group consultation followed by independent diagnoses; the Delphi technique; computerized noninteractive consultation; mathematic combinations of different interpretations; computerized sequential decision making; and computer-assisted interpretations (10–12,19–31). We chose to emphasize three of these methods to potentially improve the single-observer performance of CT interpretation in light of the improved observer performance previously reported with these methods (6,10,12,21,22).
A checklist worksheet has been suggested as a method to improve diagnostic accuracy for body CT examinations (10). It is of interest that, on the basis of our ROC analysis, three of our four readers had poorer results when using a checklist than they did with their independent readings.
Furthermore, on the basis of the ROC analysis, there were variable effects recognized when pairing our observers. There was no benefit to pairing observers 1 and 2. Pairing observers 3 and 4 yielded mean false-positive rates and true-positive rates, with the performance of observer 3 improving at the expense of that of observer 4. The net result was no significant improvement in overall mean sensitivity or specificity. Representative examples of false-negative and false-positive diagnoses in our study are illustrated in Figures 5 and 6.
An earlier investigation compared single readings of chest photofluorograms obtained for tuberculosis screening with a combination of two separate interpretations (replicated readings) to improve accuracy (20). In those cases where the combined single readings differed, a third opinion was used, with an additional 10% improvement in sensitivity when compared with the sensitivity of individual interpretations and without a change in specificity (20).
Sequential double reading, where the second observer often had knowledge of the original interpretation (21), and independent sequential double readings (22) of screening mammograms have yielded increases in cancer detection sensitivity of 10%–15%, with variable effects on specificity. However, using data from their previous mammographic study (4), Beam et al (23) examined the effects of a form of independent double reading and concluded that radiologists may form complementary or noncomplementary pairs. The average radiologist in their study (23) had an increase in the true-positive rate of 0.11 accompanied by an increase in the false-positive rate of 0.07. As in our study, some observer pairings in their study resulted in no change or in small changes in sensitivities and specificities.
The accuracy of radiologic examinations for recognizing abnormalities requires the identification and proper interpretation of various examination findings. Some studies of observer performance have purposely focused on interpretation by identifying the radiologic abnormality for the observers (24,28). In addition, the cognitive methods used to influence observer performance have had varied levels of complexity. We focused on three of the simpler interpretative methods (checklists, paired observers, and replicated readings) supplementary to the standard single-observer approach and used experienced observers who were required to both identify and interpret abnormal findings. Contrary to what might be inferred from most of the published literature, these approaches are not always successful in improving the performance of experienced observers.
There is also a potential cost in time that may occur with any supplemental interpretative methods. It should be noted that in the context of a study such as this, observers' times to read do not incorporate the time involved in developing a protocol or monitoring CT examinations, patient preparation, review of clinical data or prior studies, report dictation, and review or consultation with clinicians. In our investigation, the greatest cost in terms of physician time occurred with paired and replicated readings. In the absence of documented improvements in sensitivity and/or specificity, the cost of paired readings is not justified in this patient population for our readers. Replicated readings, or some modification thereof, may be more justifiable.
The limitations of studies such as ours are well outlined in the project of Elmore et al (3), who assessed variability of mammographic interpretations. These limitations include the effects of a study situation, where radiologists may be more (or less) diligent in their examination reviews than they are in daily practice and where the images from previous examinations are unavailable for comparison (3).
Another limitation of our study was that each disease site (upper vs lower abdomen or pelvis) was not documented with surgery in every case. As our focus was on observers categorizing patients with any evidence of ovarian cancer on CT scans versus patients with no disease on CT scans, the effect of lack of surgical confirmation at all anatomic sites was minimized. We believe our combined verification criteria listed in Materials and Methods were sufficiently accurate standards of reference for the CT findings to enable us to address the objectives of this study.
It is also important to note that these results may apply only to our four readers for this particular indication for CT examination. In addition, the results might have differed if only the more difficult cases had been addressed. Furthermore, a variety of CT scanning protocols were used in our patient population, which may have influenced our results. Continued refinements and standardization of CT protocols should serve to minimize observer variability and improve observer performance.
Most research in radiology focuses on applications of technologic improvements in complex imaging modalities including CT, MR imaging, and ultrasonography. Less effort has been devoted to improving the methods of image interpretation on the part of the radiologist, the human element in the process. Methods to improve interobserver reliability and observer performance for complex examinations, such as CT, may improve on delays in diagnosis, which contribute to delays in hospital discharge and lead to additional costly diagnostic testing. Independent replicated readings slightly increased the mean specificity of the CT interpretations, but this was the most time-consuming interpretative method.
Diagnostic aids such as anatomic checklists and paired simultaneous readings did not lead to improved mean interobserver agreement, sensitivity, or specificity for experienced readers in our study. Potential benefits of various forms of paired reading and other diagnostic aids must be weighed carefully against the increased physician time commitment. While some methods may assist less experienced readers or even experienced individuals for certain possible diseases or radiologic examinations, their universal application may not be efficacious. Certain interpretative methods, such as the integration of independent readings to form a noninteractive replicated reading, may be helpful for improving observer performance for experienced readers.
| Appendix 1 |
|---|
Differing error variances for differing methods and normal-versus-diseased conditions were allowed, as were correlations between random effects involving the same observers. The 27 parameters (seven fixed effects, 11 variances of random effects, three correlation coefficients, and six error variances) were fitted by using mixed-effect linear-model methods; after many parameters were found to be nonsignificant, submodels were fitted.
The true number of areas with abnormalities and the random patient effects removed a major part of the variability of the scores, which enabled inference about the effects of interest, in particular about the reading methods.
| Appendix 2 |
|---|
To assess precision, a standard error is needed. Since all systems were evaluated in the same 98 patients with disease or the same 49 healthy persons, dependencies need to be accounted for. We illustrate with consideration of the difference between mean paired-observer sensitivity and mean single-observer sensitivity in six of the 22 systems.
We labeled the two versions of paired observers as systems 1 and 2 and the four versions of single observers as systems 3-6. We let $p_i$ represent the proportion of the 98 patients with disease classified correctly with system $i$, and, for later use, we let $p_{ij}$ represent the proportion classified correctly with both system $i$ and system $j$. (Note that $p_i = p_{ii}$.) Then, interest centered on the difference $d$ in mean sensitivities: $d = (p_1 + p_2)/2 - (p_3 + p_4 + p_5 + p_6)/4$. Writing this as the inner product of two vectors, we had $d = c'p$, with $c$ having elements $1/2$, $1/2$, $-1/4$, $-1/4$, $-1/4$, and $-1/4$ and with $p$ having elements $p_1$ to $p_6$.
We let $V$ represent the $6 \times 6$ matrix of variances and covariances, with $(i,j)$-element $v_{ij} = \mathrm{cov}(p_i, p_j)$. Then, the variance of the difference $d$ was $c'Vc$ (the double summation $\sum_i \sum_j c_i c_j v_{ij}$). Moreover, the variance $v_{ii}$ of $p_i$ was estimated by using the familiar $p_i(1 - p_i)/n$, while the covariance $v_{ij}$ was $(p_{ij} - p_i p_j)/n$; here $n$ is 98. This enabled estimation of the variance of $d$, and the desired standard error was the square root thereof. Finally, $d$ was evaluated relative to its standard error by reference to a standard normal distribution. (Notice that this method uses two $22 \times 22$ matrices of proportions $p_{ij}$, one for sensitivities and one for specificities.)
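The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `contrast_z_test`, the 0/1 input layout (one row per patient, one column per system), and the toy data are all hypothetical, and the two-sided P value is one common way of referring the ratio to a standard normal distribution.

```python
from math import erf, sqrt


def contrast_z_test(correct, c):
    """Evaluate d = c'p against its standard error sqrt(c'Vc).

    correct: list of per-patient rows; correct[r][i] is 1 if system i
             classified patient r correctly, else 0 (hypothetical layout).
    c:       contrast weights, one per system.
    Returns (d, standard_error, z, two_sided_p).
    """
    n = len(correct)
    k = len(c)
    # p_joint[i][j]: proportion classified correctly by both systems i and j;
    # the diagonal p_joint[i][i] is the plain proportion p_i.
    p_joint = [[sum(row[i] * row[j] for row in correct) / n for j in range(k)]
               for i in range(k)]
    p = [p_joint[i][i] for i in range(k)]
    # v_ij = (p_ij - p_i p_j) / n; on the diagonal this is p_i (1 - p_i) / n.
    v = [[(p_joint[i][j] - p[i] * p[j]) / n for j in range(k)]
         for i in range(k)]
    d = sum(ci * pi for ci, pi in zip(c, p))
    var_d = sum(c[i] * c[j] * v[i][j] for i in range(k) for j in range(k))
    se = sqrt(max(var_d, 0.0))  # guard against tiny negative rounding error
    z = d / se if se > 0 else float("nan")
    two_sided_p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return d, se, z, two_sided_p


# Toy data only (5 hypothetical patients, 6 systems); not study data.
toy = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
]
weights = [0.5, 0.5, -0.25, -0.25, -0.25, -0.25]  # paired mean minus single mean
print(contrast_z_test(toy, weights))
```

Because every system is scored on the same patients, the off-diagonal covariance terms matter: for two systems with weights 1 and -1, the variance depends only on the cases in which the two systems disagree.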
To evaluate differences among several means simultaneously, an appropriate quadratic form is constructed and evaluated against a $\chi^2$ distribution.
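The text does not give the quadratic form itself. One standard choice, shown here only as an assumption, is a Wald-type statistic. The contrast matrix $C$ (with $m$ linearly independent rows, each built like the vector $c$ above) and the symbol $Q$ are notation introduced for this sketch.

```latex
% Assumed Wald-type quadratic form for testing C p = 0 (m contrasts at once).
% p and V are the vector of proportions and its estimated covariance matrix.
\[
  Q = (Cp)' \left( C V C' \right)^{-1} (Cp),
  \qquad
  Q \sim \chi^2_m \ \text{approximately, under } Cp = 0 .
\]
```

With a single contrast ($m = 1$), $Q$ reduces to the square of $d$ divided by its standard error, so the two procedures agree in that case.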
| Footnotes |
|---|
3 Dept of Obstetrics and Gynecology, New York University Medical School, NY.
4 Dept of Obstetrics and Gynecology, Cooper Health System, Camden, NJ.
5 Dept of Health Care Policy, Harvard Medical School and Dept of Radiology, Brigham and Women's Hospital, Boston, Mass.
6 Center for Biostatistics in AIDS Research, Harvard School of Public Health, Boston, Mass.
Author contributions: Guarantor of integrity of entire study, P.J.F.; study concepts, P.J.F., C.V.J.; study design, P.J.F., D.E.S., W.J.H.; definition of intellectual content, P.J.F., W.J.H.; literature research, P.J.F., W.J.H., K.H.Z.; clinical studies, P.J.F., C.V.J., R.G., D.R., S.M.S.T., S.M., C.A., G.D.P., D.P.W.; data acquisition, P.J.F., C.V.J., R.G., D.R., S.M.S.T., S.M., C.A., G.D.P., D.P.W.; data analysis, P.J.F., W.J.H., K.H.Z., C.V.J.; statistical analysis, W.J.H., K.H.Z.; manuscript preparation, P.J.F., W.J.H., K.H.Z., D.E.S.; manuscript editing, P.J.F., C.V.J., W.J.H., R.G., S.M.S.T., G.D.P., D.P.W., K.H.Z., D.E.S.; manuscript review, all authors