Neuroradiology
1 From the Mallinckrodt Institute of Radiology (A.J.J., D.K.K., M.M.B., M.H.G., B.C.P.L., C.J.M., F.J.W.), Department of Medicine (A.J.J., W.D.S., B.L.), and Division of Biostatistics (W.D.S.), Washington University School of Medicine, St Louis, Mo; the Department of Medicine, University of Vermont College of Medicine, Burlington (B.L.); and the Department of Radiology, University of Alabama at Birmingham (A.J.J.). Received April 12, 2000; revision requested June 5; revision received July 28; accepted August 29. Address correspondence to A.J.J., Indiana University Radiology Research Institute, 714 N Senate Ave, 1st Floor, Indianapolis, IN 46202 (e-mail: annejohn@iupui.edu).
| ABSTRACT |
|---|
MATERIALS AND METHODS: Six neuroradiologists interpreted a consecutive sample of 265 MR images in patients suspected of having stroke. Each read reduced-protocol images in a discrete series of 40 patients (one read images in only 15) and corresponding full-protocol images 1 month later (reduced/full protocol). Five of the readers each read images in 10 additional cases, five each as full/full and reduced/reduced protocol controls. κ values between full and reduced protocols, reader assessment of protocol adequacy, confidence level, and need for additional sequences or examinations were evaluated.
RESULTS: In the reduced/full protocol, the κ value for detecting ischemia was 0.797, and that for detecting any clinically important abnormality was 0.635. Statistically similar κ values were found with the full/full control design (κ = 0.802 and 0.715, respectively). The full protocol was judged more adequate than the reduced protocol (2.0 of 5.0 points vs 1.6, P < .001) and generated greater diagnostic confidence (8.6 of 10.0 points vs 8.9, P = .01), less need for additional sequences (2.7 of 6.0 points vs 1.5, P < .001), and more requests for additional examinations (28.4% vs 36.3%).
CONCLUSION: Disagreement between interpretations of reduced- and full-protocol images might be attributable to baseline-level intraobserver inconsistency, as demonstrated in control designs. A greater number of sequences did not lead to greater consistency.
Index terms: Brain, ischemia, 10.78; Brain, MR, 10.121413, 10.121419
| INTRODUCTION |
|---|
In late 1997, 26 MR imaging sequences for brain imaging were available at the Mallinckrodt Institute of Radiology. We performed eight to 10 of these sequences in our protocol (full protocol) for adults suspected of having stroke. Theoretic justification and anecdotal experience dictated which sequences were retained in the stroke protocol over the years. In general, the greater the number of sequences, the greater the imaging time per patient, film usage, and cost and, potentially, the fewer the number of patients undergoing imaging per day. Decreasing the number of sequences in a protocol, especially one as commonly used as that in adult stroke, could potentially decrease costs and patient backlog and increase patient satisfaction (due to decreased scheduling delays and imaging time) and the cost-effectiveness of this diagnostic examination. Routine use of intravenously administered MR imaging contrast agents in our full protocol is also associated with increased cost and some, albeit minimal, risk to the patient because of idiosyncratic or allergic-type reactions.
The purpose of our study was to compare the performance of a reduced (three-sequence) MR imaging protocol with that of a full (eight- to 10-sequence) MR imaging protocol in adults suspected of having stroke. We are ultimately interested in improving our function as diagnosticians. The validity of a diagnostic examination, such as MR imaging, essentially has three components: reliability (or consistency, defined as the capacity of a test to give the same result at repeated application), responsiveness, and discrimination. The latter two were beyond the scope of the current study. In particular, we did not examine the validity of our reduced protocol in discriminating between patient populations. Instead, we focused on reliability, in particular, intraobserver reliability. We suspect that a greater amount of information (ie, the number of images and/or sequences) leads to some improvement in the reliability of interpretations by the same observer. In this study, we compared intraobserver reliability for detecting ischemia and other important abnormalities by using reduced versus full protocols and compared diagnostic confidence level, need for additional sequences, recommendations for additional diagnostic examinations, and adequacy of the examination for diagnosis by using reduced versus full protocols.
| MATERIALS AND METHODS |
|---|
After reviewing the results of this pilot study, five neuroradiology staff members (D.K.K., B.C.P.L., M.M.B., F.J.W., and M.H.G.) chose by means of consensus a reduced protocol of three sequences for the setting of stroke of uncertain age (Figure). (At our institution, the duration of symptoms was typically not known to the technologist or neuroradiologist at the time of imaging.) This reduced protocol is consistent with current literature concerning useful sequences in patients with stroke (4–8), although the exact content of such an ideal protocol will vary over time as newer sequences are developed. This group expressed an interest in adding a fourth "blood-sensitive" sequence such as echo-planar T2*-weighted imaging to the reduced protocol. However, at that time, this sequence was not in routine use at our institution in adult patients with stroke, since we still performed nonenhanced cranial computed tomography (CT) to exclude acute hemorrhage in essentially all patients with stroke; therefore, T2*-weighted imaging was not included in our reduced protocol.
Six neuroradiologists on the medical staff at our institution (D.K.K., M.M.B., M.H.G., B.C.P.L., C.J.M., and F.J.W.) participated in the study as readers who ranged in age from 43 to 66 (mean age, 53) years and in years of neuroradiologic experience from 4 to 30 (mean number of years of experience, 19).
Each reader was assigned a different set of 50 patients so that no two readers read the same patient's images. Each set of 50 patients' images was divided into three subsets: 40 patients in the primary study design, five in the secondary study design, and five in the tertiary study design. In the primary study design (reduced/full), each reader read a subset of 40 images from his series, first as a randomly ordered set of the sequences that formed the reduced protocol, and then 1 month later in the same patients as images from the set of sequences that formed the full protocol. The secondary design (reduced/reduced protocol control) consisted of each reader being shown first the reduced-protocol images in five patients, and then, 1 month later, the identical reduced-protocol images in these same five patients. The tertiary design (full/full protocol control) consisted of each reader being shown first the full-protocol images in five patients, and then, 1 month later, the identical full-protocol images in these same five patients.
The same reader interview (Appendix) was used for all study designs. The interviewer supplied patient age, sex, and history from the requisition for the image. With each reading, the reader was required to offer a diagnosis and a confidence level in that diagnosis on a scale from one (low) to 10 (high). Readers also were questioned regarding the overall adequacy of the images for diagnosis and the desire for additional MR imaging sequences or other diagnostic examinations. The interview was conducted by the first author, who used a standardized interview and data recording format.
Analysis
For detecting ischemia and other clinically important abnormalities, we calculated the percentage of agreement (the number of concordant readings divided by the total number of readings) and κ values. When a finding is rare or common, the percentage of agreement tends to be high by chance alone. The κ value is an adjustment of the percentage of agreement to accommodate chance agreement (9). It is calculated as (observed proportion of agreement − chance-expected proportion of agreement)/(1 − chance-expected proportion of agreement). κ values range from −1.000 (perfect disagreement) to 1.000 (perfect agreement). Deciding quantitative levels for significance of κ values is arbitrary, but the following guidelines for κ values have been suggested: poor (<0), slight (0–0.200), fair (0.210–0.400), moderate (0.410–0.600), substantial (0.610–0.800), and excellent (0.810–1.000) (9,10).
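As a minimal sketch of the calculation described above, the following Python computes the percentage of agreement and κ for one reader's paired binary readings (for example, ischemia present or absent at the first and second reading) and maps κ to the descriptive bands listed. The function names and the example counts are hypothetical and are not study data. Values that fall in the small gaps between the published bands (for example, 0.205) are assigned to the higher band here, which is an assumption.

```python
def cohen_kappa(both_pos, first_only, second_only, both_neg):
    """Return (observed agreement, kappa) for a 2x2 table of paired readings.

    both_pos:    finding reported at both readings
    first_only:  finding reported at the first reading only
    second_only: finding reported at the second reading only
    both_neg:    finding reported at neither reading
    """
    n = both_pos + first_only + second_only + both_neg
    if n == 0:
        raise ValueError("at least one paired reading is required")
    observed = (both_pos + both_neg) / n
    # Chance-expected agreement from the marginal positive rates.
    first_pos = (both_pos + first_only) / n
    second_pos = (both_pos + second_only) / n
    expected = first_pos * second_pos + (1 - first_pos) * (1 - second_pos)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, (observed - expected) / (1 - expected)


def describe_kappa(kappa):
    """Map kappa to the descriptive bands quoted in the text."""
    if kappa < 0:
        return "poor"
    # Upper bounds of each band; gaps between bands go to the higher band.
    for upper, label in [(0.200, "slight"), (0.400, "fair"),
                         (0.600, "moderate"), (0.800, "substantial")]:
        if kappa <= upper:
            return label
    return "excellent"


# Hypothetical counts for 40 paired readings (not study data).
observed, kappa = cohen_kappa(both_pos=20, first_only=3, second_only=2, both_neg=15)
print(f"agreement = {observed:.3f}, kappa = {kappa:.3f} ({describe_kappa(kappa)})")
```

Note that the same percentage of agreement can yield different κ values when the finding is rarer or more common, because the chance-expected agreement changes with the marginal rates.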
We performed the paired t test to compare means for diagnostic confidence level, rates of study adequacy, need for additional sequences, and recommendations for additional diagnostic studies. We also performed the Wilcoxon signed-rank test to evaluate the need for additional sequences, diagnostic confidence level, and rates of study adequacy. Results of these tests were reported as two-tailed analyses. To evaluate the effect of protocol on the rates of recommendation of additional diagnostic studies, we performed the McNemar χ² test of the difference between rates in paired data (11). P values for differences in κ values were calculated by performing the z test (9). STATA version 6 (Stata, College Station, Tex) was used for statistical analysis.
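Two of the tests named above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the STATA procedure used in the study: the McNemar statistic is computed without a continuity correction, the two κ estimates in the z test are treated as independent with standard errors supplied by the caller, and all function names and input values are hypothetical.

```python
import math


def mcnemar_chi2(only_first, only_second):
    """McNemar chi-square (1 df, no continuity correction) and its P value.

    only_first / only_second are the discordant pair counts, e.g. patients for
    whom an additional examination was requested with one protocol only.
    """
    discordant = only_first + only_second
    if discordant == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    chi2 = (only_first - only_second) ** 2 / discordant
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def kappa_difference_z(kappa_a, se_a, kappa_b, se_b):
    """z statistic and two-tailed P value for two independent kappa estimates."""
    z = (kappa_a - kappa_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-tailed normal P value: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical inputs (not study data).
chi2, p_mcnemar = mcnemar_chi2(only_first=25, only_second=9)
z, p_kappa = kappa_difference_z(kappa_a=0.80, se_a=0.05, kappa_b=0.82, se_b=0.12)
print(f"McNemar chi2 = {chi2:.2f}, P = {p_mcnemar:.4f}")
print(f"z = {z:.2f}, P = {p_kappa:.2f}")
```

Only the discordant pairs enter the McNemar statistic, which is why it suits paired designs in which the same patient's images are read under both protocols.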
| RESULTS |
|---|
The reduced protocol performed fairly well, as compared with the full protocol (Table 2). For comparison, this performance was similar to that of the second reading in the full/full protocol control group. The degree of consistency of the reduced/full protocol design in detecting ischemia (κ = 0.797) was similar to that observed within the full/full protocol control design (κ = 0.802). Likewise, the reduced/reduced protocol control had an intrinsic consistency (κ = 0.816) that was similar to that of the full/full protocol control and to that of the reduced/full protocol. The three designs did not differ significantly in κ value (P > .8). (See Tables 3 and 4 for specific findings that represented disagreements in the reduced/full protocol design with regard to ischemia and other clinically important abnormalities, respectively.) Calculated for individual readers, κ values ranged from 0.610 to 0.900.
κ values were relatively poor for the reduced/full, full/full, and reduced/reduced protocol designs (κ = 0.553, 0.457, and 0.242, respectively). The differences in these κ values were not significant (P ≥ .40).
Reliability in Determining the Presence of Any Clinically Important Abnormality
Abnormalities considered clinically important included stroke, hemorrhage (excluding small chronic subdural hematomas), tumor or mass, meningeal disease, vasculitis, multiple sclerosis or other demyelinating disease, aneurysm, occlusion of a major artery, and hydrocephalus. Estimates of reliability were lower for determining the presence of any such abnormality than for detecting ischemia, with κ values of less than 0.750 for the reduced/full, full/full, and reduced/reduced designs (Table 5). The small apparent differences in κ values among the three designs were not significant (P > .6). Calculated for individual readers, κ values ranged from 0.450 to 0.800.
The McNemar χ² test resulted in a P value of .01 for this difference being due to chance alone. The additional diagnostic examinations recommended included MR angiography, CT, conventional catheter angiography, Doppler ultrasonography of the carotid arteries, follow-up MR imaging, cerebrospinal fluid analysis, otolaryngologic consultation, and other miscellaneous examinations.
| DISCUSSION |
|---|
κ values for diagnosing hemorrhage. In 1997, we continued to rely on nonenhanced cranial CT as the criterion standard for diagnosing acute intracranial hemorrhage (12–17). In our study, we did not evaluate the utility of MR imaging for diagnosing hemorrhage. Because the full protocol provides substantially more information than does the reduced protocol (about three times as many images) and because some of that information is redundant (ischemia is often visible with several sequences, for example), we expected the full protocol to provide additional opportunities for the reader to arrive at a consistent diagnosis. However, this did not happen. Regardless of the amount of information (reduced or full protocols), the interpretations were about equally reliable over 1 month.
The full protocol was more often considered adequate and generated greater confidence scores and fewer requests for additional MR imaging sequences than did the reduced protocol. As compared with the readers in the pilot studies performed to construct the reduced protocol, the readers in the primary study were more comfortable with the full protocol than with the reduced protocol.
It may be surprising that the full protocol generated significantly more requests for additional diagnostic testing. Determining the clinical utility of these requests was beyond the scope of this study, whether they were due to greater sensitivity (to additional true-positive abnormalities found by using additional diagnostic information) or to the higher false-positive rate of the full protocol. We are confident that the request for evaluation of the arterial system (with MR imaging, CT, or catheter angiography) would be considered appropriate in most patients with stroke.
Many extraneous sources of instability were eliminated by our research design. In the full/full and reduced/reduced control experiments, the same highly trained and experienced readers viewed exactly the same images of the same patients under similar conditions. The interval between interpretations for all three designs (4–6 weeks) was adequate to avoid reader recognition of patients at repeat reading.
Our study was limited by the relatively small set of readers from one institution. The formal interview may have increased reliability by systematically bringing the readers' attention to the task of identifying ischemia, bleeding, and other abnormalities. Our ability to detect differences in pairs of κ values was limited by our sample size. For ischemia (Table 2) and any clinically important abnormalities (Table 5), differences of about 0.3 would have been detectable, with a P value of less than .05. Thus, the small number of cases in the full/full and reduced/reduced control designs limited statistical power. Because of changes in the clinical protocols in use over time, the full protocol changed minimally from patient to patient over the duration of the study. All patients were chosen from a library of cases that occurred before the routine use of "blood-sensitive" sequences at our institution. It is possible that these sequences or another innovation would have a beneficial effect on reliability. Arguments could certainly be made for including different sequences in either the reduced or the full protocol. Also, sequences that are available at the date of publication of this study (including perfusion imaging) differ from those that were available at the time of our investigation. In our opinion, such innovations merit additional investigation (eg, including tests of reliability and discrimination). Our study focused on recognizing areas for potential improvement in diagnostic function rather than on the latest evolving imaging technology.
We did not perform an accuracy study. We did not attempt to define a final reference diagnosis in each patient and do not know which interpretations best approximated the patient's "true state." Similarly, without long-term follow-up, we were unable to distinguish which lesions resulted in substantial clinical dysfunction. Nonetheless, accuracy, in the sense of concordance with the findings of a reference examination, is not achievable if the index examination is unreliable. An examination that is unreliable cannot be accurate. We note that the reduced protocol is probably not appropriate for patients with a very gradual onset of symptoms, with symptom duration of longer than a few weeks, or with other manifestations suggesting a reasonable possibility of tumor or infection.
The degree of reliability of the interpretations in this study should be put into context. MR imaging has been widely embraced in the setting of stroke because of its superior diagnostic usefulness relative to alternative diagnostic examinations (4,18–21). But are we using this technology to maximal benefit in this clinical setting? Some analysts describe κ statistics in the 0.410–0.600 range as "moderate," in the 0.600–0.800 range as "substantial," and greater than 0.800 as "excellent" (9). However, if the clinical stakes are high, a κ value of 0.800 may represent a level of performance that should perhaps be improved. Misinterpretation of ischemia 8%–10% of the time in patients with stroke (κ = 0.800), as we observed, could mean that thousands of American adults are denied antiischemic agents. As an alternative, it could mean that thousands are inappropriately given these potentially dangerous agents (22–24).
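How a κ of 0.800 relates to a disagreement rate of 8%–10% depends on the chance-expected agreement, which is not stated in this passage. As a rough worked illustration, rearranging the κ definition from the Analysis section, with p_o the observed proportion of agreement and p_e the chance-expected proportion (symbols introduced here for convenience), shows which chance-expected agreement would be implied. The values of p_e below are derived for illustration only and are not reported study results.

```latex
\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\quad\Longrightarrow\quad
p_e = \frac{p_o - \kappa}{1 - \kappa}
\]
\[
\text{10\% disagreement } (p_o = 0.90):\quad
p_e = \frac{0.90 - 0.80}{1 - 0.80} = 0.50
\]
\[
\text{8\% disagreement } (p_o = 0.92):\quad
p_e = \frac{0.92 - 0.80}{1 - 0.80} = 0.60
\]
```

In other words, under these assumptions a κ of 0.800 with 8%–10% disagreement corresponds to chance-expected agreement of roughly 0.50 to 0.60.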
Why are readings not more reliable? Although we did not formally investigate this question, we offer several subjective observations. First, MR image interpretation is difficult and complex. It imposes a large cognitive burden on the reader and demands a high level of concentration. The amount of information resulting from even a single sequence is substantial and may be compounded in larger protocols. Second, readers must often perform without clues from the history (eg, duration of ischemic symptoms), physical examination, laboratory test results, or prior radiographs that would direct their attention to relevant aspects of the images. Third, readers sometimes approach images without a systematic strategy. Little in the environment or process of reporting helps to guide the reader. Images are rarely overread by a second radiologist. Fourth, training in radiology may emphasize differential diagnosis and maximum sensitivity rather than reliability. Neither trainees nor practicing radiologists are routinely informed of their own reliability. Finally, many radiologists and other physicians strive to perform at a high level, without recourse to routine aids such as checklists, guidelines, or standardized reporting criteria.
We wonder whether patients and clinicians expect or should reasonably expect readers to agree with themselves nearly 100% of the time under ideal testing circumstances. We know that even laboratory blood tests are not 100% reliable. However, the fact that 15%–20% of the time in our small control series, under nearly ideal conditions, the same readers gave inconsistent interpretations of major findings such as ischemia or other clinically important abnormalities warrants further investigation. Can similar consistency concerns be demonstrated in other modalities and clinical settings in medicine? Most important, what can be done to improve the reliability (and therefore the clinical utility) of imaging in patients with stroke? Reliability issues have been described in other diagnostic areas (25), but to our knowledge, no comprehensive strategy for addressing them is available. We propose that reliability is an improvable aspect of image interpretation and that changes in training, reporting format, equipment, environment, and physician behavior should be investigated to address this issue.
Our data suggest that in the clinical setting of suspected stroke, interpretations resulting from a reduced MR imaging protocol of three sequences disagreed about 15%–20% of the time with those resulting from a full protocol of eight to 10 sequences for detecting ischemia and other clinically important abnormalities; however, similar disagreement was seen among control interpretations. Readers were more comfortable with the full protocol. However, both protocols had intrinsic inconsistencies and unreliability that may have limited their accuracy and clinical usefulness. Before the diagnostic usefulness of any reduced protocol is further investigated (eg, by performing tests of accuracy and responsiveness to change), the reliability issues raised by less than optimally consistent interpretations in the controls must be investigated.
| APPENDIX |
|---|
Date of MR: ______ Patient #: _______________ History: ____ YO _________________________
Please try not to look at the patient's name while reviewing the scan. Please carefully review the images provided, and then answer the questions below based upon this scan.
Yes 2
No
Yes 2
No 3
Not sure
Hyperacute 2
Acute 3
Subacute 4
Chronic
None 1
Mild 2
Moderate 3
Severe
Small 2
Medium 3
Large
Right 1
Left 2
Bilateral
ACA 2
MCA 3
PCA 4
SCA 5
AICA 6
PICA 7
Other__________
No, I'm comfortable reading the scan as is.
No, even though I'm uncomfortable, I would not call the patient back.
Yes, but only if the patient was still in the scanner.
Yes, but only if the patient was still in the MR suite.
Yes, but only if the patient was still in the hospital.
Yes, even if the patient has left the hospital.
Yes 2
No
Yes 0
No
Thank you for completing this questionnaire.
Date: _____________ NR: _____________