Evidence-based Practice
1 From the Duke Center for Clinical Health Policy Research, 2200 W Main St, Suite 220, Durham, NC 27705 (M.B.P., D.C.M., D.B.M., G.P.S.); Departments of Medicine (D.C.M., D.B.M.) and Community and Family Medicine (G.P.S.), Duke University Medical Center, Durham, NC; Department of Veterans Affairs, Durham, NC (D.C.M., D.B.M.); and Department of Medicine, Geneva University Hospital, Switzerland (O.T.R.). Received December 2, 2002; revision requested February 21, 2003; final revision received August 27; accepted September 29. Supported by the Agency for Healthcare Research and Quality contract No. 29097-0014, task order 7. Address correspondence to M.B.P. (e-mail: meenal.p@duke.edu).
| ABSTRACT |
|---|
MATERIALS AND METHODS: Articles published between 1989 and 2003 were identified in the MEDLINE, CINAHL, and HealthSTAR databases. Articles were selected if FDG PET was performed with a dedicated scanner and the resolution was specified, if standard criteria were used for the diagnosis of Alzheimer disease, if at least 12 human subjects with Alzheimer disease were enrolled in the study, if clinical diagnosis or histopathologic findings were used as the reference standard, and if sufficient data were provided to construct a 2 x 2 table. Two reviewers independently abstracted data regarding the operating characteristics (sensitivity and specificity) of PET and evaluated the study quality. A meta-analysis was performed by constructing a summary receiver operating characteristic curve and by combining the sensitivity and specificity values by using a random-effects model.
RESULTS: Fifteen articles met the inclusion criteria. Their sensitivity and specificity estimates were heterogeneous; this heterogeneity was not related to study quality features and had no plausible explanation. The summary sensitivity of PET was 86% (95% CI: 76%, 93%), and the summary specificity was 86% (95% CI: 72%, 93%).
CONCLUSION: The specificity and sensitivity of FDG PET are limited by both study design and patient characteristics. Therefore, the clinical value of these parameters is uncertain; future research on the use of PET in the diagnosis of Alzheimer disease needs to focus on current limitations to be of practical relevance in clinical settings.
© RSNA, 2004
Index terms: Alzheimer disease; Positron emission tomography (PET), comparative studies; Receiver operating characteristic (ROC) curve
| INTRODUCTION |
|---|
The current standard for the diagnosis of Alzheimer disease recommended by the American Academy of Neurology, or AAN, is based on clinical evaluation and includes a complete patient history, physical and neuropsychiatric evaluation, and screening with laboratory testing. Therefore, the diagnostic criteria of the National Institute of Neurological and Communicative Disorders and Stroke/Alzheimer's Disease and Related Disorders Association (NINCDS/ADRDA) and the Diagnostic and Statistical Manual with revised diagnostic criteria for Alzheimer disease (4) are sufficiently reliable and valid, and they should be used for clinical evaluation (5). The criteria also call for the use of anatomic neuroimaging with computed tomography (CT) or magnetic resonance (MR) imaging without the use of contrast material (2). Although not currently recommended in the routine evaluation of dementia, other tests such as functional neuroimaging have also been proposed for the evaluation of individuals who may have Alzheimer disease. Two approaches to functional neuroimaging are single photon emission CT (SPECT) and positron emission tomography (PET) with use of markers for cerebral blood flow or glucose metabolism.
SPECT and PET can demonstrate metabolic abnormalities that correlate with the expected anatomic pattern of involvement in Alzheimer disease (bilateral hypometabolism of temporal and parietal lobes). These abnormalities can be differentiated from findings suggestive of vascular cause (asymmetric and focal abnormalities) and from frontal lobe or temporal lobe dementias (hypometabolism of frontal or temporal lobes with sparing of parietal lobes). They can also be differentiated from findings in persons without dementias (5,6).
Current therapies for Alzheimer disease are aimed at symptomatic relief and slowing of disease progression. Further refinement in treatment strategies may halt progression of the disease (7), however, and treatment will be especially valuable if it can be applied before functional dementia sets in.
Despite our current limitations in the treatment of Alzheimer disease, the diagnosis is evidently an important issue. We therefore decided to study the operating characteristics (sensitivity and specificity) of fluorine 18 fluorodeoxyglucose (FDG) PET, a test that is being used for this purpose, in the diagnosis of Alzheimer disease.
| MATERIALS AND METHODS |
|---|
Two reviewers, one from each team (M.B.P., D.C.M., D.B.M., O.T.R.), reviewed the abstracts of all retrieved articles. If an abstract was selected by either of the two reviewers, the full-text version of the article was obtained. References from these articles were also examined, and pertinent articles were acquired.
Study Eligibility
We, the methodologists, employed the following inclusion criteria to select full-text articles for data abstraction: (a) Articles had to be written in English, include primary data, and be published in a peer-review journal; (b) studies had to include at least 12 human subjects with the disease of interest; (c) for studies of PET operating characteristics, either clinical diagnosis (according to standard criteria of the NINCDS/ADRDA or the Diagnostic and Statistical Manual [4]) or histopathologic diagnosis had to be used as the reference standard; and (d) sufficient data had to be provided either directly or indirectly through a 2 x 2 table to be able to calculate point estimates and 95% CIs for the operating characteristics of sensitivity and specificity. This last criterion implies that patients with and those without Alzheimer disease and patients with positive and those with negative PET results were included in the study. Two reviewers of the team (M.B.P., D.C.M., D.B.M., O.T.R.) independently evaluated articles for inclusion and resolved any disagreements by means of discussion.
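Criterion (d) can be illustrated with a minimal sketch of how sensitivity, specificity, and approximate 95% CIs follow from the four cells of a 2 x 2 table. The cell counts below are hypothetical, and the Wilson score interval is an assumption chosen for illustration; the article does not state which interval method was used.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def operating_characteristics(tp, fp, fn, tn):
    """Sensitivity and specificity, with intervals, from 2 x 2 cell counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical study: 20 patients with Alzheimer disease, 20 control subjects.
example = operating_characteristics(tp=18, fp=3, fn=2, tn=17)
print(example)
```

A study that reports only patients with disease, or only positive PET results, leaves at least one cell empty, which is why the criterion requires all four groups.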
Data Abstraction
We, the methodologists, developed data abstraction forms, and two reviewers (one from each team) independently completed the abstraction of key data from the articles selected in the previous step. This included 2 x 2 tables and other data used in the analysis of the operating characteristics of the test. Data from the two reviewers were compared, reconciled, entered into a computer database by one reviewer of the team, and verified by another.
Study Quality
To assess the quality of the studies selected, we devised a rating scale that consisted of eight criteria. (We used the opinion of a nuclear radiologist to derive technical quality criteria for FDG PET.) The criteria were the following: (a) The scanner model or type and resolution of the scanner were mentioned, (b) the setting and selection of the population under investigation were clearly described, (c) the study had a representative sample of patients with an appropriate spectrum of disease, (d) the results were categorized by disease severity, (e) standard criteria were used for image interpretation, (f) histopathologic or clinical confirmation was performed by using standard criteria (eg, NINCDS/ADRDA or Diagnostic and Statistical Manual criteria [4] were used on the basis of long-term follow-up of 1 year or more), (g) follow-up was completed (there was no verification bias), and (h) the image reader and the person who assigned the reference standard diagnosis were blinded to clinical diagnosis.
For each of these criteria, a score of 0 or 1 was assigned. A score of 0 was assigned if the study did not adequately meet the criterion or if the data were inadequate to determine the criterion, and a score of 1 was assigned if the study met the criterion. The scores were added to give a final quality score for the study. When the full-text articles were reviewed, the two reviewers, one from each team, scored the articles. If there was a disagreement in scores, they resolved it by means of discussion.
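As a sketch of the scoring rule just described, the snippet below sums eight 0-or-1 ratings into a total quality score. The criterion names and the example ratings are hypothetical labels introduced for illustration; they are not taken from the abstraction forms.

```python
# Hypothetical short names for the eight criteria (a) through (h).
CRITERIA = (
    "scanner_described",
    "setting_and_selection_described",
    "representative_spectrum",
    "results_by_severity",
    "standard_image_interpretation",
    "reference_standard_confirmed",
    "follow_up_complete",
    "blinded_reading",
)


def quality_score(ratings):
    """Sum of per-criterion scores.

    ratings maps a criterion name to True (met), False (not met), or
    None (data inadequate to decide). Only True scores 1; False, None,
    and missing entries all score 0, as in the rule described above.
    """
    unknown = set(ratings) - set(CRITERIA)
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")
    return sum(1 for name in CRITERIA if ratings.get(name) is True)


# Hypothetical ratings for one study.
example_ratings = {
    "scanner_described": True,
    "setting_and_selection_described": True,
    "representative_spectrum": False,
    "results_by_severity": None,  # not determinable from the article
    "standard_image_interpretation": True,
    "reference_standard_confirmed": True,
    "follow_up_complete": False,
    "blinded_reading": True,
}
example_score = quality_score(example_ratings)
print(example_score)
```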
Data Synthesis and Statistical Analysis
We, the methodologists, derived summary statistics of the studies by means of the total quality score and also by means of various components of the quality score. We also observed some features of the studies that were not included in the quality score. Since these features were relevant, we derived summary statistics for the following: (a) studies that included the more stringent criterion of patients with probable Alzheimer disease versus those with possible Alzheimer disease and (b) studies in which clinical diagnosis was used as the reference standard versus those in which histopathologic findings were used as the reference standard.
We divided the studies into two major groups: (a) studies in which patients with Alzheimer disease were compared with healthy control subjects and (b) studies in which patients with Alzheimer disease were compared with those with dementias not caused by Alzheimer disease. We performed meta-analyses of the first group of studies to quantify the sensitivity and specificity of PET in the diagnosis of Alzheimer disease. We described a single set of operating characteristics (sensitivities and specificities) for PET across multiple studies by constructing a summary receiver operating characteristic (ROC) curve, and we separately combined the sensitivity and specificity values across studies by using a random-effects model. This results in wider 95% CIs than those with a fixed-effects model and is therefore more conservative (10).
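The effect of a random-effects model on interval width can be sketched as follows. This is an illustration only: it assumes DerSimonian-Laird pooling of logit-transformed proportions with a 0.5 continuity correction, which may differ from what the software used in this study does, and the study counts are hypothetical.

```python
import math


def pool_logit(events, totals, random_effects=True, z=1.96):
    """Pool proportions (e.g. per-study sensitivities) on the logit scale.

    events[i] / totals[i] is the proportion in study i. With
    random_effects=True the DerSimonian-Laird between-study variance
    (tau2) is added to each within-study variance before weighting.
    """
    k = len(events)
    y, v = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, n - e + 0.5  # continuity correction (assumption)
        y.append(math.log(a / b))
        v.append(1.0 / a + 1.0 / b)
    w = [1.0 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if (random_effects and c > 0) else 0.0
    w_star = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))

    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "estimate": expit(pooled),
        "ci": (expit(pooled - z * se), expit(pooled + z * se)),
        "se_logit": se,
        "tau2": tau2,
    }


# Hypothetical true-positive counts and numbers of diseased patients.
tp = [19, 12, 30, 8]
diseased = [20, 20, 31, 16]
re_fit = pool_logit(tp, diseased, random_effects=True)
fe_fit = pool_logit(tp, diseased, random_effects=False)
print(re_fit["ci"], fe_fit["ci"])
```

On the logit scale the random-effects standard error can never be smaller than the fixed-effects one, because the between-study variance is added to every within-study variance.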
The ROC curve is a plot of pairs of possible combinations of true-positive and false-positive ratios achievable with a test as the positivity criterion is varied. The curve is thereby used to evaluate overall test performance independent of the ultimately chosen decision criteria. The area under the ROC curve is a summary measure of test performance (11–13).
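A minimal sketch of this idea follows: the positivity threshold is swept across all observed test values, each threshold yields one (false-positive ratio, true-positive ratio) pair, and the area under the resulting curve is computed with the trapezoid rule. The scores are hypothetical, and the convention that higher scores indicate more abnormal findings is an assumption made for the example.

```python
def roc_points(diseased_scores, healthy_scores):
    """(false-positive ratio, true-positive ratio) pairs as the threshold drops."""
    thresholds = sorted(set(diseased_scores) | set(healthy_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for t in thresholds:
        tpr = sum(s >= t for s in diseased_scores) / len(diseased_scores)
        fpr = sum(s >= t for s in healthy_scores) / len(healthy_scores)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical abnormality scores (higher = more abnormal).
diseased = [0.9, 0.8, 0.75, 0.6, 0.4]
healthy = [0.7, 0.5, 0.3, 0.2, 0.1]
auc = area_under_curve(roc_points(diseased, healthy))
print(auc)
```

An area of 1.0 corresponds to perfect separation of the two groups, and 0.5 to a test that performs no better than chance.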
Statistical analyses involving both the random-effects model and the summary ROC curve were performed by using a computer program (Meta-Test version 0.6; Lau Joseph, Boston, Mass) (14). We report 95% CIs with all estimates.
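The internals of the program are not described here. As an illustration of how a summary ROC curve can be built from per-study results, the sketch below uses the Moses-Littenberg linear regression approach, which is an assumption and not necessarily the method implemented in Meta-Test. Each study contributes one (true-positive ratio, false-positive ratio) pair; the study values are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_sroc(studies):
    """Least-squares fit of D = a + b * S across studies.

    studies is a list of (tpr, fpr) pairs, each strictly between 0 and 1.
    D = logit(tpr) - logit(fpr) is the log diagnostic odds ratio;
    S = logit(tpr) + logit(fpr) reflects the positivity threshold.
    """
    d = [logit(t) - logit(f) for t, f in studies]
    s = [logit(t) + logit(f) for t, f in studies]
    n = len(studies)
    mean_s, mean_d = sum(s) / n, sum(d) / n
    sxx = sum((si - mean_s) ** 2 for si in s)
    if sxx == 0:
        raise ValueError("studies need at least two distinct S values")
    b = sum((si - mean_s) * (di - mean_d) for si, di in zip(s, d)) / sxx
    a = mean_d - b * mean_s
    return a, b


def sroc_tpr(fpr, a, b):
    """Expected true-positive ratio on the fitted curve (requires b != 1)."""
    return inv_logit(a / (1.0 - b) + (1.0 + b) / (1.0 - b) * logit(fpr))


# Hypothetical per-study (tpr, fpr) pairs.
hypothetical_studies = [(0.90, 0.10), (0.85, 0.20), (0.95, 0.30), (0.80, 0.05)]
a_hat, b_hat = fit_sroc(hypothetical_studies)
curve = [(f / 20.0, sroc_tpr(f / 20.0, a_hat, b_hat)) for f in range(1, 20)]
print(a_hat, b_hat)
```

Studies reporting a ratio of exactly 0 or 1 would need a continuity correction before the logit transform; that step is omitted here.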
For the articles that addressed PET performance in patients with Alzheimer disease compared with that in healthy control subjects, we plotted studies in ROC space with and without the following characteristics: (a) patients with mild Alzheimer disease included, (b) standard criteria for interpretation of PET images used, (c) quantitative criteria used for interpretation of PET images, (d) PET images interpreted by readers who were blinded to clinical diagnosis, and (e) patients with probable Alzheimer disease included. We also evaluated the association of PET operating characteristics with the total quality score.
| RESULTS |
|---|
Eight of the 15 (53%) studies were conducted in tertiary care settings (15,16,18–23). In the seven studies that did not explicitly mention the setting, information in the text suggested that these were also tertiary care centers.
Four of the 15 (27%) studies (15,22,24,25) included a representative sample of patients with mild dementia to severe dementia, with a clinical dementia rating of 1–3 (26). Two (13%) studies involved subclassification of results according to the degree of dementia (15,25). Two studies included patients with mild cognitive impairment, based on a clinical dementia rating of 0.5 (the operative definition of mild cognitive impairment) (15,16). However, investigators in these two studies did not report separate results of PET in mild cognitive impairment.
All investigators verified the presence or absence of dementia by means of clinical examination. In addition, investigators in two of the 15 (13%) studies confirmed the diagnosis of Alzheimer disease by means of histopathologic examination after a follow-up period (16,23). In one study, investigators verified the diagnosis by means of histopathologic examination in only five of 65 patients with Alzheimer disease (19).
The image reader and the person who assigned the diagnosis based on the reference standard were blinded to clinical diagnosis in eight (53%) studies (15,16,19,22,24,25,27,28).
Two study features were not evaluated in the quality score, but they deserve special mention. First, 12 of the 15 studies included patients with a clinical diagnosis of probable Alzheimer disease according to NINCDS/ADRDA criteria (15–21,23,27–30). One study also included patients with possible Alzheimer disease (23); two studies did not classify patients as having possible or probable Alzheimer disease according to NINCDS/ADRDA criteria (22,24). In 13 studies, NINCDS/ADRDA clinical criteria were used as the standard for assessing the value of FDG PET; histopathologic examination was used as the standard in only two studies (16,23). The sensitivity of PET in these two studies ranged from 88% to 95%, and the specificity ranged from 62% to 74%.
Another study aspect that needs to be mentioned because of its potential effect on the estimated diagnostic performance of FDG PET is variability in the condition of patients in each study. In six studies, the performance of PET in patients with Alzheimer disease was compared with that in patients with dementias of other causes (Tables 2, 3). Four articles addressed two comparisons each: PET performance in Alzheimer disease versus non-Alzheimer disease dementias and PET performance in patients with Alzheimer disease versus healthy subjects (16,22–24). Five of these six articles provided the operating characteristics of PET in the differentiation of Alzheimer disease dementias and specific causes of non-Alzheimer disease dementias, including multiinfarct dementia (22), vascular dementia (17), and dementia with Lewy bodies (18,24,28). In these studies, the sensitivity of PET ranged from 75% to 92%, and the specificity ranged from 18% to 86%. The remaining study included all other causes of non-Alzheimer disease dementias, and this group was compared with the Alzheimer disease dementia group. That study included patients with mild to moderate dementia exclusively, with a PET sensitivity of 94% and a specificity of 73%. The investigators in that study did not clearly report the distribution of dementia severity either globally or by means of a clinical dementia rating (16).
We selected 15 studies for detailed review. However, five of these did not explicitly report sensitivity and specificity values. Three provided plots of metabolic ratio for patients with Alzheimer disease and control subjects (18,21,28), and one presented the ROC curves for the test (24). None of these studies allowed us to estimate with reasonable certainty the number of subjects in each cell of a 2 x 2 table because of the resolution of the graphs and the lack of numeric data.
The remaining nine studies were deemed suitable for meta-analysis (15,17,19,20,22,24,25,29,30).
Meta-Analysis
In the nine studies in which patients with Alzheimer disease were compared with healthy control subjects, the sensitivity ranged from 61% to 100%, and specificity ranged from 54% to 100%. A random-effects model was used to calculate a pooled sensitivity estimate of 86% (95% CI: 76%, 93%) and a pooled specificity estimate of 86% (95% CI: 72%, 93%).
The nine studies provided estimates of test outcomes that were quantitatively heterogeneous, whether evaluated as sensitivity or specificity or as a summary statistic of an ROC curve. To assess possible explanations for this heterogeneity in operating characteristics, we used the ROC curve to plot studies with and those without the following characteristics: patients with mild Alzheimer disease included, patients with probable Alzheimer disease included, standard criteria for interpretation of PET used, quantitative criteria used for interpretation of PET, PET images interpreted by readers who were blinded to clinical diagnosis, and association of operating characteristics with the total quality score. Neither these features nor the total quality score of the studies appeared to explain the observed variations in operating characteristics of PET.
We also analyzed separately the 12 studies that included patients with the stringent NINCDS/ADRDA criteria of probable Alzheimer disease and the one study that included patients with possible Alzheimer disease. We did not observe any evident trend in the operating characteristics of PET.
We did not perform meta-analysis on the six studies in which Alzheimer disease dementias and those with other causes were compared, since five studies involved comparison of PET performance in patients with Alzheimer disease dementias with that in patients with dementias with other causes. We plotted the specificity and sensitivity of PET in these studies (Figure, part b), and PET appears to have a lower overall specificity in these situations.
| DISCUSSION |
|---|
The specific limitations of these studies lie in the use of an imperfect reference standard, presence of verification bias, and inconsistencies in the choice of spectrum of disease, which introduce bias and limit generalizability (31,32).
The reference standard used in most identified studies is clinical diagnosis; this is also the standard used in clinical trials and in assessment of response to treatment. When compared with histopathologic findings, arguably the standard for the diagnosis of Alzheimer disease, clinical diagnosis is an imperfect reference standard; the sensitivity of clinical diagnosis of possible Alzheimer disease according to the NINCDS/ADRDA criteria is 81%, while the specificity is 70% when compared with an autopsy criterion standard (5). Studies have shown that when the reference standard provides imperfect information, the estimated ROC curve gives a biased estimate of the true ROC curve (33). However, it is relevant to note that even autopsy may not provide a perfect diagnosis, and a definitive diagnosis may never be obtained (34). For purposes of assessing the diagnostic performance of a test, it is essential to recognize that results depend on the chosen reference standard, and interpretation ultimately depends on how well that reference standard serves the clinical objective.
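The size of this bias can be sketched with a short calculation. The 81% sensitivity and 70% specificity of clinical diagnosis come from the text above; the true accuracy of the index test (90% and 90%), the 50% prevalence, and the assumption that test errors and reference errors are independent given true disease status are all hypothetical choices made for illustration.

```python
def apparent_accuracy(prev, test_sens, test_spec, ref_sens, ref_spec):
    """Sensitivity and specificity of a test as measured against an
    imperfect reference, assuming the two err independently given the
    true disease status (an illustrative assumption)."""
    # Joint probabilities of (test result, reference result).
    both_pos = prev * test_sens * ref_sens + (1 - prev) * (1 - test_spec) * (1 - ref_spec)
    ref_pos = prev * ref_sens + (1 - prev) * (1 - ref_spec)
    both_neg = prev * (1 - test_sens) * (1 - ref_sens) + (1 - prev) * test_spec * ref_spec
    ref_neg = prev * (1 - ref_sens) + (1 - prev) * ref_spec
    return both_pos / ref_pos, both_neg / ref_neg


# Hypothetical index test (90%/90%) judged against clinical diagnosis (81%/70%).
app_sens, app_spec = apparent_accuracy(
    prev=0.5, test_sens=0.90, test_spec=0.90, ref_sens=0.81, ref_spec=0.70
)
print(round(app_sens, 3), round(app_spec, 3))
```

Under these assumptions both measured values fall below the true ones. If the test and the reference tend to make the same errors, the bias can run in the opposite direction, so the sketch shows only one possible scenario.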
Verification bias was a common problem in the identified studies. Verification bias occurs when only some patients are evaluated with the reference standard (35), and only these patients are used to estimate the accuracy of the test. When verification is limited to patients with positive PET findings or with other characteristics that suggest a diagnosis of Alzheimer disease, this can lead to substantial overestimation of sensitivity and over- or underestimation of specificity. A longitudinal FDG PET study (36) used to predict future cognitive decline and mild cognitive impairment among healthy elderly patients portrayed another problem that can arise with incomplete verification. Since verification of the presence or absence of disease with long-term follow-up was done for all patients in just two of the studies analyzed, it is possible that some patients with false-positive findings would actually have had true-positive findings, had adequate follow-up been performed for all patients in all studies.
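A small worked example makes the direction of this bias concrete. The cohort below is hypothetical: 100 patients with and 100 without disease, a test with a true sensitivity and specificity of 80%, and a reference standard applied to every test-positive patient but to only a quarter of test-negative patients.

```python
def verified_accuracy(tp, fp, fn, tn, verify_pos=1.0, verify_neg=1.0):
    """Sensitivity and specificity computed only from verified patients.

    verify_pos and verify_neg are the fractions of test-positive and
    test-negative patients who receive the reference standard.
    """
    v_tp, v_fp = tp * verify_pos, fp * verify_pos
    v_fn, v_tn = fn * verify_neg, tn * verify_neg
    return v_tp / (v_tp + v_fn), v_tn / (v_tn + v_fp)


# Hypothetical full cohort: true sensitivity 80/100, true specificity 80/100.
cohort = dict(tp=80, fp=20, fn=20, tn=80)

full_sens, full_spec = verified_accuracy(**cohort)
biased_sens, biased_spec = verified_accuracy(**cohort, verify_pos=1.0, verify_neg=0.25)
print(full_sens, full_spec)
print(biased_sens, biased_spec)
```

In this scenario, preferential verification of positive results raises the apparent sensitivity and lowers the apparent specificity, even though the test itself is unchanged.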
The greatest limitation of the studies on the evaluation of the operating characteristics of PET in the diagnosis of Alzheimer disease was the lack of a representative sample of patients. This can limit the generalizability of the study results of PET and can introduce substantial bias in the estimates of test performance. In most studies, evaluation of PET was performed in neurology clinics or in settings that appeared from the description to be tertiary care. However, it is reasonable to expect that a substantial number of test candidates will come from primary care settings. Even in the absence of a sample of patients from primary care settings, this problem could be addressed by enrolling a complete spectrum of patients, ranging from those with mild cognitive impairment to those with severe dementia, and by analyzing PET results according to disease severity. This was not done in most studies of PET.
Failure to study a representative group of patients can lead to spectrum bias (31). As has been shown for treadmill testing in the context of possible coronary artery disease, comparison of unwell patients with well patients can lead to substantial overestimates of test sensitivity and specificity. Most studies of PET involved the use of healthy patients for comparison. Notably, test performance in studies of FDG PET to distinguish dementia caused by Alzheimer disease from all other causes suggests that spectrum bias is a concern in the PET literature. Since our meta-analysis results indicate that PET has a similar sensitivity but a lower specificity when patients with dementia caused by Alzheimer disease are compared with those with all causes of dementia than when patients with and those without dementia are compared, the resulting estimates of operating characteristics are likely an upper bound.
To conclude, this study provides a preliminary estimate of the operating characteristics of FDG PET in Alzheimer disease, as well as guidance for future research. Although the methodologic problems of available studies create substantial limitations, the meta-analysis-derived estimates of sensitivity and specificity can be treated as a reasonable approximation of the upper bound of test performance. We have used a random-effects model for meta-analysis, and this could bias the analysis because, as opposed to the fixed-effects model, it gives smaller studies proportionally greater weight in the pooled estimate (10,11). However, the random-effects model will provide a more conservative estimate than the fixed-effects model unless the analysis is affected by publication bias, in which case it will exaggerate the value of PET.
The ultimate aim of any diagnostic strategy is to establish its value; future research therefore needs to be carefully planned with that end in mind. An innovative study design has been suggested for the purpose of evaluating diagnostic imaging technologies. A randomized controlled trial that reflects the effect of technology on the clinical decision-making process and outcome measures that track trends in outcomes over time may be two of the most relevant components of the study design (37,38).
The Standards for Reporting of Diagnostic Accuracy, or STARD, initiative suggests a checklist that should be used to improve accuracy and completeness when reporting studies of diagnostic accuracy (39). Future researchers who evaluate the role of PET in dementia would benefit immensely from studying this list and considering the utility of their study design before planning and reporting their findings.
In addition to the broader suggestions mentioned, in the current study we identified limitations in existing research and have suggested the following specific opportunities for improvement. (a) The sample of patients in the test evaluation should be representative of those who would be test candidates in actual practice. For example, the test should be evaluated in patients from a variety of clinical settings with suspected dementia as opposed to patients from specialty clinics with evident dementia and nonimpaired control subjects. (b) Clear thresholds should be identified for interpretation of presence of disease. (c) Long-term follow-up should be performed to verify the presence or absence of Alzheimer disease. (d) Results should be analyzed according to severity of cognitive impairment.
As therapeutic options expand for individuals with Alzheimer disease, particularly those with earlier stages of disease, accurate diagnosis of disease in individuals with cognitive impairment and those at risk for Alzheimer disease will become increasingly important. Consequently, well-designed studies of new tests will become an even greater imperative.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Author contributions: Guarantors of integrity of entire study, M.B.P., D.C.M., D.B.M.; study concepts and design, all authors; literature research, M.B.P., D.C.M., D.B.M., O.T.R.; data acquisition, M.B.P., D.C.M., D.B.M.; data analysis/interpretation, all authors; statistical analysis, M.B.P., D.C.M., D.B.M., G.P.S.; manuscript preparation, M.B.P., D.C.M., D.B.M., G.P.S.; manuscript definition of intellectual content, all authors; manuscript editing, M.B.P., D.C.M., D.B.M., G.P.S.; manuscript revision/review and final version approval, all authors