Breast Imaging
1 From the Department of Radiology, University of Michigan Medical Center, CGC B2102, 1500 E Medical Center Dr, Ann Arbor, MI 48109-0904. From the 2003 RSNA Annual Meeting. Received August 31, 2005; revision requested November 3; revision received March 9, 2006; accepted April 4; final version accepted June 6. Supported in part by U.S. Army Medical Research Materiel Command grant DAMD17-01-1-0328 and by U.S. Public Health Service grants CA095153 and CA091713. Address correspondence to B.S. (e-mail: berki{at}umich.edu).
| ABSTRACT |
|---|
Materials and Methods: Informed consent and institutional review board approval were obtained. Our data set contained 3D US volumetric images obtained in 101 women (average age, 51 years; age range, 25–86 years) with 101 biopsy-proved breast masses (45 benign, 56 malignant). A computer algorithm was designed to automatically delineate mass boundaries and extract features on the basis of segmented mass shapes and margins. A computer classifier was used to merge features into a malignancy score. Five experienced radiologists participated as readers. Each radiologist read cases first without computer-aided diagnosis (CAD) and immediately thereafter with CAD. Observers' malignancy rating data were analyzed with the receiver operating characteristic (ROC) curve.
Results: Without CAD, the five radiologists had an average area under the ROC curve (Az) of 0.83 (range, 0.81–0.87). With CAD, the average Az increased significantly (P = .006) to 0.90 (range, 0.86–0.93). When a 2% likelihood of malignancy was used as the threshold for biopsy recommendation, the average sensitivity of radiologists increased from 96% to 98% with CAD, while the average specificity for this data set decreased from 22% to 19%. If a biopsy recommendation threshold could be chosen such that sensitivity would be maintained at 96%, specificity would increase to 45% with CAD.
Conclusion: Use of a computer algorithm may improve radiologists' accuracy in distinguishing malignant from benign breast masses on 3D US volumetric images.
Supplemental material: http://radiology.rsnajnls.org/cgi/content/full/2423051464/DC1
© RSNA, 2007
| INTRODUCTION |
|---|
Ultrasonography (US) is an important imaging modality in the characterization of breast masses. For differentiation of simple cysts from other lesions, interpretation of US images by experienced breast radiologists results in an accuracy close to 100% (7). In current clinical practice, if a palpable or mammographically suspicious mass cannot be confidently categorized as a cyst on US images, it is often recommended for biopsy. Several reports (8–10) have indicated that the improvement in US imaging technology and the interpretation of US images by experienced radiologists may make it possible to characterize solid breast masses as malignant or benign with a high level of accuracy.
Several groups of researchers have been developing methods for computerized characterization of masses on two-dimensional (2D) US images (11–14). We have developed an automated computer classifier for differentiation of malignant and benign breast masses on three-dimensional (3D) US volumetric images (15). Thus, the purpose of our study was to retrospectively investigate the effect of using the computer classifier we developed on radiologists' sensitivity and specificity for discriminating malignant masses from benign masses on 3D volumetric US images, with histologic analysis serving as the reference standard.
| MATERIALS AND METHODS |
|---|
The B-mode images were recorded in the memory buffer of the US scanner. After data acquisition, US images and position data were transferred digitally to the workstation, where individual planes were cropped and stacked to form a 3D volumetric image. The biopsy-proved mass on each image was identified by a Mammography Quality Standards Act-qualified radiologist (M.A.R., 8 years of experience in breast US imaging, referred to as radiologist 0 hereafter), who used clinical US and mammographic images to confirm that the 3D images contained the mass of interest and showed the mass in its entirety.
Computerized Classification of Masses in US Volumetric Data Sets
The first step of computerized analysis (15) involved extraction of the mass boundaries in the 3D volumetric data set (ie, mass segmentation). Automated segmentation of breast masses on US images is a difficult task because of image speckles, posterior shadowing, and variations of the gray level both within the mass and within the normal breast tissue. We developed a 3D active contour model for segmentation (Fig 2). The active contour model combined prior knowledge about the relative smoothness of the 3D mass shape on the US volumetric image with information in the image data.
Observer Performance Study
Five radiologists (M.A.H., C.P., J.B., A.V.N., C.B.), referred to as radiologists 1–5, had 3–26 years of experience in mammographic and breast US image interpretation. They were all Mammography Quality Standards Act qualified, and four were fellowship trained in breast imaging. In our department, about 4300 breast US examinations are performed annually.
An interactive graphical user interface facilitated navigation through the scanned 3D volumetric images of interest that contained the mass and allowed adjustment of the window and level settings of the displayed images. The location of the mass of interest, as determined by radiologist 0 with all available imaging and histologic findings, was marked on each section so that all radiologists would rank the same mass and ignore others if more than one mass could be seen on the volumetric image.
Observers first interpreted studies without CAD. This involved assessing the mass for shape, margins, echogenicity, cystic versus solid appearance, and through transmission, as well as estimating the likelihood of malignancy (LM) on a scale of 0% to 100%. For assessment of mass characteristics, the radiologists chose terms from a list of descriptors that were similar to but not exactly the same as those in the US Breast Imaging Reporting and Data System lexicon of the American College of Radiology, as this observer study was performed before the lexicon was published. For example, the descriptors for shape were "oval," "round," "lobulated," and "irregular," whereas the descriptors for margins were "circumscribed," "spiculated," "microlobulated," and "ill defined."
A button corresponding to an LM rating of 0% was provided for benign masses. Another button corresponding to LM ratings of less than 2% was provided for probably benign masses. This second button was set to correspond to Breast Imaging Reporting and Data System category 3 (ie, probably benign) lesions, for which short-interval follow-up is recommended (20). The radiologists used a slide bar to enter ratings between 2% and 100%. The discrete buttons allowed more precise selection of LM ratings for the benign and probably benign masses, because our previous experience indicates that the uncertainty of observers when selecting ratings on a slide bar can be much greater than 2%. The observers were reminded at the beginning of the study that rating a mass as having an LM of more than 2% would indicate that they would recommend the mass for biopsy (20,21). The assessment and the LM estimate were based on the findings in all of the volumetric images (stack of sections) that contained the mass.
We used a two-step sequential reading design, which was found to be a sensitive technique for assessing the difference between the two conditions in previous studies (6,22). Immediately after reading without CAD, the computer-estimated malignancy score for the study was displayed on the screen and the radiologist estimated the LM with CAD. The estimate of LM without CAD was stored in a computer file, and the radiologist was unable to modify it after seeing the computer-estimated score. The computer-estimated malignancy score was linearly mapped to an integer between 1 and 10 before the score was displayed on the graphical user interface. To provide radiologists with a reference of computer performance, the Gaussian distributions fitted to the computer scores for the malignant and benign lesions were also displayed on the interface. The radiologists could keep their original malignancy rating or change it by using the slide bar after they considered the computer-estimated score. The radiologists were not informed whether a mass was malignant or benign during or after the study, and the overall results of their assessment were not discussed with them before the study was completed.
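The linear mapping of the computer score onto the displayed 1-to-10 scale can be sketched as follows. This is a minimal illustration, not the study software: the function name, the score range arguments `lo` and `hi`, the clamping of out-of-range scores, and the round-half-up rule are all assumptions, since the text states only that the mapping was linear and produced an integer between 1 and 10.

```python
import math


def map_score_to_display(score: float, lo: float, hi: float) -> int:
    """Linearly map a classifier score in [lo, hi] to an integer from 1 to 10.

    `lo` and `hi` are hypothetical bounds of the raw classifier output.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Clamp so that out-of-range scores still land on the displayed scale.
    clamped = min(max(score, lo), hi)
    fraction = (clamped - lo) / (hi - lo)
    # fraction 0.0 maps to 1 and 1.0 maps to 10; halves are rounded up.
    return 1 + math.floor(fraction * 9 + 0.5)
```

A coarse integer scale of this kind conveys the relative level of the computer estimate without implying more precision than the classifier provides.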
There was no time limit for the radiologists to assign an LM rating. The radiologists were not informed of the proportion of malignant masses. The study reading order was randomized for each radiologist. To reduce fatigue, each radiologist read the data set in three separate sessions. The three sessions were separated by at least 2 days and at most 1 month. Before participating in the study, the radiologists were trained with five studies that were not part of the test set. They were familiarized with the study design, the functions on the graphical user interface, and the relative malignancy rating scale of the computer during the training session.
The data set used in this investigation was also used in an earlier study to develop the CAD technique (15). Three radiologists in the current investigation had already assigned an LM score for these masses without use of CAD in our earlier study, which had a different experimental design and involved use of a different graphical user interface. (Radiologists 1, 2, and 3 in the current study were referred to as radiologists 3, 4, and 2, respectively, in the earlier study.) The reading sessions in the past and current studies were separated by at least 6 months. The radiologists were not informed whether a mass was malignant or benign during or after the earlier study. The accuracies of these three radiologists in assigning LM scores without CAD in these two studies were compared.
Data and Statistical Analyses
There is no reference standard for mass characteristics since they are judged subjectively by radiologists. Thus, a majority assessment (ie, the mode) for each characteristic was determined according to majority rule by the six radiologists (radiologists 0–5). For example, if one radiologist described the echogenicity characteristics of a mass as hypoechoic, three described them as markedly hypoechoic, one described them as anechoic, and one described them as heterogeneous, the majority assessment for echogenicity of the mass would be markedly hypoechoic. When there was a tie between two descriptors, we used the descriptor chosen by radiologist 0, who was very familiar with the cases owing to her role in data collection, as the tie breaker. If there was a tie and the original descriptor provided by radiologist 0 was not one of the descriptors that were tied, radiologist 0 was asked to re-read the images and choose one of the tied descriptors. An alternative to the majority rule for summarizing the central tendencies is to use the mean of each descriptor. In this study, we chose to use the mode because we were interested in how each mass could be characterized and in the overall central tendency.
The LM ratings of the radiologists with and without CAD were evaluated with receiver operating characteristic (ROC) curve analysis (23,24). The area under the ROC curve (Az) and the partial area index above a sensitivity of 0.9 (25) were used as measures of accuracy. For an individual radiologist, the significance of the change in accuracy with CAD was also analyzed with the ROC method. For the group of five radiologists, the significance of the change in accuracy with CAD was tested with the Dorfman-Berbaum-Metz multireader multicase method (26) and the Student two-tailed paired t test (Microsoft Excel, version 2002; Microsoft, Redmond, Wash). The Dorfman-Berbaum-Metz method (http://xray.bsd.uchicago.edu/krlbp/KRL_ROC/) is normally the preferred method used to analyze the Az values for multireader multicase data because it accounts for both reader and case variances, whereas the t test does not account for case variance in calculation of the P value. Therefore, conclusions drawn from the t test can be generalized to the population of readers but not to the population of cases. The t test was applied to the evaluation of the partial area index. For this task, we are unaware of any available software that can account for both reader and case variances.
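The majority rule with its tie-breaking step can be illustrated with a short sketch. The function name and the dictionary layout (radiologist number mapped to chosen descriptor) are hypothetical; returning `None` stands in for the case in which radiologist 0 would be asked to re-read the images.

```python
from collections import Counter
from typing import Dict, Optional


def majority_descriptor(ratings: Dict[int, str]) -> Optional[str]:
    """Return the modal descriptor among radiologists 0-5.

    Ties are resolved in favor of the descriptor chosen by radiologist 0.
    If radiologist 0's descriptor is not among the tied ones, None is
    returned to signal that a re-read by radiologist 0 is needed.
    """
    counts = Counter(ratings.values())
    top = max(counts.values())
    tied = [descriptor for descriptor, n in counts.items() if n == top]
    if len(tied) == 1:
        return tied[0]
    if ratings.get(0) in tied:
        return ratings[0]
    return None


# The echogenicity example from the text: one "hypoechoic", three
# "markedly hypoechoic", one "anechoic", one "heterogeneous".
example = {
    0: "markedly hypoechoic",
    1: "hypoechoic",
    2: "markedly hypoechoic",
    3: "markedly hypoechoic",
    4: "anechoic",
    5: "heterogeneous",
}
print(majority_descriptor(example))
```

Which radiologist chose which descriptor in the example is invented for illustration; only the counts come from the text.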
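To make the two group-level quantities concrete, the sketch below computes an area under the ROC curve for one reader and a paired t statistic across readers. It is an illustration under stated assumptions: the Az values in the study came from dedicated ROC software, whereas this sketch substitutes the simpler nonparametric (Mann-Whitney) area estimate, and the function names are hypothetical. It does not reproduce the Dorfman-Berbaum-Metz analysis.

```python
import math
import statistics


def empirical_auc(malignant_ratings, benign_ratings):
    """Nonparametric area under the ROC curve (Mann-Whitney estimate).

    Counts the fraction of malignant/benign pairs in which the malignant
    mass received the higher LM rating; ties count as one half.
    """
    wins = 0.0
    for m in malignant_ratings:
        for b in benign_ratings:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malignant_ratings) * len(benign_ratings))


def paired_t_statistic(without_cad, with_cad):
    """Paired t statistic for per-reader accuracy values (df = n - 1)."""
    diffs = [b - a for a, b in zip(without_cad, with_cad)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

Because the paired t statistic treats each reader's accuracy value as a single observation, it reflects reader-to-reader variability only, which is why its conclusions generalize to readers but not to cases.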
The sensitivity and specificity of each radiologist with and without CAD were compared by using an LM rating of 2% as the threshold above which biopsy would be recommended (20,21). The radiologists in our study were familiar with Breast Imaging Reporting and Data System recommendations and were well aware that selecting an LM of more than 2% would be the equivalent of declaring that the mass was suspicious enough to warrant biopsy. If the radiologist intended to indicate an LM of less than 2%, he or she selected one of two graphical user interface buttons designated benign and less than 2% LM. The buttons were clearly labeled "benign" and "probably benign."
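The threshold rule described above translates directly into a computation of sensitivity and specificity. In this sketch the function name is hypothetical, and encoding the "benign" and "probably benign" buttons as the numeric ratings 0 and 1 is an assumption made only so that they fall at or below the 2% threshold.

```python
def sensitivity_specificity(lm_ratings, is_malignant, threshold=2.0):
    """Sensitivity and specificity when biopsy is recommended for LM > threshold."""
    tp = fn = tn = fp = 0
    for lm, malignant in zip(lm_ratings, is_malignant):
        recommend_biopsy = lm > threshold
        if malignant and recommend_biopsy:
            tp += 1
        elif malignant:
            fn += 1
        elif recommend_biopsy:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 0 and 1 stand for the two discrete buttons (assumed encoding).
ratings = [0, 1, 50, 90, 1]
truth = [False, False, False, True, True]
sens, spec = sensitivity_specificity(ratings, truth)
```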
In addition to testing an LM rating of 2%, we also tested a hypothetical biopsy threshold of LM with CAD. This hypothetical threshold was chosen to maintain the average sensitivity of the radiologists at the same level as that without CAD. We could then evaluate the change in specificity if the sensitivity was maintained before and after use of CAD.
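One way to locate such a hypothetical threshold is to scan candidate LM values and keep the highest one at which sensitivity with CAD is still at least the target. The sketch below assumes this simple search over the observed ratings; the function name is hypothetical, and the search strategy is an illustration rather than the procedure used in the study.

```python
def highest_threshold_for_sensitivity(lm_ratings, is_malignant, target_sensitivity):
    """Largest observed LM value t such that the fraction of malignant
    masses with LM > t is at least target_sensitivity.

    Returns None if no observed value meets the target.
    """
    malignant = [lm for lm, mal in zip(lm_ratings, is_malignant) if mal]
    best = None
    # Sensitivity can only fall as the threshold rises, so the last
    # candidate that meets the target is the largest acceptable one.
    for t in sorted(set(lm_ratings)):
        sensitivity = sum(lm > t for lm in malignant) / len(malignant)
        if sensitivity >= target_sensitivity:
            best = t
    return best


# Toy ratings with CAD: four malignant and four benign masses.
toy_ratings = [10, 30, 80, 90, 0, 1, 5, 20]
toy_truth = [True, True, True, True, False, False, False, False]
```

Raising the threshold in this way trades nothing in sensitivity relative to the target while moving more benign masses below the biopsy cutoff, which is how the specificity gain is evaluated.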
To investigate whether the change in sensitivity with CAD was statistically significant for a given radiologist, we used the McNemar test (WinStat, version 2005.1; R. Fitch Software, Lehigh Valley, Pa) and considered the number of beneficial and detrimental changes in biopsy recommendation for malignant masses with CAD. If a malignant mass was not recommended for biopsy without CAD but was recommended for biopsy with CAD, this was defined as a beneficial change. If a malignant mass was recommended for biopsy without CAD but was not recommended for biopsy with CAD, this was defined as a detrimental change. We similarly applied the McNemar test to benign masses to investigate whether the change in specificity with CAD was statistically significant.
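The McNemar test depends only on the discordant pairs, that is, the counts of beneficial and detrimental changes. The sketch below uses the chi-square form of the statistic without continuity correction; whether the statistical package used in the study applies this exact variant is an assumption, and the function name is hypothetical.

```python
import math


def mcnemar_p_value(beneficial: int, detrimental: int) -> float:
    """Two-sided McNemar P value from the discordant counts.

    Uses the chi-square statistic (b - c)^2 / (b + c) with 1 degree of
    freedom and no continuity correction (an assumed variant).
    """
    n = beneficial + detrimental
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of change
    statistic = (beneficial - detrimental) ** 2 / n
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2))
```

With few discordant pairs the chi-square approximation is rough, and an exact binomial version of the test would be a reasonable alternative.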
In addition to analyzing the change in the number of masses for which the LM rating increased above (or decreased below) the biopsy threshold of 2% with use of CAD, we also examined the number of masses for which CAD resulted in a substantial change in the LM rating. We defined a substantial change as an absolute value difference of greater than or equal to five between LM ratings with and without CAD. The substantial decreases and increases in the ratings of malignant and benign masses were examined. For each mass, we also averaged the changes in the LM ratings by the five radiologists and evaluated how CAD changes the average LM ratings for malignant and benign masses with one-sample t tests.
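Counting substantial changes amounts to comparing paired ratings against the five-point criterion. The sketch below shows this for one radiologist; the function name and return layout are hypothetical.

```python
def count_substantial_changes(without_cad, with_cad, min_change=5):
    """Count substantial increases and decreases in LM rating with CAD.

    A change is substantial when the absolute difference between the
    ratings with and without CAD is at least min_change points.
    Returns (increases, decreases, unchanged_or_minor).
    """
    increases = decreases = minor = 0
    for before, after in zip(without_cad, with_cad):
        delta = after - before
        if delta >= min_change:
            increases += 1
        elif delta <= -min_change:
            decreases += 1
        else:
            minor += 1
    return increases, decreases, minor
```

Applied separately to malignant and benign masses, substantial increases are desirable for the former and substantial decreases for the latter.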
When an observer experiment is performed to investigate the effect of CAD on radiologists' decisions in a laboratory environment, there may be a concern that the radiologists may rely too heavily on the CAD system without adequately merging the computer output with their own judgment. To investigate whether this is the case, we estimated the correlation between the radiologists' readings with CAD and (a) their readings without CAD and (b) the computer scores. We then estimated the statistical significance of the difference between these two correlation coefficients by using the method described by Cohen and Cohen (27). If radiologists use the computer scores only when they believe that it makes a true contribution to their original assessment, then the correlation between the radiologists' readings with CAD and their readings without CAD should be significantly higher than the correlation between the radiologists' readings with CAD and the computer scores.
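The comparison of two correlations that share one variable (the readings with CAD) can be sketched with Hotelling's t statistic for dependent correlations. That this is the exact formula applied in the study is an assumption; the text identifies only the source (27). The function names are hypothetical, and the statistic is referred to a t distribution with n - 3 degrees of freedom.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def hotelling_t(r_with_without, r_with_computer, r_without_computer, n):
    """Hotelling's t for the difference between two dependent correlations.

    r_with_without:     readings with CAD vs readings without CAD
    r_with_computer:    readings with CAD vs computer scores
    r_without_computer: readings without CAD vs computer scores
    n:                  number of masses
    """
    # Determinant of the 3 x 3 correlation matrix of the three variables.
    det = (1 - r_with_without ** 2 - r_with_computer ** 2
           - r_without_computer ** 2
           + 2 * r_with_without * r_with_computer * r_without_computer)
    return (r_with_without - r_with_computer) * math.sqrt(
        (n - 3) * (1 + r_without_computer) / (2 * det)
    )
```

A large positive statistic would indicate that the readings with CAD track the radiologist's own initial readings more closely than they track the computer scores.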
| RESULTS |
|---|
ROC Analysis
The Az values of the radiologists ranged from 0.81 to 0.87 without CAD and from 0.86 to 0.93 with CAD (Table 1). Radiologist 4 had the largest Az value change when reading with CAD: The Az value for this radiologist was 0.82 without CAD and 0.93 with CAD. The improvement in Az values was statistically significant for four of five radiologists.
The average partial area index improved significantly (P = .017) from 0.30 to 0.44 (Table 2). Improvement in the Az and partial area index values was statistically significant (P < .01), even when radiologist 4, who showed the largest improvement with CAD, was excluded from the analysis.
Sensitivity and Specificity
On average, radiologist sensitivity increased from 96% to 98% with CAD; however, specificity decreased from 22% to 19% (Table 3). Sensitivity of three radiologists increased, while two radiologists maintained a sensitivity of 100%. The specificity of three radiologists decreased with CAD, the specificity of one radiologist increased, and the specificity of another did not change. Changes in sensitivity and specificity were not statistically significant for any radiologist (range of P values with the McNemar test, .157 to > .99 for sensitivity and .102 to > .99 for specificity). If the LM threshold were adjusted to 7% with CAD, the average sensitivity would remain at 96% (same as that without CAD) and the average specificity would increase to 45%. Under this condition, the improvement in specificity for four of five radiologists was statistically significant (P < .003, McNemar test), while the change in sensitivity for each radiologist was not significant (Table 3).
| DISCUSSION |
|---|
During our observer experiment, 95 (94%) of the 101 masses were classified as solid according to majority rule. When analysis was limited to the subset of solid masses, the Az values derived with and without CAD and the significance of the improvement with CAD were essentially unchanged when compared with the results for the entire data set of 101 masses. This indicates that CAD would be helpful even if we only considered the interpretation of the more difficult category of solid masses.
The effect of CAD was mixed when measured in terms of the radiologists' sensitivity and specificity at the current threshold of biopsy recommendation (LM of 2%). With CAD, the average sensitivity of the five radiologists increased from 96% to 98%, while their average specificity for this data set decreased from 22% to 19%. The significant improvement in the ROC curves strongly suggests that these changes do not reflect only a shift in decision threshold along the same ROC curve. Although the changes in sensitivity and specificity were not statistically significant because of the relatively small data set available in this study, these observations indicate a promising trend that may be achieved with CAD.
Since the cost of failing to perform a biopsy for a malignant lesion is much greater than the cost of performing a biopsy for a benign lesion, it can logically be expected that radiologists may use the CAD system to confirm and increase their LM estimate for malignant lesions but not to decrease their LM estimate for low-suspicion lesions. This will result in an overall increase in radiologists' LM ratings, as observed in our study. While the ratings for malignant masses demonstrated a strong tendency to increase with CAD, the ratings for benign masses did not show a strong trend either way. These results led to an increase in sensitivity and a decrease in specificity. However, since the ROC curves of all radiologists improved with CAD, there is a chance that radiologists can adjust their decision thresholds along the higher ROC curves and thus increase both their sensitivity and their specificity. Alternatively, it may be possible to convince them to reduce the LM ratings of very-low-suspicion masses, as indicated by the CAD system, and thus improve the specificity. These improvements may be realized after radiologists accumulate experience and increase their confidence in the use of CAD.
Horsch et al (28) found that the accuracy of both expert mammographers and community radiologists improved significantly when they read 2D US images with CAD. Our study design differs from that used by Horsch et al (28) in that 3D US images were used, but our results reinforce their finding that experienced radiologists can benefit from reading US images with CAD.
The radiologists were not informed of the prevalence of cancer in the data set. However, they probably assumed that the prevalence of the disease was higher than that in the diagnostic population in clinical practice because most laboratory ROC studies are designed to have an approximately equal number of positive and negative cases in order to increase the statistical power for the same total number of cases read (23). Gur et al (29) found that no significant effects could be measured for prevalence in the range of 2%–28% in laboratory ROC experiments. It is not known if their findings could be extended to a prevalence of nearly 50%. On the other hand, since ROC studies are usually performed to measure the relative performances of two modalities instead of their absolute performances in the patient population at large, the prevalence effects should be comparable for both modalities and would be unlikely to change the relative performances, as assumed in most laboratory ROC studies.
Our observations indicate that the radiologists were not overly reliant on computer ratings in this study. First, they did not change their LM rating substantially (ie, a change of five or more points on the 100-point scale) with CAD in 64% of the readings. Second, correlation analysis revealed that the LM ratings assigned by a radiologist with and without CAD were highly correlated, whereas the correlation between the computer scores and the radiologists' LM ratings with CAD was significantly lower for four readers. Third, before all the readings were completed, the radiologists did not receive any feedback regarding whether the computer rating was more accurate than their rating. Thus, they had no way to know that their accuracy would improve by simply following the computer rating.
Our study had a number of limitations. First, our data set consisted of only masses that were recommended for core-needle biopsy, surgical biopsy, or fine-needle aspiration biopsy. It is therefore important to investigate the performance of the CAD system in the evaluation of masses that are not recommended for biopsy. Second, all studies in our data set were obtained with the same US machine; the CAD system needs to be evaluated with images acquired with different US imaging systems. Third, all the observers in our study were experienced in the interpretation of mammograms and US images; thus, the effects of CAD on less experienced radiologists were not studied. Fourth, the classifier in our CAD system was trained and tested by using a leave-one-case-out method, and the segmentation method was optimized by using a small subset of the data set. Although the leave-one-case-out resampling method is known to be a nearly unbiased classifier design method (19), the performance of our CAD system needs to be evaluated by using independent test sets to ensure the generalizability of our approach. Fifth, radiologists generally combine information from US images with information from mammograms to reach a diagnosis; however, we used only information from US images. Sixth, the components of retrospective ROC studies cannot emulate many factors that exist in clinical practice, such as the psychologic effects of the liability of misdiagnosing a malignant lesion.
We conclude that use of a well-trained computer algorithm may improve radiologists' accuracy in distinguishing malignant from benign breast masses on 3D US volumetric images.
| FOOTNOTES |
|---|
Abbreviations: Az = area under the ROC curve; CAD = computer-aided diagnosis; LM = likelihood of malignancy; ROC = receiver operating characteristic; 3D = three-dimensional; 2D = two-dimensional
Authors stated no financial relationship to disclose.
Author contributions: Guarantors of integrity of entire study, B.S., H.P.C.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, B.S., H.P.C., M.A.R., L.M.H.; clinical studies, M.A.R., C.P.; experimental studies, B.S., M.A.R., M.A.H., C.P., J.B., A.V.N., C.B.; statistical analysis, B.S., H.P.C., L.M.H.; and manuscript editing, B.S., H.P.C., M.A.R., M.A.H.