Published online before print June 11, 2007, 10.1148/radiol.2442060712
(Radiology 2007;244:390-398.)
© RSNA, 2007
Breast Mass Lesions: Computer-aided Diagnosis Models with Mammographic and Sonographic Descriptors1
Jonathan L. Jesneck, PhD,
Joseph Y. Lo, PhD, and
Jay A. Baker, MD
1 From the Department of Biomedical Engineering (J.L.J., J.Y.L.) and Duke Advanced Imaging Labs, Department of Radiology (J.L.J., J.Y.L., J.A.B.), Duke University Medical Center, 2424 Erwin Rd, Suite 302, Durham, NC 27705. Received April 23, 2006; revision requested June 23; revision received July 24; accepted August 29; final version accepted November 15. Supported by U.S. Army Breast Cancer Research Program W81XWH-05-1-0292 and DAMD17-02-1-0373, and NIH/NCI R01 CA95061 and R21 CA93461.
Address correspondence to J.L.J. (e-mail: jonathan.jesneck{at}duke.edu).
ABSTRACT
Purpose: To retrospectively develop and evaluate computer-aided diagnosis (CAD) models that include both mammographic and sonographic descriptors.
Materials and Methods: Institutional review board approval was obtained for this HIPAA-compliant study. A waiver of informed consent was obtained. Mammographic and sonographic examinations were performed in 737 patients (age range, 17–87 years), which yielded 803 breast mass lesions (296 malignant, 507 benign). Radiologist-interpreted features from mammograms and sonograms were used as input features for linear discriminant analysis (LDA) and artificial neural network (ANN) models to differentiate benign from malignant lesions. An LDA with all the features was compared with an LDA with only stepwise-selected features. Classification performances were quantified by using receiver operating characteristic (ROC) analysis and were evaluated in a train, validate, and retest scheme. On the retest set, both LDAs were compared with radiologist assessment score of malignancy.
Results: Both the LDA and ANN achieved high classification performance with cross validation (for the LDA, area under the ROC curve [Az] = 0.92 ± 0.01 [standard deviation] and partial area at sensitivities above 90% [0.90Az] = 0.54 ± 0.08; for the ANN, Az = 0.92 ± 0.01 and 0.90Az = 0.55 ± 0.08). Results of both models generalized well to the retest set, with no significant performance differences between the validate and retest sets (P > .1). On the retest set, there were no significant performance differences between LDA with all features and LDA with only the stepwise-selected features (P > .3) or between either LDA and radiologist assessment score (P > .2).
Conclusion: Results showed that combining mammographic and sonographic descriptors in a CAD model can result in high classification and generalization performance. On the retest set, LDA performance matched radiologist classification performance.
INTRODUCTION
Because of low specificity at mammography, many women undergo unnecessary breast biopsy. As many as 65%–85% of breast biopsies are performed in benign lesions (1–3). Unnecessary biopsy not only increases the cost of mammographic screening (4) but also subjects patients to avoidable emotional and physical burdens.
To improve the accuracy of mammography, computer aids have become available to help radiologists detect (5–8) and diagnose (9–12) suspicious breast lesions. Some study results (13,14) have shown that use of such computer-aided diagnosis (CAD) systems has increased overall diagnostic sensitivity and specificity. Lesions determined to be very likely benign may be recommended for short-term follow-up rather than biopsy (13,14).
CAD models often involve breast morphologic descriptors of the Breast Imaging Reporting and Data System (BI-RADS) lexicon. BI-RADS was developed by the American College of Radiology to standardize the interpretation of mammograms (15–17). Originally, BI-RADS was applied to only mammography, but the crucial adjunct role of sonography has recently led the American College of Radiology to develop a BI-RADS lexicon for breast sonography as well (18). Sonographic BI-RADS is a useful tool to help standardize the characterization of sonographic lesions (18,19) and facilitate clinician communication.
Until recently, the primary clinical role for sonography has been to aid in distinguishing simple cysts from solid masses, as well as to direct aspirations, wire localizations, and biopsies. Several authors (20–24) have investigated the role of sonography in helping to differentiate malignant from benign breast lesions. There also have been many CAD studies (25–33) of breast sonography, which are based on image features automatically extracted by using computer vision algorithms. To the best of our knowledge, there has not yet been a published study with either the standardized BI-RADS sonographic findings as the basis of a predictive model or the combination of BI-RADS mammographic and sonographic findings for that purpose. Thus, the purpose of our study was to retrospectively develop and evaluate CAD models that involve both mammographic and sonographic descriptors.
MATERIALS AND METHODS
Lesions and Patients
Institutional review board approval was obtained for this Health Insurance Portability and Accountability Act–compliant study. A waiver of informed consent was obtained. The lesions used in this study were an extension of an original 403-lesion data set described in detail in a previous study (34). They were collected between 2000 and 2005 at our institution. The data set included 803 lesions, of which 296 were malignant and 507 were benign, and 389 were palpable and 414 were nonpalpable. There were 737 patients whose ages ranged from 17 to 87 years, with a median age of 50 years. The same inclusion and exclusion criteria as described previously (34) applied to this data set. Lesions were selected from those recommended for biopsy and were included in the study if the lesions corresponded to solid masses on sonograms and if both mammographic and sonographic images taken before the biopsy were available for review. Any complicated cysts were excluded from consideration. All cases were re-reviewed by one of four breast radiologists (including J.A.B.) who were blinded to the original report.
Features Used
All patients underwent both mammography and sonography. The mammographic examination consisted of both craniocaudal and mediolateral oblique views, with additional true lateral and spot compression magnification when clinically appropriate. Sonographic images were acquired in both radial and antiradial projections, with and without caliper measurements. Additional gray-scale images were obtained in almost all patients to better depict the lesion. Doppler, color Doppler, and power Doppler images were not part of the routine imaging protocol but were reviewed when available. One of four dedicated breast radiologists (including J.A.B.) with 6–11 years of experience used BI-RADS lexicon to describe the lesions, as described previously (34). Information about patient physical examination findings, family history of breast cancer, and personal history of breast malignancy was available to each radiologist to reproduce a realistic clinical situation. The radiologist was blinded to the histologic diagnosis during the evaluation.
Of the total 39 features, 13 were mammographic BI-RADS features, 13 were sonographic BI-RADS features, six were sonographic features suggested by Stavros et al (20), four were other sonographic features, and three were patient history features. The 13 mammographic BI-RADS features were mass size, parenchyma density, mass margin, mass shape, mass density, calcification number of particles, calcification distribution, calcification description, architectural distortion, associated findings, special cases (as defined by the BI-RADS lexicon: asymmetric tubular structure, intramammary lymph node, global asymmetry, and focal asymmetry), comparison with findings at prior examination, and change in mass size. The 13 sonographic BI-RADS features were radial diameter, antiradial diameter, anteroposterior diameter, background tissue echo texture, mass shape, mass orientation, mass margin, lesion boundary, echo pattern, posterior acoustic features, calcifications within mass, special cases (as defined by the BI-RADS lexicon: clustered microcysts, complicated cysts, mass in or on skin, foreign body, intramammary lymph node, and axillary lymph node), and vascularity. The six features suggested by Stavros et al (20) were mass shape, mass margin, acoustic transmission, thin echo pseudocapsule, mass echogenicity, and calcifications. The four other sonographic mass descriptors were edge shadow, cystic component, and two mammographic BI-RADS descriptors applied to sonography—mass shape (oval and lobulated are separate descriptors) and mass margin (replaces sonographic descriptor angular with obscured). The three patient history features were family history, patient age, and indication for sonography.
In addition to the BI-RADS and Stavros et al descriptors, the radiologists also recorded their assessment about the malignancy of the lesion as an integer ranging from 0 for unquestionably benign to 100 for unquestionably malignant. This assessment rating was not used as an input to the CAD models but rather as a comparison to the models' output for classification performance.
Predictive Modeling, Sampling, and Feature Selection
For models in this study, we (J.L.J. and J.Y.L. by consensus) used linear discriminant analysis (LDA) and artificial neural networks (ANNs). The LDA was a Fisher linear discriminant. The ANNs were three-layer (one hidden layer), feed-forward, error back-propagation models. These methods are among those most commonly used in previous studies by our group and by others in the field.
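As an illustration only, a Fisher linear discriminant can be sketched for two features as follows. This is a minimal pure-Python sketch on invented toy data, not the implementation used in the study; the function names and the feature values are hypothetical, and the study's models used many more input features.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def scatter_2d(rows, mean):
    """Within-class scatter matrix (2x2) for one class."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(benign, malignant):
    """Fisher discriminant direction w = Sw^-1 (m_malignant - m_benign)."""
    m0, m1 = mean_vector(benign), mean_vector(malignant)
    s0, s1 = scatter_2d(benign, m0), scatter_2d(malignant, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(w, x):
    """Scalar discriminant score; larger values point toward malignancy."""
    return w[0] * x[0] + w[1] * x[1]


# Hypothetical toy data: two numeric features per lesion.
benign = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
malignant = [(5.0, 6.0), (6.0, 5.0), (6.0, 7.0), (7.0, 6.0)]
w = fisher_direction(benign, malignant)
benign_scores = [project(w, x) for x in benign]
malignant_scores = [project(w, x) for x in malignant]
print(w, max(benign_scores), min(malignant_scores))
```

The projection reduces each lesion to a single score, which is what a decision threshold is later applied to.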
To assess the usefulness and risk of using CAD models in the clinic, it is crucial to have a good estimate of their performance in future cases (or generalization). For limited data and more complicated models, the traditional method of cross validation could still pose a danger of optimistically biasing the testing performance; it is common to optimize certain global parameters (such as feature selection for the LDA or number of hidden nodes of an ANN) to maximize cross-validation performance. With cross validation, one is able to use knowledge of all the data to make modeling decisions, whereas with generalization such information is not available for yet unseen future cases. Therefore, optimizing the models for cross-validation performance could lead to reduced generalization performance.
To avoid these overfitting pitfalls and to better estimate the generalization ability of each model, we used a train, validate, and retest scheme. In this scheme, the data set was divided into two sets: a train and validate set and a retest set. The retest set was not used until the models were finalized, so that it could not influence any part of the modeling process. All modeling decisions were made only on the train and validate set. The model parameters were optimized to maximize cross-validation performance on the train and validate set. Once the model's parameter values were set, the model was trained on the entire train and validate set. The trained model was then applied to the retest set.
In particular, for our data set of 803 lesions, we chose the first 500 lesions in chronologic order for the train and validate set and the remaining 303 lesions for the retest set. We chose architecture and parameter settings for the ANN to optimize its cross-validation performance on the train and validate set. Once the modeling decisions had been made, we trained the LDA and ANN on all the lesions in the train and validate set to determine a single, final set of weights, which were then applied to the retest set.
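The split described above can be sketched as follows. This is a hedged illustration, not the study's code: the lesion records, the `date` field (an ordinal stand-in for acquisition date), and the choice of five folds are all assumptions, since the text does not state the number of cross-validation folds.

```python
def chronological_split(lesions, n_train_validate=500):
    """Earliest lesions form the train and validate set; the rest are held out."""
    ordered = sorted(lesions, key=lambda les: les["date"])
    return ordered[:n_train_validate], ordered[n_train_validate:]


def k_fold_indices(n_items, k):
    """Yield (train_indices, held_out_indices) pairs for k-fold cross validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out


# Hypothetical records: 803 lesions, deliberately stored out of date order.
lesions = [{"lesion_id": i, "date": 803 - i} for i in range(803)]
train_validate, retest = chronological_split(lesions)

# Modeling decisions would be tuned only inside these folds (k=5 is assumed).
folds = list(k_fold_indices(len(train_validate), k=5))
print(len(train_validate), len(retest), len(folds))
```

The retest list is touched only once, after all modeling decisions have been fixed on `train_validate`.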
In addition to aiding model training and assessment, the train, validate, and retest scheme can also reduce bias in feature selection. Using this scheme, we investigated the effect of feature selection on the generalization performance of an LDA. Using only the train and validate set, we performed stepwise feature selection. We then used the selected features to train an LDA on the train and validate set and applied the trained LDA model to the retest set. Finally, on the retest set, we compared the generalization performance of the LDA with only the stepwise-selected features with that of the LDA with all the features.
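The text does not specify the stepwise criterion, so the following is only a sketch of a simplified forward variant: features are added greedily while a user-supplied score keeps improving. The `toy_score` function, its weights, and the `min_gain` stopping rule are hypothetical stand-ins; in practice the score would be something like cross-validated Az on the train and validate set.

```python
def forward_stepwise(candidates, score, min_gain=1e-3):
    """Greedy forward selection: add the best feature until the gain is too small."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials)
        if top_score - best < min_gain:
            break  # no remaining feature improves the score enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical per-feature contributions used only to make the sketch runnable.
TOY_WEIGHTS = {
    "mass_margin": 0.20,
    "patient_age": 0.15,
    "mass_shape": 0.05,
    "vascularity": 0.00,
}


def toy_score(features):
    """Stand-in for a cross-validated performance estimate."""
    return 0.5 + sum(TOY_WEIGHTS[f] for f in features)


selected, final_score = forward_stepwise(list(TOY_WEIGHTS), toy_score)
print(selected, round(final_score, 2))
```

Because selection sees only the train and validate data, the retest set remains an unbiased check on the reduced model.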
Classifier Performance Evaluation and Statistical Analysis
To use the LDA or ANN model as a diagnostic aid, one could select a threshold value, so that lesions with output values less than the threshold would be considered very likely benign and therefore candidates for follow-up rather than biopsy. Those lesions with model outputs greater than the threshold would be considered suspicious for malignancy and recommended for biopsy.
Varying the threshold value results in a trade-off between sensitivity and specificity. The entire range of sensitivity and specificity values for a classifier is illustrated by using the receiver operating characteristic (ROC) curve (35,36). To quantify a classifier's performance, we (J.L.J. and J.Y.L. by consensus) used the following five summary measures of the ROC curve: area under the ROC curve (Az), the partial area (0.90Az), and the specificity, positive predictive value, and negative predictive value for a given sensitivity level. Az represents the average specificity across all sensitivities and ranges from 0.5 (chance performance) to 1.0 (perfect performance). Because high sensitivity is essential for this classification task, a more relevant performance measure is 0.90Az, which represents the average specificity of the classifier at sensitivities from 90% to 100%.
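The two summary areas can be illustrated with an empirical (trapezoidal) ROC curve. This sketch is an assumption about one reasonable way to compute them, not the study's method: the text does not say whether Az was estimated empirically or from a fitted curve, and all function names here are hypothetical. The partial index follows the definition above, average specificity over sensitivities from 90% to 100%.

```python
def roc_points(scores, labels):
    """Empirical ROC operating points (FPF, TPF), from (0, 0) to (1, 1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Lesions with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def area_under_roc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def partial_area_index(points, min_tpf=0.90):
    """Average specificity over sensitivities from min_tpf to 1."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpf or y1 == y0:
            continue  # segment is below the band or has no extent in TPF
        lo = max(y0, min_tpf)
        # FPF where the segment reaches TPF = lo (linear interpolation).
        x_lo = x0 + (x1 - x0) * (lo - y0) / (y1 - y0)
        area += (y1 - lo) * ((1 - x_lo) + (1 - x1)) / 2
    return area / (1 - min_tpf)


# Toy checks: a perfect classifier and an uninformative one (all scores tied).
perfect = roc_points([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
chance = roc_points([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
print(area_under_roc(perfect), partial_area_index(perfect))
print(area_under_roc(chance), partial_area_index(chance))
```

Under this definition a chance-level classifier has a partial index near 0.05, which gives some context for reading values such as 0.54.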
Whereas the two previous measures provide an overall summary of performance, the remaining three are clinically relevant measures that correspond to a single threshold value, which for breast cancer applications is usually chosen to deliver nearly perfect sensitivity, such as 98% (37,38). Note that for this data set, the actual positive predictive value of the clinical decision to refer to biopsy was 37% (296 of 803), which is typical of our institution. Because our study included only biopsy-verified lesions, the clinical decision to perform biopsy had, by definition, a sensitivity of 100% and a specificity of 0% for cancer detection.
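The single-threshold measures can be sketched as follows. This is an illustrative sketch on invented scores, with hypothetical names; the rule used here (take the highest threshold whose sensitivity reaches the target) is an assumption about how an operating point might be chosen. The last lines reproduce the 37% biopsy positive predictive value from the lesion counts given in the text.

```python
def metrics_at_sensitivity(scores, labels, target_sensitivity=0.98):
    """Specificity, PPV, and NPV at the highest threshold reaching the target.

    Lesions with score >= threshold are called positive (biopsy recommended).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        if tp / n_pos >= target_sensitivity:
            tn, fn = n_neg - fp, n_pos - tp
            return {
                "threshold": threshold,
                "sensitivity": tp / n_pos,
                "specificity": tn / n_neg,
                "ppv": tp / (tp + fp),
                # NPV is undefined when no lesion is called negative.
                "npv": tn / (tn + fn) if tn + fn else None,
            }


# Hypothetical model outputs: four benign lesions, then three malignant ones.
scores = [0.05, 0.10, 0.20, 0.60, 0.40, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 1, 1]
result = metrics_at_sensitivity(scores, labels)
print(result)

# Biopsy PPV from the counts in the text: 296 malignant of 803 biopsied.
biopsy_ppv = 296 / 803
print(round(100 * biopsy_ppv))
```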
These classifier performance metrics allowed us to compare classifier performance statistically. We used the nonparametric bootstrap method (39) to measure the means and variances of the classification metric values, as well as to compare metric results of the two models for statistical significance. Although we assumed statistical independence of the lesions for modeling, 8% (66 of 803) of the BI-RADS data set included multiple lesions per patient. To adjust for clustering of data values, we used cross validation by patient, which ensured that no lesions from the same patient appeared in more than one of the train, validate, and retest sets. A P value of less than .05 was considered to indicate a significant difference.
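A nonparametric bootstrap estimate of the mean and standard deviation of Az might look like the following sketch. It resamples lesions with replacement, which treats lesions as independent as the text assumes for modeling; the data, the number of resamples, the seed, and the function names are hypothetical, and the authors' own bootstrap code is not reproduced here.

```python
import random
import statistics


def auc_mann_whitney(scores, labels):
    """Empirical AUC: fraction of malignant-benign pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(scores, labels, n_resamples=1000, seed=0):
    """Mean and standard deviation of AUC over bootstrap resamples of lesions."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to define AUC
            estimates.append(auc_mann_whitney([scores[i] for i in idx], ys))
    return statistics.mean(estimates), statistics.stdev(estimates)


# Hypothetical model outputs for five benign and five malignant lesions.
scores = [0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90, 0.55]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
point_estimate = auc_mann_whitney(scores, labels)
mean_auc, sd_auc = bootstrap_auc(scores, labels)
print(point_estimate, mean_auc, sd_auc)
```

To compare two models, the same resampled indices could be applied to both sets of scores and the distribution of the paired difference examined.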
RESULTS
Generalization between Validating and Retesting
The LDA achieved high classification performance, with Az = 0.92 ± 0.01 and 0.90Az = 0.54 ± 0.08 on the train and validate set and Az = 0.92 ± 0.02 and 0.90Az = 0.52 ± 0.08 on the retest set (Table 1). Results of the LDA generalized well; there were no significant differences between the performance metric results of the validate set and those of the retest set (P > .10). In addition to the entire ROC curves of the LDA performance, results with individual thresholds also generalized well. The same threshold value determined similar true-positive fraction (sensitivity) and false-positive fraction (1 – specificity) operating points in the high-sensitivity region on both ROC curves (Table 2).
The ANN also performed well, achieving Az = 0.92 ± 0.01 and 0.90Az = 0.55 ± 0.08 on the validate set and Az = 0.91 ± 0.02 and 0.90Az = 0.57 ± 0.06 on the retest set. The ANN performed comparably on the validate and retest sets, with no significant differences in either metric (P > .10).
Comparison of LDA and ANN Performance
The two types of models, LDA and ANN, had similar performances on both the validate and retest sets; the differences were not significant (P > .10). In the interest of brevity, tables with the results of ANN performance are not included in our study because of their close similarity to tables with LDA performance results. ROC curves for the LDA and ANN in both testing paradigms (Fig 1) showed that discrepancies among the curves were minor, and the curves overlap each other with essentially indistinguishable classification performance.

Figure 1: (a) Full ROC curves for classifier performance: validate set versus retest set. (b) Partial ROC curves for classifier performance: cross validation versus retest set. Results of LDA and ANN generalized well on retest data set, as shown by their overlapping ROC curves. Validation ROC curves (solid curves) lie close to retest ROC curves (dashed curves). LDA and ANN had virtually indistinguishable classification performances. FPF = false-positive fraction, TPF = true-positive fraction.
Feature Selection and Generalization of Simplified Model
Performance of stepwise feature selection for the LDA resulted in the following 14 features: patient age, calcification distribution, calcification description, associated findings, comparison with findings at prior examination, anteroposterior diameter, indication for sonography, Stavros et al mass shape, mammographic BI-RADS mass margin, edge shadow, cystic component, sonographic lesion boundary, surrounding tissue effects, and sonographic special findings. An LDA with only these stepwise-selected features performed comparably to the LDA with all the features, with no significant difference (Az = 0.92 ± 0.02 vs 0.91 ± 0.02, respectively; P > .3). A table with performance results of the LDA with stepwise-selected features was not included in our study because of its close similarity to the table with results of the LDA with all features.
Comparing LDA to Radiologist Assessment of Malignancy
Like the LDA, radiologist assessment also achieved high classification performance on the retest set (Table 3), with Az = 0.92 ± 0.02 and 0.90Az = 0.52 ± 0.06. There were no significant differences between any of the performance metric results of the LDA and those of radiologist assessment (P > .2). For example, on this retest data set, the LDA and radiologists performed with similar negative predictive values (97% ± 1 vs 98% ± 1, respectively; P = .25).
With regard to ROC curves for the LDA with all features, the LDA with the stepwise-selected features, and radiologist assessment of malignancy (Fig 2), there were no significant differences in any of the performance metric results among the three ROC curves (P > .2). Although the radiologist curve crossed the LDA curves several times, even at the points of greater divergence, the differences were not significant (P > .2). In a lesion in which the LDA and radiologist disagreed (Fig 3), the LDA correctly classified the lesion as benign.

Figure 2: (a) Full ROC curves: LDA versus radiologist on retest set. (b) Partial ROC curves: LDA versus radiologist on retest set. ROC curves for LDA with all features, for LDA with stepwise-selected features, and for radiologist assessment of malignancy. In retesting, LDA, both with all features and with only stepwise-selected features, performed similarly to radiologists. There were no significant differences in any performance metric results among the three ROC curves (P > .2). Although the radiologist curve crossed LDA curves several times, even at points of greater divergence, differences were not significant (P > .2). FPF = false-positive fraction, TPF = true-positive fraction.

Figure 3: (a) Mediolateral oblique mammogram in 26-year-old patient demonstrates ill-defined, oval-shaped, equal-density mass (arrow) in posterior left breast. Radiopaque marker immediately anterior to mass indicates that this mass was palpable. (b) Sonogram in same patient demonstrates oval, circumscribed mass (arrow) with parallel orientation and no posterior acoustic features. Histopathologic diagnosis indicated that this lesion was necrotic breast tissue. Follow-up examination findings confirmed no interval change 2 years after biopsy. LDA considered this lesion relatively benign, with a score of 0.33 of 1.00, whereas radiologist considered it more indicative of malignancy, with a score of 85 of 100.
Histograms of the LDA output and radiologist assessment values for the retest set (Fig 4) showed that the values for the benign lesions (such as in Fig 5) tended to be on the left side of the histogram plot with values around zero. Values for the malignant lesions (such as in Fig 6) were concentrated on the right side of the plots, around 1 for the LDA values and around 100 for radiologist assessment values. There were few values in the center regions compared with those on the extremes.

Figure 4: Histograms of (a) LDA output values and (b) radiologist assessment. Histogram counts for truly benign lesions are shown in gray, and those for truly malignant lesions are shown in black. For classification, a threshold would be applied to LDA output, so that output values below the threshold would be designated benign and those above it would be designated malignant.

Figure 5: (a) Mediolateral oblique mammogram in 52-year-old patient demonstrates oval, well-circumscribed, equal-density mass (arrow) in superior left breast. (b) Sonogram in same patient demonstrates oval, hypoechoic solid mass (arrow) with circumscribed margins, parallel orientation, and posterior acoustic shadowing. Histopathologic results indicated benign fibroadenoma. Both LDA and radiologist correctly considered this lesion very benign, with scores of 0.02 of 1.00 and 0 of 100, respectively.

Figure 6: (a) Mediolateral oblique mammogram in 57-year-old patient demonstrates ill-defined, irregularly shaped, equal-density mass (arrow) in superior right breast. (b) Sonogram in same patient demonstrates ill-defined, irregularly shaped mass (arrow) with posterior acoustic shadowing and without parallel orientation. Histopathologic diagnosis indicated that this malignant lesion was invasive ductal carcinoma. Both LDA and radiologist correctly considered this lesion very malignant, with scores of 0.99 of 1.00 and 95 of 100, respectively.
ROC curves for generalization performance (Fig 2) suggest that radiologists may be able to improve their performance considerably by shifting to a more desirable operating point on the ROC curve. For example, by adjusting their mental threshold to lower their sensitivity slightly to 98%, they may perform at 52% specificity, 60% positive predictive value, and 98% negative predictive value. This shift would have delayed the diagnosis of 2% of cancers, which might then be identified from interval change at a short-term follow-up diagnostic study. Likewise, if the radiologists were hypothetically to adopt all the recommendations of the computer model, they could perhaps have attained 37% specificity, 53% positive predictive value, and 97% negative predictive value at that same 98% sensitivity level.
DISCUSSION
To the best of our knowledge, our study is the first CAD study not only to use sonographic BI-RADS features but also to combine BI-RADS features of sonography with those of mammography. In addition, to justify the clinical use of a CAD system on new patients, it is important to estimate its generalization performance. We have estimated the generalization performance of both LDA and ANN models on our data set by using a train, validate, and retest scheme. There was good evidence of generalization for the LDA and ANN because there was no significant decrease in performance from the validation curves to the retest curves.
The LDA and ANN had virtually indistinguishable classification performance, which suggested that the relationship between the BI-RADS features and malignancy was largely linear. In general, such results would support the use of the LDA model, which is simpler than the nonlinear ANN and therefore less likely to be susceptible to overtraining problems. Our study results, however, demonstrated that there were no problems with overtraining, as both models performed similarly during the retesting phase.
Because CAD systems typically give as output a range of values, applying a certain threshold to the output determines the operating point (sensitivity and specificity settings) at which the clinical decision is made. Knowing the CAD operating point helps the clinician incorporate it into an overall diagnostic decision. We have shown that results with LDA thresholds from the validation ROC curve generalized well to the retest ROC curve in the clinically important high-sensitivity region, which suggests that these threshold values could be used clinically with the LDA on future lesions.
Because the task of collecting many features can be cumbersome, we investigated CAD performance with only a subset of the features by performing stepwise feature selection. Of the 14 selected features, three also had been found to have high malignancy predictive value in a previous study (34): Stavros et al mass shape, mammographic mass margin, and sonographic lesion boundary. To ensure that the selected features were adequate to allow the CAD system to achieve good generalization on new lesions, a train, validate, and retest scheme was required. Only the train and validate set was used to select the features, which were then tested in a CAD model on the retest set. LDA with only the 14 stepwise-selected features performed just as well as an LDA with all 39 features. The small number of features required for good performance suggests that this CAD model may be able to offer the benefit of having a second reader without greatly slowing workflow. Similar feature definitions caused some features to be collinear. Although data collinearity can potentially bias the selected model to be optimistic, the rigorous use of cross validation followed by completely independent retesting demonstrated that the results did generalize without optimistic bias.
LDA results distinguished benign from malignant lesions no differently than did radiologist assessment scores for our data set. The generalization performance results suggest that radiologists may be able to achieve a more desirable operating point on their ROC curve by adjusting their mental threshold to have slightly lower sensitivity but much higher specificity. If the radiologists were to adopt all the recommendations of the computer model, they could substantially increase specificity while maintaining a high sensitivity level.
The radiologists in this study were experienced dedicated breast imagers. It is hoped that less-specialized radiologists using such a system could improve their diagnostic performance closer to that of breast specialists. In practice, it remains to be determined how radiologists would use the results from such computer models, in particular whether they would modify a recommendation for biopsy to a recommendation for short-term follow-up in those lesions deemed to be very likely benign. It also remains unknown whether the 2% of cancers mistakenly referred to follow-up would prove to remain early stage, such as with the current clinical practice of following up probably benign lesions.
There were limitations to our study. The BI-RADS data collection included multiple lesions per patient for 66 of 803 lesions. Our study criteria included solid masses rather than cysts and the use of only biopsy-proved lesions. Additionally, radiologists allowed mammograms to influence their recording of the sonographic features, because they analyzed mammograms immediately before sonograms. The study was organized in this manner to better reflect actual clinical practice, in which the mammogram is obtained immediately prior to the sonogram and decisions are made by using all available data. The radiologists also could have shifted their diagnostic sensitivity and specificity levels from their usual clinical levels because they were aware that the lesion diagnoses had been resolved and that their assessment ratings therefore did not directly affect patient care.
In conclusion, the results of model classification and generalization performance on our data set suggest that the models could be used as a CAD system for future mass lesions. Because the results with LDA threshold values generalized well, the desired operating point on the ROC curve could be set in advance for future lesions, which increases the usefulness of the CAD system. Because the stepwise-selected features were adequate for good classification and generalization, they could be used in a CAD system that would require only minimal feature collection. In our study, we were not trying to improve the diagnostic accuracy of dedicated breast imagers but rather to offer radiologists a tool that allows a substantial decrease in the number of unnecessary benign breast biopsies while minimizing the number of delayed breast cancer diagnoses.
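The idea of fixing an operating point in advance and carrying it forward to future lesions can be illustrated with a short sketch. This is not the code used in the study. The scores below are synthetic stand-ins for LDA outputs, the function names are hypothetical, and the 98% sensitivity target is an assumption chosen only to mirror the 2% of cancers mentioned above. The threshold is chosen on one set (standing in for the validation set) and then applied unchanged to a second set (standing in for the retest set).

```python
import math
import random


def threshold_for_sensitivity(scores, labels, target_sensitivity):
    """Highest threshold whose sensitivity on this set meets the target.

    A lesion is called malignant when its score is >= the threshold.
    Labels are 1 for malignant and 0 for benign.
    """
    malignant = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not malignant:
        raise ValueError("need at least one malignant lesion")
    # Number of malignant lesions that must be called positive.
    needed = max(1, math.ceil(target_sensitivity * len(malignant)))
    return malignant[needed - 1]


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity of the rule 'score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def simulate_scores(n_benign, n_malignant, rng):
    """Synthetic classifier outputs (assumption: two unit-variance Gaussians)."""
    scores = [rng.gauss(0.0, 1.0) for _ in range(n_benign)]
    scores += [rng.gauss(2.5, 1.0) for _ in range(n_malignant)]
    labels = [0] * n_benign + [1] * n_malignant
    return scores, labels


rng = random.Random(0)
val_scores, val_labels = simulate_scores(300, 200, rng)   # stand-in for validation set
new_scores, new_labels = simulate_scores(300, 200, rng)   # stand-in for retest set

# Choose the threshold once, on the validation stand-in only.
threshold = threshold_for_sensitivity(val_scores, val_labels, 0.98)

val_sens, val_spec = sensitivity_specificity(val_scores, val_labels, threshold)
new_sens, new_spec = sensitivity_specificity(new_scores, new_labels, threshold)

print(f"threshold = {threshold:.3f}")
print(f"validation: sensitivity = {val_sens:.3f}, specificity = {val_spec:.3f}")
print(f"held out:   sensitivity = {new_sens:.3f}, specificity = {new_spec:.3f}")
```

If the sensitivity and specificity on the held-out set stay close to the validation values, the threshold can reasonably be fixed before new lesions arrive. That is the sense in which a generalizing threshold makes such a system more useful.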
ADVANCES IN KNOWLEDGE
- Both mammographic and sonographic Breast Imaging Reporting and Data System descriptors are useful in a computer-aided diagnosis (CAD) system for differentiating malignant from benign breast masses with high performance (area under the receiver operating characteristic curve, 0.92 ± 0.02).
- Results with this multimodal CAD system generalized well to new lesions, an important step for the consideration of incorporating a CAD system into clinical use.
ACKNOWLEDGMENTS
We thank David DeLong, PhD, for help with statistical analysis, Brian Harrawood, BA, for the ROC bootstrap code, Carey Floyd, Jr, PhD (deceased), and Georgia Tourassi, PhD, for insightful discussions, and Andrea Hong, MD, and Priscilla Chyn, MD, for data collection.
FOOTNOTES
Abbreviations: ANN = artificial neural network; Az = area under the ROC curve; BI-RADS = Breast Imaging Reporting and Data System; CAD = computer-aided diagnosis; LDA = linear discriminant analysis; ROC = receiver operating characteristic
Authors stated no financial relationship to disclose.
Author contributions: Guarantors of integrity of entire study, J.L.J., J.Y.L.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, J.L.J.; clinical studies, J.A.B.; statistical analysis, J.L.J., J.Y.L.; and manuscript editing, all authors.