Breast Imaging |
1 From American College of Radiology Imaging Network, American Radiology Services, Johns Hopkins Green Spring, 10755 Falls Rd, Lutherville, MD 21093 (W.A.B.); Center for Statistical Sciences, Brown University, Providence, RI (J.D.B., J.B.C.); Department of Radiology, Northwestern University School of Medicine, Chicago, Ill (E.B.M.); and Department of Medical Physics, University of Wisconsin, Madison, Wis (E.L.M.). Received June 25, 2005; revision requested August 16; revision received September 1; final version accepted September 21. Supported by grants from the National Cancer Institute (CA89008) and the Avon Foundation. Address correspondence to W.A.B. (e-mail: wendieberg{at}hotmail.com).
| ABSTRACT |
|---|
|
|
|---|
Materials and Methods: National Cancer Institute Cancer Experimental Therapeutic Protocol review and ACRIN internal institutional review board approved the protocol; potential investigators were informed of the study purpose prior to participation. Six equivalent anthropomorphic phantoms were prepared with 17 masses (2–10 mm in mean diameter) in different locations at different depths. Sixty-six investigators, experienced in breast US, from 23 institutions scanned a phantom with high-frequency linear-array transducers (12–5 MHz). Lesion location, diameters, echogenicity, shape, and posterior features were recorded. Reader-specific phantom maps were generated and compared with known lesion locations and features. Results from 64 observers could be analyzed and were masked to investigator identity. Agreement on US features was measured with κ statistics. A generalized linear model generated log relative risks for detection rates as a function of lesion diameter, depth, and features.
Results: Of 17 lesions, a median of 14 (82%) were detected (range, 9–16), and 86% of observers detected at least 12 lesions. Of 1088 potential detections, 861 (79.1%) were made. Among 5–10-mm lesions, 499 (97.5%) of 512 detections were made (excluding a 6-mm "skin" lesion seen by only seven observers [11%]). One 4-mm mass was seen by 53 observers (83%). Among 3-mm lesions, 274 (71.4%) of 384 detections were made. One 2-mm lesion was seen by 28 (44%) observers. Relative risk of detection decreased to 0.55 (95% confidence interval: 0.51, 0.59) for each centimeter increase in lesion depth. Agreement was slight for lesion shape (κ = 0.14), substantial for echogenicity (κ = 0.61), and moderate for posterior features (κ = 0.45). Feature description errors were common for 2–4-mm lesions; only 33% of 3-mm anechoic masses were so characterized. Among eight 6–10-mm lesions, investigators erred in feature description of a median of 1 lesion (mean, 1.3; range, 0–4).
Conclusion: US detection and description of lesions in a breast phantom were highly consistent for lesions 5–10 mm in diameter; those smaller than 5 mm were less reliably identified or characterized by experienced investigators.
© RSNA, 2006
| INTRODUCTION |
|---|
|
|
|---|
Supplemental screening ultrasonography (US) has been shown to depict small invasive cancers not seen at mammography in dense breast tissue. Across multiple single-center trials totaling 42 838 examinations (4,8–12), 150 (0.35%) cancers were identified only at US in 126 women with average risk. The detection benefit was nearly all observed in heterogeneously dense or extremely dense breast tissue; over 90% of the women with cancers seen only at US had mammographic tissue density in those categories. Of the 150 cancers, 94% were invasive, with a mean size of 9–11 mm across the series (4,8–12). When staging was detailed, over 90% were node-negative. The generalizability of supplemental screening US is now the subject of a multicenter randomized trial, the American College of Radiology Imaging Network (ACRIN) protocol 6666 (13).
To participate as investigators in ACRIN protocol 6666, radiologist investigators were required to successfully complete several qualification tasks, including phantom scanning, training in the Breast Imaging Reporting and Data System (BI-RADS) for US (14), and interpretation of proved sets of US and mammographic images. Thus, the purpose of our study was to prospectively evaluate US lesion detection and characterization in a breast phantom by potential investigators in a screening US protocol, ACRIN 6666.
| MATERIALS AND METHODS |
|---|
|
|
|---|
|
Investigators were asked to report which phantom was scanned, lesion location (x and y alphanumeric coordinates and depth from the surface) to the nearest 0.5 cm, size (in all three perpendicular planes) to the nearest 0.1 mm, shape (either oval-round or irregular), echogenicity (anechoic, hyperechoic, complex cystic, isoechoic, hypoechoic), and posterior features (none, enhancement, shadowing, combined shadowing and enhancement). A nonradiologist assistant was allowed to record data while the investigator scanned the phantom. The data collection form provided space to detail 16 lesions.
Investigators were encouraged to identify as many lesions as possible. While there was no specific time constraint, the majority of investigators performed this task as part of a 2-day training session in the protocol at Northwestern University Medical Foundation (Chicago, Ill), and spending more than 1 hour performing this task was not realistic given the overall timing of the course.
Results were analyzed and masked to investigator identity according to a protocol approved by National Cancer Institute Cancer Experimental Therapeutics Protocol review and the ACRIN internal institutional review board. Potential investigators were informed of the purpose of the study prior to consenting to participate, and individuals had the option to not attempt to qualify as investigators for the ACRIN 6666 protocol. Phantom maps were created (J.B.C.) by using S-Plus 6.2 (Insightful, Seattle, Wash), with an overlay of the investigator's results and the known location, mean size, echogenicity, and posterior features of lesions in that phantom (Fig 1). Additional features were available in an Excel (Microsoft, Redmond, Wash) database and were occasionally needed to match lesions.
|
Two of 66 investigators submitted identical data sets; both were discarded. Thus, our results relate to 64 investigators. Those investigators who did not qualify at the initial attempt were asked to rescan the phantom. We examined errors in detection for all lesions as a function of size, depth, and features. We evaluated feature description by investigators (echogenicity, shape, or mean size incorrect by more than 1 mm) for lesions 6 mm in size or larger against the designed characteristics of each lesion. Inaccuracies in lesion location were also analyzed.
Statistical Methods
Data were analyzed at the Center for Statistical Sciences at Brown University (J.D.B. and J.B.C.), which serves as the biostatistics center for all ACRIN trials. Data were prospectively cleaned and monitored in a collaborative effort with ACRIN data management staff located at the American College of Radiology (Philadelphia, Pa). Statistical software SAS (version 8.0; SAS Institute, Cary, NC) and Stata (version 7.0; Stata, College Station, Tex) were used to process the data and facilitate statistical analyses. Initially, summary tables and simple frequencies were used to explore the data and check for outliers.
A primary objective was to estimate the reliability with which readers could identify lesion characteristics such as shape, size, depth, echogenicity, and posterior features. We tabulated the percentage of readers who correctly identified shape, echogenicity, and posterior features, and assessed their reliability by using the well-known κ statistic and generalized κ statistic (18), both of which account for agreement due to chance. We present an "average reader-specific κ" for agreement with the reference standard, which is simply the average of each of the individual reader's κ values against the reference standard for the feature of interest. We accounted for the natural clustering of outcomes by using Huber-White robust standard errors or by including a random effect in our regression models.
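The "average reader-specific κ" described above can be sketched as follows. This is a minimal illustration that assumes the ordinary two-rater (Cohen) form of κ, with each reader compared against the reference standard; the generalized κ statistic (18) is not shown. The function names and the label lists are hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(reader, reference):
    """Cohen's kappa between one reader's labels and the reference labels."""
    if len(reader) != len(reference) or not reader:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(reader)
    # Observed agreement: fraction of lesions labeled identically.
    observed = sum(r == t for r, t in zip(reader, reference)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    reader_freq = Counter(reader)
    reference_freq = Counter(reference)
    expected = sum(reader_freq[c] * reference_freq[c] for c in reader_freq) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


def average_reader_kappa(readers, reference):
    """Mean of the individual readers' kappa values against the reference."""
    values = [cohen_kappa(labels, reference) for labels in readers]
    return sum(values) / len(values)


# Hypothetical echogenicity labels for four lesions (not study data).
reference = ["anechoic", "anechoic", "hypoechoic", "anechoic"]
reader_1 = ["anechoic", "anechoic", "hypoechoic", "hypoechoic"]
reader_2 = list(reference)

print(cohen_kappa(reader_1, reference))
print(average_reader_kappa([reader_1, reader_2], reference))
```

Averaging the per-reader values, rather than pooling all labels first, keeps each reader's own marginal label frequencies in the chance-agreement term.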
For continuous features, such as size and depth, we report the average error in determination of lesion size and depth; a repeated measures regression model for size and depth was constructed to assess this measurement error. In addition, we modeled the detection rate as a function of these lesion characteristics by using a generalized linear model with binomial errors and log link function. The coefficients in this model are log relative risks, as opposed to the more standard log odds ratios obtained from logistic regressions. The advantage of our approach is that we can directly assess the relative risks, which are the true quantity of interest. (Odds ratios only approximate the relative risk.) This model was fit by using an iteratively reweighted least-squares algorithm designed to minimize the deviance function. Huber-White standard errors were used to account for clustering within a phantom.
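To illustrate why relative risks and odds ratios diverge when the outcome is common, the sketch below computes both from crude detection counts reported in the Results (53 of 64 observers for the 4-mm mass and 28 of 64 for the 2-mm lesion). These are unadjusted two-group ratios for illustration only; they are not the model-based estimates, and the function names are hypothetical. With a log link, log p = β0 + β1x, so exp(β1) is itself a ratio of detection probabilities.

```python
def relative_risk(events_a, n_a, events_b, n_b):
    """Ratio of event probabilities, group A relative to group B."""
    return (events_a / n_a) / (events_b / n_b)


def odds_ratio(events_a, n_a, events_b, n_b):
    """Ratio of event odds, group A relative to group B."""
    odds_a = events_a / (n_a - events_a)
    odds_b = events_b / (n_b - events_b)
    return odds_a / odds_b


# Crude counts taken from the Results: detections out of 64 observers.
rr = relative_risk(53, 64, 28, 64)  # 4-mm mass vs 2-mm lesion
or_ = odds_ratio(53, 64, 28, 64)

print(f"crude relative risk: {rr:.2f}")
print(f"crude odds ratio:    {or_:.2f}")
```

Because detection is a common outcome here, the odds ratio is several times larger than the relative risk, which is the situation in which a log link is preferable to a logit link for reporting.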
We also sought to identify reader characteristics, such as experience, that were associated with poor or good performance. Readers were asked how many years they had practiced breast imaging (<2, 2–5, 6–10, or >10 years), what percentage of time they spent in breast imaging (by quartiles), who performs breast imaging in their practices (themselves, resident or fellow and then themselves, technologist, or technologist and then themselves), number of mammograms read per week, number of breast US examinations performed per week, and whether they were currently performing whole-breast US for screening, only for extent of disease, or neither. Regression models included these variables to assess their effect. Some analyses were also stratified by experience variables to check for trends.
| RESULTS |
|---|
|
|
|---|
|
The rate of lesion detection decreased as lesion diameter decreased and depth increased (Tables 1, 3). Of 512 potential observations of lesions 5–10 mm in diameter for all readers, 499 observations (97.5%) were made, excluding a 6-mm "skin" lesion that was seen by only seven (11%) of 64 observers (Fig 2). A 4-mm mass was seen by 53 (83%) of 64 observers. Of 384 potential observations of 3-mm lesions, 274 (71.4%) were made (Figs 3, 4), as were 28 (44%) of 64 potential observations of a 2-mm lesion (Fig 4). The rate of detection for the 4-mm lesion was significantly lower than rates for lesions of 5 mm or larger (P < .001). Those lesions that were 3 mm in diameter tended to be less likely to be detected than the lesion that was 4 mm in diameter (P = .053), and the 2-mm lesion was less likely to be detected than were the 3-mm lesions (P < .001). A relative risk regression model indicated that for each millimeter increase in lesion diameter from 2 to 5 mm, the relative risk of detection was 1.46 (95% confidence interval: 1.40, 1.54) (Table 3).
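The detection counts above can be cross-checked with a short tally. In this sketch the number of lesions per size group is inferred by dividing each reported denominator by the 64 observers (for example, 512 / 64 = 8); that grouping and the variable names are assumptions made for illustration.

```python
N_READERS = 64

# (group label, lesions in group, detections reported in the text)
groups = [
    ("5-10 mm parenchymal", 8, 499),
    ("6 mm skin lesion", 1, 7),
    ("4 mm", 1, 53),
    ("3 mm", 6, 274),
    ("2 mm", 1, 28),
]

total_detected = 0
total_possible = 0
for label, n_lesions, detected in groups:
    possible = n_lesions * N_READERS  # every reader could detect every lesion
    total_detected += detected
    total_possible += possible
    print(f"{label:22s} {detected:4d}/{possible:4d} = {100 * detected / possible:.1f}%")

overall_pct = 100 * total_detected / total_possible
print(f"overall                {total_detected}/{total_possible} = {overall_pct:.1f}%")
```

The group totals sum to 861 of 1088 potential detections, which matches the overall figure given in the abstract.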
|
|
|
|
Hyperechoic lesions were the easiest to detect, with a relative risk of detection of 7.13 (95% confidence interval: 3.0, 17.0) (Table 3). In modeling, hypoechoic lesions were easier to identify than were anechoic lesions, with a relative risk of detection of 2.14 (95% confidence interval: 1.04, 4.42) (Table 3, Fig 4), although empirically hypoechoic lesion 5 was less often identified than anechoic lesion 15, which was of the same diameter and at the same depth (Table 1). Lesions with posterior enhancement or, to a lesser degree, posterior shadowing, were more likely to be detected than were those with no posterior features (Table 3).
Most investigators reported neither false-positive findings (median, 0; mean, 0.4; range, 0–5) nor identification of the same lesion more than once ("duplicates," Fig 1; median, 0; mean, 0.6; range, 0–3). There were 47 (5%) duplicate lesion descriptions among 936 reported lesions.
Feature Analysis
Feature analysis errors were common for lesions 2–4 mm in diameter, with only 33% of 3-mm anechoic masses so characterized. For eight lesions 6–10 mm in diameter, investigators erred in the description of features for a median of 1 lesion (mean, 1.3; range, 0–4).
Though we dichotomized description of lesion shape (round-oval or irregular), agreement of observers with the true lesion shape was only slightly better than expected by chance, with an average κ of 0.14 (Table 4). Agreement was substantial for echogenicity, with an average κ of 0.61 (Table 4).
|
|
Agreement was moderate for posterior features, with an average κ of 0.45 (Table 4). Use of spatial compounding was elective, and, when used, it decreased the conspicuity of posterior features (Fig 6).
|
|
Measurement of lesion diameters was generally highly accurate. The mean difference between actual largest diameter and measured largest diameter ranged from 0.3 to 0.7 mm for all "parenchymal" lesions, with absolute errors ranging from 1.0 to 4.1 mm and percentage errors ranging from 5% to 28%. Not surprisingly, the greatest percentage error in measurement of lesion diameter occurred for the smallest lesions (ie, those lesions 2–4 mm in diameter). For the 6-mm "skin" lesion, the mean difference in measured diameter compared with known diameter was 1.4 mm.
Overall Performance
Of a possible score of 17, the mean score was 12.8 (standard error, 0.3), with a range of 7–16. Of the 64 investigators, 12 (19%) failed to achieve a score of 11.5 or better on the first attempt.
| DISCUSSION |
|---|
|
|
|---|
Those radiologists who interpret at least 300 mammograms per week performed better, on average, than did those interpreting fewer mammograms. This is consistent with other observations of improved performance in breast imaging with greater specialization (20,21). The apparently improved performance of the two radiologists whose technologists perform the US scanning is not beyond that expected from random variability and is unlikely to be generalizable; indeed, those same investigators had significantly worse performance in interpreting breast US images in a separate qualifying task (22).
For screening US to be of practical value, cancers smaller than 10 mm in size must be reliably depicted. In the multicenter Radiation Oncology Diagnosis Group V trial (23,24), which was conducted between 1994 and 1996, 561 (76.1%) of 737 nonpalpable breast masses undergoing biopsy were detectable at US by using 7.5-MHz transducers, including 33 (79%) of 42 masses smaller than 10 mm in size (Radiation Oncology Diagnosis Group V, unpublished data, 2001). This is a considerable improvement over the 25% sensitivity of US for masses undergoing biopsy seen in a series in the early 1980s (25) and over the detection of only 8% of malignant lesions smaller than 10 mm in the early work of Sickles et al (26). In our series in a breast US phantom, by using higher-frequency transducers with an average center frequency of 11 MHz, 97.5% of "parenchymal" lesions 5–10 mm in diameter were detected across multiple observers.
Except for the extreme case of the "skin lesion," superficial lesions were more reliably detected and characterized than were deeper lesions. We did not control for field of view or focal zone settings used by investigators, and each of these can affect lesion and feature conspicuity. Broad bandwidth transducers use lower frequencies at greater depths, which results in decreased resolution for deeper lesions. The phantom is 5 cm thick when the retromammary fat pad is included, and the 3-mm "cyst" at 4 cm depth was particularly difficult for readers to identify. In practice, in the supine position two-thirds of breasts are less than 3 cm thick and 86% are less than 4 cm thick (ACRIN 6666, unpublished observations). The results from this phantom study suggest that performance of US may be diminished in large breasts, although clinical validation is warranted and is in progress (current protocols, ACRIN 6666; available at: www.acrin.org).
One superficial "cyst" (lesion 17) was placed in a corner of each phantom at the "skin" surface, and it was difficult to depict both because of its location near the edge of the phantom and because of the presence of reverberation artifact within it. Even with current 10–14-MHz transducers, the most superficial 7 mm of the breast is not optimally evaluated, as the beam cannot be focused more superficially. A glob of gel or standoff pad is still needed for optimal characterization of superficial masses.
In logistic regression modeling, we found hypoechoic and isoechoic lesions were predicted to be easier to detect than anechoic lesions. In practice, such lesions can mimic fat lobules and go clinically undetected. As discussed below, the 3-mm lesions created as anechoic often appeared to contain minimal low-level echoes and were more often interpreted as hypoechoic; this likely falsely improved the apparent detection of lesions deemed "hypoechoic."
The presence of posterior shadowing or especially enhancement greatly facilitated lesion detection. In clinical practice, most cysts will demonstrate posterior enhancement, and this feature dominated our modeling for predicting successful detection compared with hypoechogenicity. Insofar as spatial compounding reduces visibility of posterior features (27,28), our results suggest that lesion detection may be improved if spatial compounding is "off" while surveying the breast, as in screening US. In this study, investigators could use spatial compounding if desired, but its use was not specifically recorded so that its effect could not be determined. When spatial compounding is processed line by line as frequency compounding, rather than frame by frame as true spatial compounding, posterior features may still be apparent, although such processing was not available at the time when our study was performed.
Lesion characterization was more accurate for lesions 5 mm and larger than for smaller lesions. Anechoic lesions mimicking simple cysts smaller than 5 mm were not accurately characterized by experienced investigators, and only 33% of 3-mm anechoic lesions were so recognized. In clinical practice, this may result in a need for excessive rates of follow-up or aspiration of incidental small cysts seen at US. Routine use of tissue harmonic imaging may help distinguish cystic from solid lesions (27,29), although this technique was not available on the equipment used in this study. Spatial compounding may facilitate improved characterization of margins and internal structure once a lesion is detected (27,28,30), although it is unlikely to facilitate proper determination of echogenicity.
Successful distinction of oval circumscribed masses from irregular masses and/or masses with margins that are not circumscribed is critical in appropriate patient care. Most incidental circumscribed oval or gently lobulated solid masses with no posterior features or minimal enhancement may be followed (31,32), whereas irregular masses require intervention. In the phantom, lesions were designed to be either round (n = 16) or irregular (n = 1). In practice, round lesions are more concerning than oval parallel lesions (14,31). As with the recognition of a lesion as anechoic, distinction of a mass as round from one that was irregular was not reliable for the one 4-mm irregular lesion in this phantom, and this again suggests inaccurate management may be more common for small lesions (<5 mm) in clinical practice.
The successful performance of US in both lesion detection and lesion characterization across the majority of the 64 observers in this phantom study is encouraging, though these idealized circumstances may overestimate reliability in clinical practice. Bosch et al (33) found high interexamination agreement in both detection and classification of lesions across three observers independently performing real-time whole-breast US in 58 patients and 113 breasts. The κ values were 0.72–0.75 between pairs of observers, which indicates excellent reliability (33). Of importance, κ values exceeded those for mammography across the same observers in the same patients (33).
The phantoms were constructed in a standard fashion, although we did identify a few false-positive hyperechoic lesions in one phantom, which may have confused participating investigators and contributed to premature satisfaction of search for a few observers. One superficial lesion (lesion 8) was designed to be hypoechoic, although it was closer to isoechoic in some of the phantoms, so that reader agreement on echogenicity was not included for that lesion.
One of the challenging aspects of this reader qualification task was the multiplicity of lesions. The systematic recording of at least 12 lesions, including location, size, and features, is laborious and may result in reader fatigue. We found a 5% rate of duplicate lesion reporting. The investigators were experienced in the performance of breast US and interpretation of breast US images, had knowledge that their performances were being evaluated based on lesion detection, and had no specific time constraint. Despite this, there were two readers who detected as few as nine (53%) of 17 lesions. In clinical practice, multiple bilateral lesions may be difficult to accurately report and follow with freehand US, although further study of this issue is warranted.
Despite variability in lesion detection, lesion diameter was reliably recorded for all lesions. This is reassuring when contemplating short-interval follow-up of probably benign lesions seen only at US. The minimal variability in phantom lesion measurements may underestimate variability in clinical practice, however, as neither the phantoms nor individual lesions were compressible. Of importance, and not unexpected, measurement of diameter was more accurate as a percentage of the total diameter for lesions 5 mm and larger. In ACRIN protocol 6666, an increase of 20% or more in lesion volume is considered a true increase (www.acrin.org). Such a calculation may be invalid with lesions 4 mm in diameter or smaller, if the error in each diameter exceeds 20% as seen in this study for such small lesions, though further validation is needed in clinical practice.
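The sensitivity of a volume-based growth criterion to diameter error can be illustrated with a short calculation. This sketch assumes volume is estimated with the ellipsoid formula V = (π/6)·d1·d2·d3, which is an assumption made here and is not stated in the text; the function name and the 4-mm example lesion are likewise illustrative.

```python
import math


def ellipsoid_volume(d1, d2, d3):
    """Volume of an ellipsoid from three perpendicular diameters (assumed formula)."""
    return math.pi / 6.0 * d1 * d2 * d3


true_d = 4.0      # mm, a small lesion of the size discussed above
rel_error = 0.20  # 20% overestimate in each of the three diameters

true_volume = ellipsoid_volume(true_d, true_d, true_d)
measured_volume = ellipsoid_volume(*(true_d * (1 + rel_error),) * 3)
volume_ratio = measured_volume / true_volume

# Per-diameter change that by itself yields a 20% change in volume.
diameter_factor_for_20pct = 1.20 ** (1.0 / 3.0)

print(f"apparent volume increase from 20% diameter error: {100 * (volume_ratio - 1):.1f}%")
print(f"diameter change equivalent to 20% volume change:  {100 * (diameter_factor_for_20pct - 1):.1f}%")
print(f"that change for a 4-mm lesion: {true_d * (diameter_factor_for_20pct - 1):.2f} mm")
```

Under this assumption, a uniform 20% overestimate of each diameter inflates the computed volume by roughly 73%, whereas a 20% volume change corresponds to only about a 6% change in each diameter, or about 0.25 mm for a 4-mm lesion.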
Performance of experienced, trained investigators participating in this study may exceed that in routine practice. Sickles et al (20) reported that specialists in breast imaging detected 6 cancers per 1000 screening mammography examinations compared with 3.4 cancers per 1000 detected by general radiologists. This improved performance was seen together with a lower recall rate among the specialists (20). Linver et al (34), and, more recently, Berg et al (35), showed that mammographic interpretive skills could be improved through training. The training materials developed for the ACRIN 6666 investigators are available on request.
There are several other limitations to our study. Investigators could spend an hour scanning the phantom, which is unrealistic in usual clinical practice. Investigators knew that the phantoms had a number of lesions, and they could have accessed a key that described details of how many lesions and of what types were present: Detection rates are likely overestimated, as US detection is facilitated by knowledge that a lesion is present (as with second-look US following magnetic resonance imaging). Consistency of reporting lesion location, diameter, and depth is likely overestimated because the lesions in these phantoms are fixed, and the phantom itself is not compressible. In this phantom, only one irregular lesion was included. While not a goal of our series or this task, it may be desirable to develop a similar phantom that emphasizes subtle distinction of lesion margins as circumscribed or not or of shape as irregular or oval in order to measure reader performance in distinguishing lesions that require biopsy from those that can be followed.
In summary, 97.5% of parenchymal lesions 5–10 mm in diameter were detected in a breast US phantom; lesions smaller than 5 mm were less consistently identified and were not accurately characterized by experienced investigators. We anticipate similar results in clinical studies of breast US, and validation of these results is ongoing.
| ADVANCES IN KNOWLEDGE |
|---|
|
|
|---|
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Abbreviations: ACRIN = American College of Radiology Imaging Network BI-RADS = Breast Imaging Reporting and Data System
See also the article by Madsen et al in this issue.
Author contributions: Guarantors of integrity of entire study, W.A.B., J.D.B., E.B.M.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, W.A.B.; experimental studies, W.A.B., E.B.M., E.L.M.; statistical analysis, W.A.B., J.D.B., J.B.C.; and manuscript editing, all authors
Authors stated no financial relationship to disclose.
| References |
|---|
|
|
|---|