Stanford University, San Francisco, California, United States
No financial relationships with ineligible companies to disclose
Robert Fairchild1, Diane Mar2, Mariani Deluna3, Matthew Baker4, Suzanne Tamang5, Henry Guo3, David Fiorentino4 and Lorinda Chung6, 1Stanford University, San Francisco, CA, 2VA Palo Alto / Stanford, Palo Alto, CA, 3Stanford University, Palo Alto, CA, 4Stanford University, Menlo Park, CA, 5Stanford University, Atherton, CA, 6Stanford University, Woodside, CA

Background/Purpose: Interstitial lung disease (ILD) is a serious complication of SSc and inflammatory myopathy (IM), necessitating accurate and early detection for improved outcomes. Lung ultrasound (LUS) is a potential alternative to computed tomography (CT) for ILD screening. We previously developed interpretation criteria for LUS ILD detection (the LUS-ILD-20 and LUS-ILD-24 criteria), which showed high accuracy. Our current study aimed to test whether convolutional neural networks (CNNs) could accurately detect ILD and its severity, and to assess their agreement with, and added benefit to, human interpretation.
Methods: All patients meeting ACR criteria for SSc and randomly selected patients meeting ACR criteria for IM enrolled in our prior LUS studies who underwent LUS cine imaging of 14 lung zones and CT imaging were included (140 patients) (Table 1). For each lung zone, two images were extracted and preprocessed. Data were separated into a training set, labelled as ILD(+/-) using LUS-ILD-24 (1418 images), and an unlabeled test set (1878 images). Several CNN architectures (InceptionV1, VGG-16, and ResNet50) were trained using transfer learning and fine-tuning techniques. We also developed a novel and more efficient CNN (LUS-Net) (Figure 1). CNN performance for ILD detection was evaluated using the test set on: 1) individual lung zone images, and 2) per patient, using summed prediction scores across lung zones. Sensitivity, specificity, receiver operating characteristic (ROC) curves, and Cohen's kappa were assessed. Spearman correlations between each CNN's output and measures of severity were evaluated, including CT lung pathology quantification using Computer-Aided Lung Informatics for Pathology Evaluation and Ratings (CALIPER), pulmonary function testing (PFT) indices, and LUS-ILD-24 severity.
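The transfer-learning and fine-tuning step described above can be sketched as follows. This is a minimal illustration in PyTorch using a small toy backbone as a stand-in for a pretrained feature extractor (the study used InceptionV1, VGG-16, and ResNet50); layer sizes and hyperparameters are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained convolutional backbone. In the study,
# full-scale architectures (InceptionV1, VGG-16, ResNet50) were used;
# this small network is purely illustrative.
backbone = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Transfer learning: freeze the pretrained feature extractor...
for p in backbone.parameters():
    p.requires_grad = False

# ...and attach a new trainable head for the binary ILD(+)/ILD(-) task.
head = nn.Linear(16, 2)
model = nn.Sequential(backbone, head)

# Fine-tuning updates only the head's parameters.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 1, 64, 64)   # batch of 4 grayscale LUS zone images
y = torch.tensor([0, 1, 0, 1])  # hypothetical ILD(-)/ILD(+) labels
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

In practice, deeper layers of the backbone are often unfrozen in a second fine-tuning pass once the new head has converged.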
Results: ILD(+) per CT in the test set of 64 patients was detected by the CNNs with sensitivities and specificities ranging from 93% to 97% and 62% to 91%, respectively (Table 2). Human LUS-ILD-24 showed 97% sensitivity and 85% specificity. Area under the ROC curve for the CNNs ranged from 0.93 to 0.96. LUS-ILD-24 and the CNNs' study interpretations showed substantial agreement (κ = 0.63-0.80). VGG-16 and LUS-Net showed the best CNN ILD screening performance. VGG-16, LUS-Net, and LUS-ILD-24 all returned a false negative on the same single case; CT in this patient showed trace focal reticulation. Significant correlations were observed between CNNs and LUS-ILD-24 severity (very strong), PFT indices (FVC: weak; DLCO: moderate), and CALIPER CT quantification (strong). Combining human and CNN interpretation did not result in improved accuracy.
Conclusion: We trained several state-of-the-art CNNs and a novel CNN on a large LUS database in SSc and IM, which showed excellent performance in ILD detection. VGG-16 and the smaller novel LUS-Net performed comparably to human interpretation. Combining human and CNN interpretation provided no additional benefit; however, in real-world use this may vary based on the experience of the human interpreter. Overall, CNN interpretation alone or in combination with human reading may provide the increased confidence and reproducibility needed for LUS to replace CT in ILD screening in select populations.
Images are manually segmented to focus on the region of interest, adjustments to rotation, brightness, and contrast are made, and images are labelled as positive or negative. To train the CNN, image inputs are processed through a series of convolution operations which extract increasingly complex features. Interposed pooling layers reduce data dimensionality and processing load. After these layers, the data output is fed into fully connected "dense" neural network layers and to an output classifier which produces a probability distribution over the binary classes, ILD(+) and ILD(-). The final output is then checked against the input label (supervised learning), and the model iteratively adjusts over the entire dataset until sufficient classification accuracy is achieved.
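The pipeline in this legend (convolutions, interposed pooling, dense layers, softmax output) can be sketched as a minimal PyTorch model. Layer counts, channel widths, and the 64x64 input size are illustrative assumptions; this is not the LUS-Net architecture.

```python
import torch
import torch.nn as nn

class TinyLUSCNN(nn.Module):
    """Illustrative binary classifier following the legend's pipeline:
    convolutions extract increasingly complex features, pooling layers
    reduce dimensionality, and dense layers feed a softmax producing a
    probability distribution over ILD(+) and ILD(-)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),   # simple features
            nn.MaxPool2d(2),                            # downsample 64 -> 32
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),  # complex features
            nn.MaxPool2d(2),                            # downsample 32 -> 16
        )
        self.classifier = nn.Sequential(                # "dense" layers
            nn.Flatten(),
            nn.Linear(16 * 16 * 16, 32), nn.ReLU(),
            nn.Linear(32, 2),                           # ILD(-), ILD(+)
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)  # probability distribution

model = TinyLUSCNN()
probs = model(torch.randn(1, 1, 64, 64))  # one preprocessed LUS image
```

During supervised training, these probabilities would be compared against the human-assigned ILD(+/-) label and the weights adjusted by backpropagation, as the legend describes.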
*Inflammatory myopathy (IM) patients were not included in the model training and validation set to allow for a sufficient IM sample size in the test set to assess performance. ILD = interstitial lung disease; SSc = systemic sclerosis; LUS = lung ultrasound; CT = computed tomography. CT(+) and CT(-) refer to interstitial lung disease detection on CT imaging at the time of lung ultrasound.
* Convolutional neural networks used summed Softmax scores for all lung zones per patient to classify each patient as either ILD(+) or ILD(-) at the patient level. Softmax cutoff values maximizing sensitivity were 12, 10, 14, and 13 for the InceptionV1, VGG-16, ResNet50, and LUS-Net architectures, respectively. †Human interpretation was by majority of 3-reader consensus using the revised 2024 lung ultrasound interpretation criteria (LUS-ILD-24); ‡Spearman correlations, all values significant at p < 0.05. ILD = interstitial lung disease; AUC = area under the curve; CALIPER = Computer-Aided Lung Informatics for Pathology Evaluation and Ratings (artificial intelligence quantification of computed tomography ILD findings); DLCO = diffusion capacity of the lungs for carbon monoxide; FVC = forced vital capacity; PFTs = pulmonary function tests.
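The patient-level rule in this footnote can be sketched in plain Python: the per-zone softmax ILD(+) scores are summed across all 14 lung zones and compared against an architecture-specific cutoff (10 for VGG-16 per the footnote). The zone scores below are made-up values for illustration only.

```python
def classify_patient(zone_scores, cutoff):
    """Sum the per-zone softmax ILD(+) scores (one per lung zone, each
    in [0, 1]) and call the patient ILD(+) if the sum meets the cutoff."""
    total = sum(zone_scores)
    label = "ILD(+)" if total >= cutoff else "ILD(-)"
    return label, total

# Hypothetical softmax ILD(+) scores for one patient's 14 lung zones.
scores = [0.95, 0.88, 0.91, 0.97, 0.85, 0.92, 0.89,
          0.94, 0.90, 0.86, 0.93, 0.96, 0.87, 0.91]

label, total = classify_patient(scores, cutoff=10)  # VGG-16 cutoff
```

With 14 zones, a cutoff of 10 effectively requires a high average per-zone ILD(+) probability before a patient is labeled positive, which is consistent with the footnote's sensitivity-maximizing thresholds sitting close to the zone count.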
R. Fairchild: BeiGene, 5, Boehringer-Ingelheim, 5, Dren Bio, 2, Gilead, 5, Sonoma Pharmaceuticals, 5; D. Mar: None; M. Deluna: None; M. Baker: None; S. Tamang: None; H. Guo: None; D. Fiorentino: Argenyx, 2, Biogen, 2, Kyverna, 2, Pfizer, 2, Priovant, 2, Serono, 5; L. Chung: Boehringer-Ingelheim, 5, Eicos, 1, 2, Eli Lilly, 2, Genentech, 2, IgM Biosciences, 2, Janssen, 1, Kyverna, 2, Mitsubishi Tanabe, 1, 2.