
FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 4)

We evaluate FedEFM through diverse experiments on our collected EI dataset.

Federated Endovascular Foundation Model Validation

We first validate our proposed method (FedEFM) and compare it with different foundation models under different learning scenarios. In particular, we consider three scenarios: Centralized Local Learning (CLL), Client-server Federated Learning (CFL), and Decentralized Federated Learning (DFL). Note that CLL is the traditional training scenario (i.e., no federated learning) where data are merged for local training. We benchmark against multiple algorithms, including CLIP, SAM, LVM-Med, FedAvg, MOON, STAR, MATCHA, RING, and CDL. We use the ViT backbone in all benchmarked algorithms and train on the datasets designated for the training phase (see the dataset summary in Part 2). Note that our default setup maintains a 100% unseen label corpus.

Figure 1 shows the comparison with different algorithms across learning scenarios. When we train ViT in the CFL and DFL setups using FedAvg and MATCHA, the accuracy is only 80.9% and 42.4%, respectively, reflecting the inherent challenges of federated learning. Applying our proposed FedEFM method improves accuracy substantially, to 98.2% and 97.5%. These results show that our method obtains competitive results even compared with centralized training, which gathers all data, while incurring only a minor cycle-time trade-off compared with most federated learning methods.

Figure 1. Foundation model performance comparison

Downstream Task Fine-tuning

We use the ViT backbone and fine-tune it using our FedEFM and different foundation models, including CLIP, SAM, and LVM-Med. Note that all models are evaluated on segmentation and classification tasks in endovascular intervention.

Metrics. We use Accuracy (%) for the classification task; the 2D Dice score, mIoU, and Jaccard metric are used for the segmentation task. For segmentation, we compare on our collected EIPhantom and EISimulation datasets and on CathAnimal. For classification, we benchmark on the RANZCR dataset.
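For concreteness, the segmentation metrics reduce to simple set overlaps on binary masks; below is a minimal NumPy sketch (our helper names, not the official evaluation code):

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """2D Dice score between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def jaccard_index(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Jaccard index (IoU) between two binary masks; mIoU averages this over classes."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)
```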

Figure 2 shows the comparison between our method and other foundation models. The results show that the ViT backbone trained under our proposed algorithm outperforms other models by a clear margin. Furthermore, models trained on medical data, such as LVM-Med and our FedEFM, achieve better results than models trained on non-medical data, such as CLIP and SAM. This confirms the importance of developing domain-specific foundation models in the medical domain.

Figure 2. Fine-tuning results of different foundation models on endovascular classification and segmentation tasks.

Ablation Study

Unseen Data Proportion Analysis

Figure 3 presents an analysis of our method under different percentages of unseen data. In this experiment, we assume that each silo (hospital) keeps only certain data types (e.g., human, animal, or simulated X-ray), so the silos' data corpora overlap only up to a given percentage. A 100% unseen data corpus means that the data types of each hospital silo share no similarity with those of the others. As the percentage of unseen data types increases, we observe a notable decline in the accuracy of the baseline in the CFL and DFL scenarios. However, our proposed approach demonstrates remarkable resilience to unseen data, maintaining high accuracy even when confronted with a higher percentage of unfamiliar semantic data. In the extreme case where all data labels are unseen (100%), ViT under the CFL and DFL scenarios exhibits significantly lower accuracies of 32.1% and 23.8%, respectively. In contrast, our approach achieves an accuracy of 84.9%, showcasing its effectiveness in handling unseen data.

Figure 3. Results with different unseen data proportions.

Backbone Analysis

We verify the stability of our method on different networks, including UNet, TransUNet, SwinUnet, and ViT, under a federated learning scenario. Figure 4 shows the performance of the different backbones when we fine-tune them using our FedEFM. Using our foundation model to initialize the weights of these backbones significantly improves the results. These results validate the effectiveness of our training process in addressing the unseen data problem and show that FedEFM is useful across backbones in endovascular downstream tasks.

Figure 4. Performance of different networks when fine-tuned using our foundation model.

Figure 5 illustrates the catheter and guidewire segmentation results when fine-tuning ViT with our method and with different foundation models. The visualization shows that our method excels at accurately delineating the catheter and guidewire structures, achieving superior segmentation performance compared with other approaches. This figure further confirms that we can successfully train a federated endovascular foundation model without collecting users' data, and that the trained foundation model is useful for the downstream segmentation task.

Figure 5. Catheter and guidewire segmentation results with different foundation models.

Limitations

While our proposed approach demonstrates significant potential, it is subject to certain limitations that warrant further investigation. First, the requirement for additional weight exchange among silos extends the overall training time, although this is mitigated to some extent by our method's faster convergence compared to other approaches. Additionally, our method is designed for deployment in silos with strong GPU computing resources, so the varying hardware capabilities found in many real-world federated learning networks require further examination. Overcoming these limitations will open new research directions in federated foundation learning for endovascular interventions and other medical applications. Furthermore, managing heterogeneous data distributions and ensuring robust data privacy remain critical challenges. Moving forward, we plan to extend our approach to robotic-assisted endovascular surgery and other areas, such as pathology, to further investigate the application of federated foundation models in medical imaging and robotic systems.

Conclusion

We present a new approach to train an endovascular foundation model in a federated learning setting, leveraging differentiable Earth Mover's Distance and knowledge distillation to handle the unseen data issue. Our method ensures that once the foundational model is trained, its weights can be effectively fine-tuned for downstream tasks, thereby enhancing performance. Our approach achieves state-of-the-art results and contributes to the field of endovascular intervention, particularly by addressing the critical issue of data sharing in the medical domain. By enabling weight exchange among local silos and fostering knowledge transfer, our method improves model generalization while preserving data privacy. Experimental results across various endovascular imaging tasks validate the efficacy of our approach, demonstrating its potential for application in privacy-sensitive medical domains. We will release our implementation and trained models to facilitate reproducibility and further research.

FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 3)

In this part, we dive deeply into how we train our FedEFM and integrate the resulting weights into various downstream tasks.

Method

Notations

We describe the notations used in the methodology:

| Notation | Description |
| --- | --- |
| $\theta_i(k)$ | Weights of silo $i$ in communication round $k$ |
| $\xi_i$ | Mini-batch data in silo $i$ |
| $\theta_{i \rightarrow j}$ | Weights of silo $i$ transferred to silo $j$ |
| $\hat{\theta}_{i \rightarrow j}$ | Weights successfully transferred from silo $j$ back to silo $i$ |
| $\mathcal{L}_c$ | Foundation loss function |
| $\alpha$ | Learning rate |
| $\mathcal{N}_i$ | In-neighbors of silo $i$ in the topology |
| $T$ | Temperature for distillation |
| $\vartheta \in \{0,1\}$ | Accumulation status |
| $N$ | Number of silos |

Federated Distillation

Figure 1 demonstrates the algorithm used for training a foundation model within a decentralized federated learning process, effectively addressing the unseen data problem.

Specifically, in the initial round, the local model weights $\theta_i$ of each $i$-th hospital silo are trained using its respective local data $\xi_i$. In the next communication round, we first perform overseas training, where the local model weights $\theta_i$ of each $i$-th silo are transmitted to each of its $j$-th neighbor hospital silos. This process lets the local weights $\theta_i$ learn knowledge from the data $\xi_j$ of the $j$-th neighbor silo.

In a specific $(k+1)$-th communication round, each transferred weight $\theta_{i\rightarrow j}$ is optimized in the $j$-th silo using the following equation:

$$\theta_{i\rightarrow j}(k+1)= \theta_{i\rightarrow j}(k)-\alpha_{k}\nabla \mathcal{L}_{c}\left(\theta_{i\rightarrow j}(k),\xi_j(k)\right) \quad (1)$$
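Equation (1) is a standard SGD step on the neighbor's data. The following PyTorch sketch illustrates it with assumed names (`model_i_to_j` for $\theta_{i\rightarrow j}$, `foundation_loss` for $\mathcal{L}_c$, `batch_j` for $\xi_j$):

```python
import torch

def overseas_step(model_i_to_j, foundation_loss, batch_j, lr):
    """One overseas update (Eq. 1): silo i's weights train on silo j's batch."""
    optimizer = torch.optim.SGD(model_i_to_j.parameters(), lr=lr)
    optimizer.zero_grad()
    loss = foundation_loss(model_i_to_j, batch_j)  # L_c(theta_{i->j}(k), xi_j(k))
    loss.backward()
    optimizer.step()  # theta_{i->j}(k+1) = theta_{i->j}(k) - alpha_k * grad
    return model_i_to_j
```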

Then, we perform knowledge transfer, where each overseas expert $\theta_{i\rightarrow j}$ learned in the previous step is transferred back to the $i$-th silo.

In the local silo $i$, the local weights are updated based on both the original weights $\theta_{i}$ and the transferred weights $\hat{\theta}_{i\rightarrow j}$ learned from the neighbor silo $j$. In particular, we aim to find regions that share similarities between the two weights using the Earth Mover's Distance $\text{EMD}(\theta_{i}, \hat{\theta}_{i\rightarrow j})$. In this way, the distance measures the contribution of the transferred weights during distillation, enabling the local silo to learn from its neighbors while avoiding divergence when weight convergence goals differ significantly. The local weights $\theta_{i}$ are then optimized using the following equation:

$$\theta_{i}(k+1) = \theta_i(k)-\alpha_{k}\sum_{j \in \mathcal{N}(i)}\text{EMD}\left(\theta_{i}, \hat{\theta}_{i\rightarrow j},k\right)\nabla \mathcal{L}^i_{\rm MD}\left(\theta_i(k),\hat{\theta}_{i\rightarrow j}(k),\xi_i(k)\right) \quad (2)$$
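The following sketch shows how the update in Equation (2) could be orchestrated; `emd_similarity` and `distillation_loss` are assumed helpers standing in for $\text{EMD}(\cdot)$ and $\mathcal{L}^i_{\rm MD}$, which are detailed in the next sections:

```python
import torch

def local_update(model_i, returned_experts, batch_i, lr,
                 emd_similarity, distillation_loss):
    """EMD-weighted distillation update of silo i's local weights (Eq. 2).

    returned_experts: dict {j: expert} of overseas experts theta_hat_{i->j}
    sent back from each in-neighbor j.
    """
    optimizer = torch.optim.SGD(model_i.parameters(), lr=lr)
    optimizer.zero_grad()
    total_loss = 0.0
    for j, expert in returned_experts.items():
        weight = emd_similarity(model_i, expert, batch_i)  # EMD(theta_i, theta_hat_{i->j})
        total_loss = total_loss + weight * distillation_loss(model_i, expert, batch_i)
    total_loss.backward()
    optimizer.step()
    return model_i
```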

Differentiable Earth Mover's Distance

Assume that an input sample $\xi_i$ from the $i$-th local silo passes through the foundation architecture $\theta_{i}$ to generate the dense representation $\mathbf{U} \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ denote the spatial size of the feature map and $C$ is the feature dimension. In parallel, $\mathbf{V} \in \mathbb{R}^{H \times W \times C}$ denotes the dense representation when $\xi_i$ passes through $\hat{\theta}_{i\rightarrow j}$.
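As a sketch, the two node sets could be produced as below, assuming a backbone whose `forward_features` call returns a $(B, C, H, W)$ feature map (a timm-style convention; ViT token outputs would be reshaped analogously):

```python
import torch

def dense_nodes(model, x):
    """Flatten a backbone feature map into HW embedding nodes for EMD matching."""
    feat = model.forward_features(x)            # assumed API: (B, C, H, W)
    B, C, H, W = feat.shape
    return feat.permute(0, 2, 3, 1).reshape(B, H * W, C)  # (B, HW, C)
```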

In the Earth Mover setting, $\mathbf{V}$ represents suppliers transporting goods to demanders $\mathbf{U}$. The $\text{EMD}$ between the two feature sets $\mathbf{U} = \{u_1, u_2, \ldots, u_{HW}\}$ and $\mathbf{V} = \{v_1, v_2, \ldots, v_{HW}\}$ can then be computed as:

$$\text{EMD}(\theta_{i}, \hat{\theta}_{i \rightarrow j}) = \text{EMD}(\mathbf{U}, \mathbf{V}) = \sum_{p=1}^{HW} \sum_{q=1}^{HW} (1 - c_{pq}) \tilde{x}_{pq} \quad (3)$$

where $\tilde{x}$ is taken from the optimal matching flow $\tilde{X} = \{x_1, x_2, \ldots, x_{pq}\}$ over all sample pairs of the two sets $\mathbf{U}$ and $\mathbf{V}$; $c_{pq}$ is the cost per unit transported from supplier to demander, obtained by computing the pairwise distance between embedding nodes $u_p \in \mathbf{U}$ and $v_q \in \mathbf{V}$.

The cost per unit $c_{pq}$ is computed as below and also plays a vital role in computing the optimal matching flow:

$$c_{pq} = 1 - \frac{u_p^T v_q}{\|u_p\|\|v_q\|}$$

where nodes with similar representations generate small matching costs between each other. The optimal matching flow $\tilde{X}$ is then obtained by optimizing $\tilde{x}$ as below:

$$\underset{x}{\text{minimize}} \quad \sum_{p=1}^{HW} \sum_{q=1}^{HW} c_{pq} x_{pq} \\ \text{subject to} \quad x_{pq} > 0, \quad p = 1, \ldots, HW, \; q = 1, \ldots, HW \\ \sum_{p=1}^{HW} x_{pq} = v_q, \quad q = 1, \ldots, HW \\ \sum_{q=1}^{HW} x_{pq} = u_p, \quad p = 1, \ldots, HW$$
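This program is a small optimal-transport LP and can be solved directly. Below is a NumPy/SciPy sketch with uniform node masses (our simplifying assumption) and the strict inequality $x_{pq} > 0$ relaxed to $x_{pq} \geq 0$, as is standard for LP solvers; production code would use a dedicated OT solver for large $HW$:

```python
import numpy as np
from scipy.optimize import linprog

def solve_emd_lp(U: np.ndarray, V: np.ndarray):
    """Solve the EMD matching LP between two (HW, C) feature sets.

    Returns the optimal flow (HW, HW) and the EMD score of Eq. (3).
    """
    n = U.shape[0]
    # Cosine cost per unit: c_pq = 1 - <u_p, v_q> / (||u_p|| ||v_q||)
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    C = 1.0 - Un @ Vn.T

    # Node masses; uniform here, other weightings are possible
    u = np.full(n, 1.0 / n)
    v = np.full(n, 1.0 / n)

    # Equality constraints: sum_q x_pq = u_p and sum_p x_pq = v_q
    A_eq, b_eq = [], []
    for p in range(n):
        row = np.zeros((n, n)); row[p, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(u[p])
    for q in range(n):
        col = np.zeros((n, n)); col[:, q] = 1.0
        A_eq.append(col.ravel()); b_eq.append(v[q])

    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    flow = res.x.reshape(n, n)
    score = float(np.sum((1.0 - C) * flow))    # Eq. (3)
    return flow, score
```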

In this formulation, EMD seeks an optimal matching $\tilde{X}$ between suppliers and demanders such that the overall matching cost is minimized. The globally optimal matching flow $\tilde{X}$ can be achieved by solving a linear programming (LP) problem. For the sake of completeness, we transform the above optimization into a compact matrix form:

$$\underset{x}{\text{minimize}} \quad c(\theta)^T x \\ \text{subject to} \quad G(\theta)x \leq h(\theta), \quad A(\theta)x = b(\theta).$$

Here $x \in \mathbb{R}^{HW \times HW}$ is our optimization variable. $Ax = b$ represents the equality constraints and $Gx \leq h$ the inequality constraints of our optimization problem. Accordingly, the Lagrangian of the LP problem is given by:

$$L(\theta, x, \nu, \lambda) = c^T x + \lambda^T (Gx - h) + \nu^T (Ax - b),$$

where $\nu$ denotes the dual variables on the equality constraints and $\lambda \geq 0$ denotes the dual variables on the inequality constraints. Following the KKT conditions, we obtain the optimum $(\tilde{x}, \tilde{\nu}, \tilde{\lambda})$ of the objective function by solving $g(\theta, \tilde{x}, \tilde{\nu}, \tilde{\lambda}) = 0$ with primal-dual interior point methods, where

$$g(\theta, x, \nu, \lambda) = \begin{bmatrix} \nabla_{x} L(\theta, x, \nu, \lambda) \\ \textbf{diag}(\lambda)(G(\theta)x - h(\theta)) \\ A(\theta)x - b(\theta) \end{bmatrix}.$$

Then, with the theorem below, we can derive the gradients of the LP parameters.

Suppose $g(\theta, \tilde{\lambda}, \tilde{\nu}, \tilde{x}) = 0$. Then, when all derivatives exist, the partial Jacobian of $\tilde{x}$ with respect to $\theta$ at the optimal solution $(\tilde{\lambda}, \tilde{\nu}, \tilde{x})$, namely $J_{\theta}\tilde{x}$, can be obtained by satisfying:

$$J_{\theta}\tilde{x} = - \left( J_{x}\, g(\theta, \tilde{\lambda}, \tilde{\nu}, \tilde{x}) \right)^{-1} J_{\theta}\, g(\theta, \tilde{x}, \tilde{\nu}, \tilde{\lambda}).$$

Then, applying the KKT conditions, the (partial) Jacobian with respect to $\theta$ can be defined as:

$$J_{\theta}\, g(\theta, \tilde{\lambda}, \tilde{\nu}, \tilde{x}) = \begin{bmatrix} J_{\theta} \nabla_{x} L(\theta, \tilde{x}, \tilde{\nu}, \tilde{\lambda}) \\ \textbf{diag}(\tilde{\lambda})\, J_{\theta} (G(\theta)\tilde{x} - h(\theta)) \\ J_{\theta} (A(\theta) \tilde{x} - b(\theta)) \end{bmatrix}$$

After obtaining the optimal $\tilde{x}$, we can derive a closed-form gradient for $\theta$, enabling efficient backpropagation without altering the optimization path.
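For the common case where only the cost matrix $c$ depends on $\theta$ (as with the cosine cost above), there is a simpler shortcut consistent with this result: by the envelope theorem, the gradient of the optimal LP value $c^T\tilde{x}$ with respect to $c$ is just $\tilde{x}$, and the total flow mass is fixed by the constraints. Below is a minimal PyTorch sketch of this simplification (ours, not the paper's full derivation); `solve_emd_lp_from_cost` is an assumed variant of the earlier LP helper that takes a precomputed cost matrix:

```python
import torch

class EMDScore(torch.autograd.Function):
    """EMD score of Eq. (3), differentiable w.r.t. the cost matrix."""

    @staticmethod
    def forward(ctx, cost):
        # Solve the LP on the detached cost (no autograd through the solver)
        flow, _ = solve_emd_lp_from_cost(cost.detach().cpu().numpy())  # assumed helper
        flow = torch.as_tensor(flow, dtype=cost.dtype, device=cost.device)
        ctx.save_for_backward(flow)
        return ((1.0 - cost) * flow).sum()

    @staticmethod
    def backward(ctx, grad_output):
        (flow,) = ctx.saved_tensors
        # d/dc [sum((1 - c) * flow(c))] = -flow at the optimum: the flow's own
        # dependence on c contributes no first-order change (envelope theorem).
        return grad_output * (-flow)

# Usage: score = EMDScore.apply(cost_matrix); score.backward()
```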

Figure 1. Federated Knowledge Distillation pipeline with EMD Distance

Training

The distillation loss of the $i$-th silo, $\mathcal{L}^i_{\rm MD}$, based on the student model loss, is designed as:

$$\mathcal{L}^i_{\rm MD} = \beta T^2 \sum^{\mathcal{N}(i)}_{j=1} \mathcal{L}_{c}\left(Q^\tau_{S_i}, Q^\tau_{T_{i\rightarrow j}}\right) + (1-\beta)\,\mathcal{L}_{c}(Q_{S_i},y^i_{true}) \quad (11)$$

where $Q_{S_i}$ is the standard softmax output of the local student; $y^i_{true}$ are the ground-truth labels; $\beta$ is a hyper-parameter controlling the importance of each loss component; and $Q^\tau_{S_i}, Q^\tau_{T_{i\rightarrow j}}$ are the softened outputs of the $i$-th local student and the $j$-th overseas teachers, computed with the same temperature parameter:

$$Q^\tau_k = \frac{\exp(l_k/T)}{\sum_{k} \exp(l_k/T)}$$

where the logit $l_k$ is output from the penultimate layer of both the teacher and student models. In addition, the objective term computed for each $j$-th transferable weight contribution is weighted by the corresponding EMD to ensure learning convergence.
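Putting Equation (11) together, here is a sketch using KL divergence as the softened-term instantiation of $\mathcal{L}_c$ (a common choice in distillation; an assumption on our part):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, targets,
                      T: float = 2.0, beta: float = 0.5):
    """L_MD of Eq. (11): softened teacher matching plus hard-label term."""
    log_q_student = F.log_softmax(student_logits / T, dim=-1)   # Q^tau_{S_i}
    soft_term = 0.0
    for teacher_logits in teacher_logits_list:                  # one per teacher j
        q_teacher = F.softmax(teacher_logits / T, dim=-1)       # Q^tau_{T_{i->j}}
        soft_term = soft_term + F.kl_div(log_q_student, q_teacher,
                                         reduction="batchmean")
    hard_term = F.cross_entropy(student_logits, targets)        # L_c(Q_{S_i}, y_true)
    # T^2 rescales the softened-term gradients, as in standard distillation
    return beta * (T ** 2) * soft_term + (1.0 - beta) * hard_term
```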

When training in all silos is completed in each communication round, the local model weights of all silos are aggregated to obtain the global weights $\Theta = \sum^{N-1}_{i = 0}\vartheta_i\theta_i$, which are further utilized for downstream fine-tuning.
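The aggregation itself is a status-masked sum over state dicts; a minimal sketch is below (we normalize by the number of participating silos, which the formula leaves implicit):

```python
import torch

def aggregate(state_dicts, statuses):
    """Global weights Theta = sum_i vartheta_i * theta_i, with vartheta_i in {0, 1}."""
    participating = max(sum(statuses), 1)
    global_sd = {}
    for key in state_dicts[0]:
        acc = torch.zeros_like(state_dicts[0][key], dtype=torch.float32)
        for sd, status in zip(state_dicts, statuses):
            acc += status * sd[key].float()   # skip silos whose accumulation failed
        global_sd[key] = acc / participating
    return global_sd
```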

Next

In the next part, we will validate the effectiveness of our FedEFM on our Endovascular Intervention Dataset.

FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 2)

In this part, we will outline the data collection process and how the dataset is managed in the training and fine-tuning phases.

Robotic Setup

To collect large-scale X-ray images, we employ a robotic platform and a full-size silicone phantom. A surgeon uses a master joystick to control a follower robot for cannulating three arteries: the left subclavian artery (LSA), left common carotid artery (LCCA), and right common carotid artery (RCCA). During each catheterization procedure, the surgeon activates the X-ray fluoroscopy using a pedal in the operating room. The experiments are conducted using the Epsilon X-ray Generator. We develop a real-time image grabber to transmit the video feed of the surgical scene to a workstation equipped with an 8-core ARM v8.2 64-bit CPU. Overall, we collect and label 4,700 new X-ray images to create our EIPhantom dataset. An overview of our robotic setup is shown in Figure 1.

Figure 1. Data collection with endovascular robot.

Data Collection

Apart from the X-ray images collected with our real robot, we also collect an EISimulation dataset of simulated X-ray images from the CathSim simulator. We manually label the data from both the robot and the CathSim simulator for use in downstream tasks. Note that the datasets used to train the foundation model are not used in the downstream endovascular understanding tasks. Figure 2 provides a detailed summary of the datasets used for training and fine-tuning, and Figure 3 visualizes samples from each dataset.

Figure 2. X-ray dataset used in our work.

Datasets shown: CathAction, Vessel12, DRIVE, SenNet, Medical Decathlon, EI Simulator, EI Phantom, RANZCR, and CathAnimal.

Figure 3. Visualization of datasets used in our work

Motivation

Our goal is to train a federated foundation model for endovascular intervention that incorporates all available types of X-ray data. However, in practice, each hospital (silo) possesses specific data sources that may not be accessible to others. This results in disparities in data corpora across institutions, meaning that certain datasets are present in one hospital but absent in another. Figure 4 illustrates this challenge, which leads to the unseen data issue—an obstacle that must be addressed to ensure effective federated training.

Federated learning preserves data privacy by preventing direct data sharing while allowing the exchange of model weights among hospital silos. To leverage this feature, we introduce the Federated Endovascular Foundation Model (FedEFM), a multishot federated distillation algorithm that utilizes Earth Mover’s Distance (EMD) to facilitate learning. Our approach enables local silo models to learn from neighboring silos and incorporate the acquired knowledge back into their own models through a distillation process. Unlike traditional methods that require consistent label sets across both local and global models trained within federated silos, devices, or servers, our method ensures seamless federated training without requiring hospitals to share their datasets—further enhancing data privacy. Additionally, once trained, the foundational model’s weights provide a valuable initialization for downstream tasks.

Figure 4. Unseen data issue

Next

In the next part, we will explore the technical method to train our FedEFM.

FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 1)

In endovascular surgery, the precise identification of catheters and guidewires in X-ray images is essential for reducing intervention risks. However, accurately segmenting catheter and guidewire structures is challenging due to the limited availability of labeled data. Foundation models offer a promising solution by enabling the collection of similar-domain data to train models whose weights can be fine-tuned for downstream tasks. Nonetheless, large-scale data collection for training is constrained by the necessity of maintaining patient privacy. This paper proposes a new method to train a foundation model in a decentralized federated learning setting for endovascular intervention. To ensure the feasibility of the training, we tackle the unseen data issue using differentiable Earth Mover's Distance within a knowledge distillation framework. Once trained, our foundation model's weights provide valuable initialization for downstream tasks, thereby enhancing task-specific performance. Intensive experiments show that our approach achieves new state-of-the-art results, contributing to advancements in endovascular intervention and robotic-assisted endovascular surgery, while addressing the critical issue of data sharing in the medical domain.

Introduction

Endovascular surgery is a minimally invasive procedure that diagnoses and treats vascular diseases, with advantages such as reduced trauma and quick recovery times. During endovascular surgery, surgeons use catheters and guidewires to access arteries. However, the procedure also entails risks, such as potential vessel-wall damage. Precise identification of catheters and guidewires within X-ray images is crucial for patient safety. The rise of deep learning has played a vital role in improving surgical precision and enhancing patient safety in endovascular intervention. However, accurately segmenting intricate catheters and guidewires in X-ray images remains challenging due to the limited quantity of data.

Recently, vision-language models have gained significant attention from researchers across various fields. Foundation models like CLIP and ALIGN have demonstrated strong capabilities in cross-modal alignment and zero-shot learning tasks. In the medical field, EndoFM and LVM-Med have been introduced as foundation models designed to handle medical data across multiple modalities. While these models perform well on downstream tasks, they typically assume that data can be centrally collected and trained, which is often difficult in medical applications.

Gathering large-scale medical data is particularly challenging due to privacy concerns. To address this issue, federated learning has emerged as a potential solution, allowing models to be trained collaboratively across multiple hospital silos without requiring direct access to patient data.

Despite its benefits, federated learning faces challenges such as ensuring stable convergence across different silos and handling heterogeneous data. In endovascular interventions, these challenges arise primarily from variations in data collected from different sources, leading to domain gaps in X-ray images. As illustrated in Figure 1, X-ray images from various endovascular datasets differ significantly. Additionally, due to privacy restrictions, datasets containing real human X-ray images tend to be smaller than those obtained from animal models, silicone phantoms, or simulated environments.

Panels: Phantom X-ray, Animal X-ray, Human X-ray, Simulation X-ray.

Figure 1. Different types of endovascular X-ray data. We aim to train a foundation model that can leverage diverse X-ray data from multiple hospitals (silos).

In this work, our goal is to train a foundation model using diverse endovascular datasets with federated learning. Since we aim to use all possible endovascular data (i.e., from humans, animals, phantoms, etc.), there is an unseen data problem between silos. To tackle this problem, we propose the Federated Endovascular Foundation Model (FedEFM), a new distillation algorithm using differentiable Earth Mover's Distance (EMD). Once trained, FedEFM provides crucial initializations for downstream tasks, thereby enhancing task-specific performance. Our approach outperforms existing methods and holds significant potential for application in robotic-assisted endovascular surgery, while effectively maintaining data privacy.

Our contributions can be summarized as follows:

  • We propose a new method to train a federated endovascular foundation model with unseen data using a multishot distillation technique.
  • We propose the Multishot Foundation Federated Distillation algorithm (MFD), powered by differentiable Earth Mover's Distance, to address the unseen label corpus issue and ensure the feasibility of learning for the foundation model.
  • We collect new datasets for training endovascular foundation models. Our proposed model is verified under several downstream tasks.

Related Works

Endovascular Intervention

Endovascular intervention has greatly improved the treatment of vascular diseases such as aneurysms and embolisms using X-ray fluoroscopy. However, these procedures still encounter challenges, including low contrast, complex anatomical structures, and the scarcity of expert-annotated data. Recent studies have aimed to address these issues through advancements in imaging technology and machine learning techniques. For instance, researchers have proposed an enhanced U-Net-based approach for localizing guidewire endpoints in X-ray images. More recently, FW-Net has been introduced to improve catheter segmentation by utilizing frame-to-frame temporal consistency. While many studies focus on conventional tasks, few have explored the development of foundation models for endovascular intervention. A key obstacle is the strict requirement for patient data privacy, which significantly limits the ability to train such models.

Figure 2. Endovascular procedure.

Federated Learning

Federated learning has emerged as a promising solution for training machine learning models on decentralized data while preserving data privacy. This approach is especially valuable in the medical field, where data sensitivity and confidentiality are critical concerns. Although numerous studies have investigated the use of federated learning for training foundation models in healthcare, privacy challenges can be mitigated but not entirely eliminated. A major hurdle is the heterogeneity and non-IID nature of medical data across different institutions. Moreover, the issue of unseen data, where certain data types appear in some datasets but are missing in others, further complicates model training and generalization.

Figure 3. Decentralized federated learning setup

Knowledge Distillation with Earth Mover's Distance

Knowledge distillation involves transferring knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). In the context of federated learning, distillation can be used to enable local models to learn from aggregated global models without sharing raw data. The Earth Mover's Distance (EMD), also known as the Wasserstein distance, measures the dissimilarity between two probability distributions and is particularly useful for comparing distributions that do not have overlapping support. By leveraging the differentiable EMD, it is possible to align distributions of labels across different models, facilitating better model convergence and knowledge transfer. In this paper, we leverage EMD within a distillation training process to address the unseen label data issue when training endovascular foundation models in federated scenarios.

Next

In the next part, we will explore the process of collecting and managing our dataset for training the foundation model.