14 posts tagged with "medical"


FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 4)

We evaluate FedEFM through diverse experiments on our collected EI dataset.

Federated Endovascular Foundation Model Validation

We first validate our proposed method (FedEFM) and compare it with different foundation models under different learning scenarios. In particular, we consider three scenarios: Centralized Local Learning (CLL), Client-server Federated Learning (CFL), and Decentralized Federated Learning (DFL). We note that CLL is the traditional training scenario (i.e., no federated learning) where data are merged for local training. Several algorithms are included for comparison, including CLIP, SAM, LVM-Med, FedAvg, MOON, STAR, MATCHA, RING, and CDL. We use a ViT backbone in all benchmarked algorithms and train on the datasets designated for the training phase. Note that our default setup uses a 100% unseen label corpus.

Figure 1 shows the comparison with different algorithms under multiple learning scenarios. When we train ViT in the CFL and DFL setups using FedAvg and MATCHA, the accuracy is only 80.9% and 42.4%, respectively, reflecting the inherent challenges of federated learning. Applying our proposed FedEFM method improves the accuracy substantially, to 98.2% and 97.5%. These results show that our method obtains competitive results even compared with centralized training, which can gather all data, while incurring only a minor cycle-time trade-off compared with most federated learning methods.

Figure 1. Foundation model performance comparison

Downstream Task Fine-tuning

We use a ViT backbone and fine-tune it using our FedEFM and different foundation models, including CLIP, SAM, and LVM-Med. Note that all models are evaluated on segmentation and classification tasks in endovascular intervention.

Metrics. We use Accuracy (%) for the classification task; the 2D Dice score, mIoU, and Jaccard index are used for the segmentation task. For segmentation, we compare on our collected EIPhantom and EISimulation datasets and on CathAnimal. For classification, we benchmark on the RANZCR dataset.

Figure 2 shows the comparison between our method and other foundation models. The results show that the ViT backbone trained with our proposed algorithm outperforms other models by a clear margin. Furthermore, models trained on medical data, such as LVM-Med and our FedEFM, achieve better results than models trained on non-medical data, such as CLIP and SAM. This shows that developing a domain-specific foundation model is important in the medical domain.

Figure 2. Fine-tuning results on different foundation models on endovascular classification & segmentation task.

Ablation Study

Unseen Data Proportion Analysis

Figure 3 presents an analysis of our method under different percentages of unseen data. In this experiment, we assume that each silo (hospital) keeps only certain types of data (e.g., human, animal, or simulated X-ray), and the data corpora of different silos overlap only to a given percentage. A 100% unseen data corpus means that the data types of each hospital silo share no similarity with those of the others. As the percentage of unseen data types increases, we observe a notable decline in the accuracy of the baselines in the CFL and DFL scenarios. However, our proposed approach demonstrates remarkable resilience to unseen data, maintaining high accuracy even when confronted with a higher percentage of unfamiliar semantic data. In the extreme case where all data labels are unseen (100%), ViT under the CFL and DFL scenarios exhibits significantly lower accuracies of 32.1% and 23.8%, respectively. In contrast, our approach achieves an accuracy of 84.9%, showcasing its effectiveness in handling unseen data.

Figure 3. Result with different unseen data proportions.

Backbone Analysis

We verify the stability of our method on different networks, including UNet, TransUNet, SwinUnet, and ViT, under the federated learning scenario. Figure 4 shows the performance of the different backbones when we fine-tune them using our FedEFM. Using our foundation model to initialize the weights of these backbones significantly improves the results. These results validate the effectiveness of our training process in addressing the unseen data problem and show that our FedEFM is useful for different backbones in endovascular downstream tasks.

Figure 4. Performance of different networks when fine-tuned using our foundation model.

Figure 5 illustrates the catheter and guidewire segmentation results obtained by fine-tuning ViT with our method and with different foundation models. The visualization shows that our method excels in accurately delineating the catheter and guidewire structures, achieving superior segmentation performance compared to other approaches. This figure further confirms that we can successfully train a federated endovascular foundation model without collecting users' data and that the trained foundation model is useful for the downstream segmentation task.

Figure 5. Qualitative catheter and guidewire segmentation results.

Limitations

While our proposed approach demonstrates significant potential, it is subject to certain limitations that warrant further investigation. Firstly, the requirement for additional weight exchange among silos extends the overall training time. However, this limitation is mitigated to some extent by the higher convergence speed of our method compared to other approaches. Additionally, our method is designed for deployment in silos with strong GPU computing resources, but the varying hardware capabilities present in many real-world federated learning networks necessitate further examination. Overcoming these limitations will open new research in federated foundation learning for endovascular interventions and other medical applications. Furthermore, addressing the challenges of managing heterogeneous data distributions and ensuring robust data privacy remains a critical focus. Moving forward, we plan to extend our approach to robotic-assisted endovascular surgery and other areas, such as pathology, to further investigate the application of federated foundation models in medical imaging and robotic systems.

Conclusion

We present a new approach to train an endovascular foundation model in a federated learning setting, leveraging differentiable Earth Mover's Distance and knowledge distillation to handle the unseen data issue. Our method ensures that once the foundational model is trained, its weights can be effectively fine-tuned for downstream tasks, thereby enhancing performance. Our approach achieves state-of-the-art results and contributes to the field of endovascular intervention, particularly by addressing the critical issue of data sharing in the medical domain. By enabling weight exchange among local silos and fostering knowledge transfer, our method improves model generalization while preserving data privacy. Experimental results across various endovascular imaging tasks validate the efficacy of our approach, demonstrating its potential for application in privacy-sensitive medical domains. We will release our implementation and trained models to facilitate reproducibility and further research.

FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 3)

In this part, we dive deeply into how we train our FedEFM and integrate the trained weights into various downstream tasks.

Method

Notations

We describe the notations used in our methodology below.

| Notation | Description |
| --- | --- |
| $\theta_i(k)$ | Weights of silo $i$ in communication round $k$ |
| $\xi_i$ | Mini-batch data in silo $i$ |
| $\theta_{i \rightarrow j}$ | Weights of silo $i$ transferred to silo $j$ |
| $\hat{\theta}_{i \rightarrow j}$ | Weights successfully transferred back from silo $j$ to silo $i$ |
| $\mathcal{L}_c$ | Foundation loss function |
| $\alpha$ | Learning rate |
| $\mathcal{N}_i$ | In-neighbors of silo $i$ in the topology |
| $T$ | Temperature for distillation |
| $\vartheta \in \{0,1\}$ | Accumulation status |
| $N$ | Number of silos |

Federated Distillation

Figure 1 demonstrates the algorithm used to train a foundation model within a decentralized federated learning process, effectively addressing the unseen data problem.

Specifically, in the initial round, the local model weights θi\theta_i of each ii-th hospital silo are trained using their respective local data ξi\xi_i. In the next communication round, we first perform overseas training, where the local model weights θi\theta_i of each ii-th silo are transmitted to each of their jj-th neighbor hospital silos. This process lets the local weights θi\theta_i learn knowledge from the data ξj\xi_j of each jj-th neighbor silo.

In the (k+1)(k+1)-th communication round, each transferred weight θij\theta_{i\rightarrow j} is optimized in the jj-th silo using the following equation:

θij(k+1)=θij(k)αkLc(θij(k),ξj(k))   (1)\theta_{i\rightarrow j}(k+1)= \theta_{i\rightarrow j}\left(k\right)-\alpha_{k}\nabla \mathcal{L}_{c}\left(\theta_{i\rightarrow j}\left(k\right),\xi_j\left(k\right)\right) \ \ \ (1)

Then, we perform knowledge transfer where each learned overseas expert θij\theta_{i\rightarrow j} from the previous step is transferred back to the ii-th silo.

In the local silo ii, the local weight is updated based on both the original weight θi\theta_{i} and the transferred weights θ^ij\hat \theta_{i\rightarrow j} that are learned from the neighbour silo jj. In particular, we aim to find regions that share similarities between the two weights using the Earth Mover’s Distance EMD(θi,θ^ij)\text{EMD}( \theta_{i}, \hat \theta_{i\rightarrow j}). In this way, the distance measures the contribution of the transferred weights during distillation, enabling the local silo to learn from its neighbors while avoiding divergence when weight convergence goals differ significantly. The local weights θi\theta_{i} are then optimized using the following equation:

θi(k+1)=θi(k)αkjN(i)EMD(θi,θ^ij,k)LMDi(θi(k),θ^ij(k),ξi(k))  (2)\theta_{i}(k+1) = \theta_i(k)-\\ \alpha_{k}\sum_{j \in \mathcal{N}(i)}\text{EMD}( \theta_{i}, \hat \theta_{i\rightarrow j},k)\nabla \mathcal{L}^i_{\rm MD}\left({\theta}_i\left(k\right),{\hat \theta}_{i\rightarrow j}\left(k\right),\xi_i\left(k\right)\right) \ \ (2)
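
To make the two update rules concrete, below is a minimal PyTorch sketch of one communication round. The helper names (`emd_fn` for the detached EMD similarity weight described in the next subsection, and `md_loss_fn` for the distillation loss defined in the Training subsection) are hypothetical placeholders, not the authors' released implementation.

```python
import torch

def overseas_update(expert, batch_j, loss_fn, lr):
    """Eq. (1): optimise the transferred weights theta_{i->j} on the
    neighbour silo's local mini-batch xi_j with one SGD step."""
    expert.zero_grad()
    loss = loss_fn(expert(batch_j["x"]), batch_j["y"])
    loss.backward()
    with torch.no_grad():
        for p in expert.parameters():
            if p.grad is not None:
                p -= lr * p.grad
    return expert

def local_distillation_update(model_i, returned_experts, batch_i,
                              emd_fn, md_loss_fn, lr):
    """Eq. (2): update theta_i with the EMD-weighted sum of distillation
    gradients from the returned overseas experts hat{theta}_{i->j}."""
    grads = [torch.zeros_like(p) for p in model_i.parameters()]
    for expert in returned_experts:                  # one per in-neighbour j
        weight = emd_fn(model_i, expert, batch_i)    # assumed detached scalar EMD weight
        model_i.zero_grad()
        md_loss_fn(model_i, expert, batch_i).backward()
        for g, p in zip(grads, model_i.parameters()):
            if p.grad is not None:
                g += weight * p.grad
    with torch.no_grad():
        for p, g in zip(model_i.parameters(), grads):
            p -= lr * g
    return model_i
```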

Differentiable Earth Mover's Distance

Assume that the input sample ξi\xi_i from ii-th local silo passes through the foundation architecture θi\theta_{i} to generate the dense representation URH×W×C\mathbf{U} \in \mathbb{R}^{H \times W \times C}, where HH and WW denote the spatial size of the feature map and CC is the feature dimension. In a parallel manner, VRH×W×C\mathbf{V} \in \mathbb{R}^{H \times W \times C} also denotes the dense representation when ξi\xi_i passes through θ^ij\hat{\theta}_{i\rightarrow j}.

In the Earth Mover's setting, V\mathbf{V} represents suppliers transporting goods to demanders U\mathbf{U}. Then, the EMD\text{EMD} between two feature sets U={u1,u2,,uHW}\mathbf{U} = \{u_1, u_2, \ldots, u_{HW}\} and V={v1,v2,,vHW}\mathbf{V} = \{v_1, v_2, \ldots, v_{HW}\} can be computed as:

EMD(θi,θ^ij)=EMD(U,V)=p=1HWq=1HW(1cpq)x~pq   (3)\text{EMD}(\theta_{i}, \hat \theta_{i \rightarrow j}) = \text{EMD}(\mathbf{U}, \mathbf{V}) = \sum_{p=1}^{HW} \sum_{q=1}^{HW} (1 - c_{pq}) \tilde{x}_{pq} \ \ \ (3)

where $\tilde{x}_{pq}$ denotes the optimal matching flow $\tilde{X} = \{\tilde{x}_{pq}\}$ between each sample pair of the two sets $\mathbf{U}$ and $\mathbf{V}$; $c_{pq}$ is the cost per unit transported from supplier to demander and is obtained by computing the pairwise distance between embedding nodes $u_p \in \mathbf{U}$ and $v_q \in \mathbf{V}$.

The cost per unit cpqc_{pq} is computed as below and also plays a key role in computing the optimal matching flow:

cpq=1upTvqupvqc_{pq} = 1 - \frac{u_p^T v_q}{\|u_p\|\|v_q\|}

where nodes with similar representations tend to generate small matching costs between each other. Then, the optimal matching flow X~\tilde{X} is obtained by optimizing x~\tilde{x} as below:

minimizexp=1HWq=1HWcpqxpqsubject toxpq>0,p=1,,HW,  q=1,,HWp=1HWxpq=vq,q=1,,HWq=1HWxpq=up,p=1,,HW\underset{x}{\text{minimize}} \quad \sum_{p=1}^{HW} \sum_{q=1}^{HW} c_{pq} x_{pq} \\ \text{subject to} \quad x_{pq} > 0, \quad p = 1, \ldots, HW, \; q = 1, \ldots, HW\\ \sum_{p=1}^{HW} x_{pq} = v_q, \quad q = 1, \ldots, HW \\ \sum_{q=1}^{HW} x_{pq} = u_p, \quad p = 1, \ldots, HW

Here, EMD seeks an optimal matching X~\tilde{X} between suppliers and demanders such that the overall matching cost is minimized. The globally optimal matching flow X~\tilde{X} can be obtained by solving a linear programming (LP) problem. For the sake of completeness, we transform the above optimization into a compact matrix form:

minimizexc(θ)Txsubject toG(θ)xh(θ),A(θ)x=b(θ).\underset{x}{\text{minimize}} \quad c(\theta)^T x \\ \text{subject to} \quad G(\theta)x \leq h(\theta),\\ A(\theta)x = b(\theta).

Here xRHW×HWx \in \mathbb{R}^{HW \times HW} is our optimization variable. Ax=bAx = b represents the equality constraint and GxhGx \leq h denotes the inequality constraint in our optimization problem. Accordingly, the Lagrangian of the LP problem is given by:

L(θ,x,ν,λ)=cTx+λT(Gxh)+νT(Axb),L(\theta, x, \nu, \lambda) = c^T x + \lambda^T (Gx - h) + \nu^T (Ax - b),

where ν\nu denotes the dual variables on the equality constraints and λ0\lambda \geq 0 denotes the dual variables on the inequality constraints. Following the KKT conditions, we obtain the optimum (x~,ν~,λ~)(\tilde{x}, \tilde{\nu}, \tilde{\lambda}) of the objective function by solving g(θ,x~,ν~,λ~)=0g(\theta, \tilde{x}, \tilde{\nu}, \tilde{\lambda}) = 0 with primal-dual interior point methods, where

g(θ,x,ν,λ)=[θL(θ,x,ν,λ)diag(λ)(G(θ)xh(θ))A(θ)xb(θ)].g(\theta, x, \nu, \lambda) = \begin{bmatrix} \nabla_{\theta} L(\theta, x, \nu, \lambda) \\ \textbf{diag}(\lambda)(G(\theta)x - h(\theta)) \\ A(\theta)x - b(\theta) \end{bmatrix}.

Then, with the theorem below, we can derive the gradients of the LP parameters.

Suppose g(θ,λ~,ν~,x~)=0g(\theta, \tilde{\lambda}, \tilde{\nu}, \tilde{x}) = 0. Then, when all derivatives exist, the partial Jacobian of x~\tilde{x} with respect to θ\theta at the optimal solution (λ~,ν~,x~)(\tilde{\lambda}, \tilde{\nu}, \tilde{x}), namely Jθx~J_{\theta}\tilde{x}, can be obtained by satisfying:

Jθx~=(Jxg(θ,λ~,ν~,x~))1Jθg(θ,x~,ν~,λ~).J_{\theta}\tilde{x} = - \left( J_{x} g(\theta, \tilde{\lambda}, \tilde{\nu}, \tilde{x}) \right)^{-1} J_{\theta} g(\theta, \tilde{x}, \tilde{\nu}, \tilde{\lambda}).

Then, applying to the KKT conditions, the (partial) Jacobian with respect to θ\theta can be defined as:

Jθg(θ,λ~,ν~,x~)=[JθxL(θ,x~,ν~,λ~)diag(λ~)Jθ(G(θ)xh(θ))Jθ(A(θ)x~b(θ))]J_{\theta} g(\theta, \tilde{\lambda}, \tilde{\nu}, \tilde{x}) = \begin{bmatrix} J_{\theta} \nabla_{x} L(\theta, \tilde{x}, \tilde{\nu}, \tilde{\lambda}) \\ \textbf{diag}(\tilde{\lambda}) J_{\theta} (G(\theta)x - h(\theta)) \\ J_{\theta} (A(\theta) \tilde{x} - b(\theta)) \end{bmatrix}

After obtaining the optimal x~\tilde{x}, we can derive a closed-form gradient for θ\theta, enabling efficient backpropagation without altering the optimization path.
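
For intuition, the sketch below computes the cosine cost matrix and solves the transport LP numerically with `scipy.optimize.linprog`; the method described above instead solves the LP with a primal-dual interior point method so that gradients can flow through the optimal flow. Uniform node weights are an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def emd_from_features(U, V):
    """Compute the EMD of Eq. (3) between two feature sets.

    U, V: arrays of shape (HW, C) holding the dense representations produced
    by theta_i and by the returned expert hat{theta}_{i->j}.  Node weights
    are assumed uniform in this sketch.
    """
    n, m = U.shape[0], V.shape[0]
    # cosine cost c_pq = 1 - <u_p, v_q> / (||u_p|| ||v_q||)
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    C = 1.0 - Un @ Vn.T                              # (n, m) cost matrix

    u = np.full(n, 1.0 / n)                          # demander weights
    v = np.full(m, 1.0 / m)                          # supplier weights

    # equality constraints: row sums equal u_p, column sums equal v_q
    A_eq = np.zeros((n + m, n * m))
    for p in range(n):
        A_eq[p, p * m:(p + 1) * m] = 1.0             # sum_q x_pq = u_p
    for q in range(m):
        A_eq[n + q, q::m] = 1.0                      # sum_p x_pq = v_q
    b_eq = np.concatenate([u, v])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    x = res.x.reshape(n, m)                          # optimal matching flow
    return float(np.sum((1.0 - C) * x))              # Eq. (3)

# tiny example with random features
rng = np.random.default_rng(0)
print(emd_from_features(rng.normal(size=(16, 8)), rng.normal(size=(16, 8))))
```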

Figure 1. Federated Knowledge Distillation pipeline with EMD Distance

Training

The distillation loss of the ii-th silo, LMDi\mathcal{L}^i_{\rm MD}, based on the student model loss, is designed as:

LMDi=βT2j=1N(i)(Lc(QSiτ,QTijτ))+(1β)Lc(QSi,ytruei)  (11)\mathcal{L}^i_{\rm MD} = \beta T^2 \sum^{\mathcal{N}(i)}_{j=1} \left( \mathcal{L}_{c}(Q^\tau_{S_i}, Q^\tau_{T_{i\rightarrow j}}) \right) + (1-\beta)\mathcal{L}_{c}(Q_{S_i},y^i_{true})\ \ (11)

where QSQ_S is the standard softmax output of the local student; ytrueiy^i_{true} are the ground-truth labels; β\beta is a hyper-parameter controlling the importance of each loss component; and QSiτ,QTijτQ^\tau_{S_i}, Q^\tau_{T_{i\rightarrow j}} are the softened outputs of the ii-th local student and the jj-th overseas teacher, computed with the same temperature parameter:

Qkτ=exp(lk/T)kexp(lk/T)Q^\tau_k = \frac{\exp(l_k/T)}{\sum_{k} \exp(l_k/T)}

where the logit ll is output by the pre-final layer of both the teacher and student models. In addition, the objective function computed for each jj-th set of contributed transferable weights is weighted by the corresponding EMD to ensure learning convergence.
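
A minimal PyTorch sketch of the distillation loss in Eq. (11), assuming the foundation loss is instantiated as cross-entropy; the temperature and β values used here are placeholders.

```python
import torch
import torch.nn.functional as F

def softened(logits, T):
    """Q^tau: temperature-scaled softmax of the pre-final-layer logits."""
    return F.softmax(logits / T, dim=-1)

def distillation_loss(student_logits, teacher_logits_list, targets, T=4.0, beta=0.5):
    """Sketch of Eq. (11): soft teacher terms plus the hard-label term.

    teacher_logits_list holds the outputs of the overseas experts
    hat{theta}_{i->j}, one entry per in-neighbour j.
    """
    soft_term = 0.0
    log_q_student = F.log_softmax(student_logits / T, dim=-1)
    for teacher_logits in teacher_logits_list:
        q_teacher = softened(teacher_logits, T)
        # cross-entropy between the softened teacher and student distributions
        soft_term = soft_term + torch.sum(-q_teacher * log_q_student, dim=-1).mean()
    hard_term = F.cross_entropy(student_logits, targets)
    return beta * (T ** 2) * soft_term + (1.0 - beta) * hard_term
```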

When the training in all silos is completed in each communication round, local model weights in all silos are aggregated to obtain global weights Θ=i=0N1ϑiθi\Theta = \sum^{N-1}_{i = 0 }\vartheta_i{\theta}_i, which are further utilized for downstream fine-tuning.
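
A small sketch of this aggregation step, assuming the ϑ flags select which silos contribute and that the selected weights are averaged (the normalisation is an assumption of the sketch):

```python
import torch

def aggregate_global_weights(silo_state_dicts, accumulation_status):
    """Theta = sum_i vartheta_i * theta_i, with vartheta_i in {0, 1}.

    Silos flagged 0 are excluded; averaging over the kept silos is an
    assumption of this sketch (a plain sum would not keep parameters at scale).
    """
    kept = [sd for sd, flag in zip(silo_state_dicts, accumulation_status) if flag == 1]
    return {k: torch.stack([sd[k].float() for sd in kept]).mean(dim=0) for k in kept[0]}
```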

Next

In the next part, we will validate the effectiveness of our FedEFM on our Endovascular Intervention Dataset.

FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 2)

In this part, we will outline the data collection process and how the dataset is managed in the training and fine-tuning phases.

Robotic Setup

To collect large-scale X-ray images, we employ a robotic platform and a full-size silicon phantom. A surgeon uses a master device joystick to control a follower robot for cannulating three arteries: the left subclavian (LSA), left common carotid (LCCA), and right common carotid (RCCA). During each catheterization procedure, the surgeon activates the X-ray fluoroscopy using a pedal in the operating room. The experiments are conducted using the Epsilon X-ray Generator. We develop a real-time image grabber to transmit the video feed of the surgical scene to a workstation, a computer-based device equipped with an 8-Core ARM v8.2 64-bit CPU. Overall, we collect and label 4,700 new X-ray images to create our EIPhantom dataset. An overview of our robotic setup is demonstrated in Figure 1.

Figure 1. Data collection with endovascular robot.

Data collection

Apart from the X-ray images collected with our real robot, we also collect the EISimulation dataset of simulated X-ray images from the CathSim simulator. We manually label the data from both the robot and the CathSim simulator so they can be used in downstream tasks. We note that the datasets used to train the foundation model are not used in the downstream endovascular understanding tasks. Figure 2 provides a detailed summary of the datasets used for training and fine-tuning, and Figure 3 visualizes samples from each dataset.

Figure 2. X-ray dataset used in our work.

(Figure panels: CathAction, Vessel12, DRIVE, SenNet, Medical Decathlon, EI Simulator, EI Phantom, RANZCR, and Cath Animal.)

Figure 3. Visualization of datasets used in our work

Motivation

Our goal is to train a federated foundation model for endovascular intervention that incorporates all available types of X-ray data. However, in practice, each hospital (silo) possesses specific data sources that may not be accessible to others. This results in disparities in data corpora across institutions, meaning that certain datasets are present in one hospital but absent in another. Figure 4 illustrates this challenge, which leads to the unseen data issue—an obstacle that must be addressed to ensure effective federated training.

Federated learning preserves data privacy by preventing direct data sharing while allowing the exchange of model weights among hospital silos. To leverage this feature, we introduce the Federated Endovascular Foundation Model (FedEFM), a multishot federated distillation algorithm that utilizes Earth Mover’s Distance (EMD) to facilitate learning. Our approach enables local silo models to learn from neighboring silos and incorporate the acquired knowledge back into their own models through a distillation process. Unlike traditional methods that require consistent label sets across both local and global models trained within federated silos, devices, or servers, our method ensures seamless federated training without requiring hospitals to share their datasets—further enhancing data privacy. Additionally, once trained, the foundational model’s weights provide a valuable initialization for downstream tasks.

Figure 4. Unseen data issue

Next

In the next part, we will explore the technical method to train our FedEFM.

FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 1)

In endovascular surgery, the precise identification of catheters and guidewires in X-ray images is essential for reducing intervention risks. However, accurately segmenting catheter and guidewire structures is challenging due to the limited availability of labeled data. Foundation models offer a promising solution by enabling the collection of similar-domain data to train models whose weights can be fine-tuned for downstream tasks. Nonetheless, large-scale data collection for training is constrained by the necessity of maintaining patient privacy. This paper proposes a new method to train a foundation model in a decentralized federated learning setting for endovascular intervention. To ensure the feasibility of the training, we tackle the unseen data issue using differentiable Earth Mover's Distance within a knowledge distillation framework. Once trained, our foundation model's weights provide valuable initialization for downstream tasks, thereby enhancing task-specific performance. Intensive experiments show that our approach achieves new state-of-the-art results, contributing to advancements in endovascular intervention and robotic-assisted endovascular surgery, while addressing the critical issue of data sharing in the medical domain.

Introduction

Endovascular surgery is a minimally invasive procedure for diagnosing and treating vascular diseases, offering advantages such as reduced trauma and quick recovery times. During endovascular surgery, surgeons use catheters and guidewires to access arteries. However, this procedure also entails risks such as potential vessel wall damage. Precise identification of catheters and guidewires within X-ray images is crucial for patient safety. The rise of deep learning has played a vital role in improving surgical precision and enhancing patient safety in endovascular intervention. However, accurately segmenting intricate catheters and guidewires in X-ray images remains challenging due to the limited quantity of labeled data.

Recently, vision-language models have gained significant attention from researchers across various fields. Foundation models like CLIP and ALIGN have demonstrated strong capabilities in cross-modal alignment and zero-shot learning tasks. In the medical field, EndoFM and LVM-Med have been introduced as foundation models designed to handle medical data across multiple modalities. While these models perform well on downstream tasks, they typically assume that data can be centrally collected and trained, which is often difficult in medical applications.

Gathering large-scale medical data is particularly challenging due to privacy concerns. To address this issue, federated learning has emerged as a potential solution, allowing models to be trained collaboratively across multiple hospital silos without requiring direct access to patient data.

Despite its benefits, federated learning faces challenges such as ensuring stable convergence across different silos and handling heterogeneous data. In endovascular interventions, these challenges arise primarily from variations in data collected from different sources, leading to domain gaps in X-ray images. As illustrated in Figure 1, X-ray images from various endovascular datasets differ significantly. Additionally, due to privacy restrictions, datasets containing real human X-ray images tend to be smaller compared to those obtained from animal models, silicon phantoms, or simulated environments.

(Figure panels: Vessel12, Phantom X-ray, DRIVE, Animal X-ray, Human X-ray, and Simulation X-ray.)

Figure 1. Different types of endovascular X-ray data. We aim to train a foundation model which can leverage diverse X-ray data from multiple hospitals (silos)

In this work, our goal is to train a foundation model using diverse endovascular datasets with federated learning. Since we aim to use all possible endovascular data (i.e., from humans, animals, phantoms, etc.), there is an unseen data problem between silos. To tackle this problem, we propose the Federated Endovascular Foundation Model (FedEFM), a new distillation algorithm using differentiable Earth Mover's Distance (EMD). Once trained, FedEFM provides crucial initializations for downstream tasks, thereby enhancing task-specific performance. Our approach outperforms existing methods and holds significant potential for application in robotic-assisted endovascular surgery, while effectively maintaining data privacy.

Our contributions can be summarized as follows:

  • We propose a new method to train a federated endovascular foundation model with unseen data using a multishot distillation technique.
  • We propose the Multishot Foundation Federated Distillation algorithm (MFD), powered by differentiable Earth Mover's Distance, to address the unseen label corpus issue and ensure the feasibility of learning for the foundation model.
  • We collect new datasets for training endovascular foundation models. Our proposed model is verified under several downstream tasks.

Related Works

Endovascular Intervention

Endovascular intervention has greatly improved the treatment of vascular diseases such as aneurysms and embolisms using X-ray fluoroscopy. However, these procedures still encounter challenges, including low contrast, complex anatomical structures, and the scarcity of expert-annotated data. Recent studies have aimed to address these issues through advancements in imaging technology and machine learning techniques. For instance, researchers have proposed an enhanced U-Net-based approach for localizing guidewire endpoints in X-ray images. More recently, FW-Net has been introduced to improve catheter segmentation by utilizing frame-to-frame temporal consistency. While many studies focus on conventional tasks, few have explored the development of foundation models for endovascular intervention. A key obstacle is the strict requirement for patient data privacy, which significantly limits the ability to train such models.

Figure 2. Endovascular procedure.

Federated Learning

Federated learning has emerged as a promising solution for training machine learning models on decentralized data while preserving data privacy. This approach is especially valuable in the medical field, where data sensitivity and confidentiality are critical concerns. Although numerous studies have investigated the use of federated learning for training foundation models in healthcare, privacy challenges can be mitigated but not entirely eliminated. A major hurdle is the heterogeneity and non-IID nature of medical data across different institutions. Moreover, the issue of unseen data, where certain data types appear in some datasets but are missing in others, further complicates model training and generalization.

Figure 3. Decentralized federated learning setup

Knowledge Distillation with Earth Mover’s Distance.

Knowledge distillation involves transferring knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). In the context of federated learning, distillation can be used to enable local models to learn from aggregated global models without sharing raw data. The Earth Mover's Distance (EMD), also known as the Wasserstein distance, measures the dissimilarity between two probability distributions and is particularly useful for comparing distributions that do not have overlapping support. By leveraging the differentiable EMD, it is possible to align distributions of labels across different models, facilitating better model convergence and knowledge transfer. In this paper, we leverage EMD within a distillation training process to address the unseen label data issue when training endovascular foundation models in federated scenarios.

Next

In the next part, we will explore the process of collecting and managing our dataset for training the foundation model.

Guide3D A Bi-planar X-ray Dataset for 3D Shape Reconstruction (Part 4)

We evaluate our proposed dataset, Guide3D, through a structured experimental analysis, as follows: i) initially, we assess the dataset’s validity, focusing on reprojection errors and their distribution across the dataset to understand its accuracy; ii) we then explore the applicability of Guide3D in a 3D reconstruction task; and iii) finally, we benchmark several segmentation algorithms to assess their performance on Guide3D, providing insights into the dataset’s utility.

Dataset Validation

Our analysis revealed a non-uniform distribution of reprojection errors across the dataset, with the highest variability and errors concentrated at the proximal end of the guidewire reconstructions. Figure 1 shows the reprojection error patterns for both Camera A and Camera B. For Camera A, mean errors increase from approximately 6 px to a peak of 20 px, with standard deviations rising from 5 px to 11 px, indicating growing inaccuracies and variability over time. Significant fluctuations around indices 25 to 27 highlight periods of particularly high error. For Camera B, mean errors exhibit an initial peak of 9 px at index 1, followed by fluctuations that decrease towards the end. The standard deviations for Camera B start high at 11 px and decrease over time, reflecting a pattern of high initial variability that stabilizes later. These patterns are consistent with the inherent flexibility of the guidewire, which can form complex shapes such as loops.

Figure 1. Guidewire Reconstruction Error Analysis: (Left) Illustrates the distribution of reprojection errors, noting higher variability and peak errors in the mid-sections and reduced errors at the extremities. (Right) Presents the results of reconstruction validation.

Furthermore, we conducted a validation procedure using CathSim, incorporating the aortic arch model described in the next subsection and a guidewire of similar diameter and properties. For sampling, we employed the soft actor-critic (SAC) algorithm with segmented guidewires and kinematic data, producing realistic validation samples. Evaluation metrics included maximum Euclidean distance (MaxED) at 2.880 ± 0.640 mm, mean error in tip tracking (METE) at 1.527 ± 0.877 mm, and mean error related to the robot’s shape (MERS) at 0.001 ± 0.000. These results demonstrate the method’s precision.

Guidewire Prediction Results

We now demonstrate the capability of the introduced network and highlight the importance of the proposed dataset. We examine the network prediction in the following manner: 1) we first conduct an analysis between the predicted and reconstructed curve by employing piecewise metrics, and 2) we showcase the reprojection error.

Shape Prediction Errors: Table 1 presents the comparison of different metrics for shape prediction accuracy. We quantify the shape differences using the following metrics: 1) Maximum Euclidean Distance (MaxED), 2) Mean Error in Tip Tracking (METE), and 3) Mean Error in Robot Shape (MERS). For all the metrics, the shape of the guidewire, represented as a 3D curve C(u)\mathbf{C}(u), is sampled at equidistant Δu\Delta u intervals along the arclength parameter uu. Therefore, the metrics represent the pointwise discrepancies between the two shapes along the curve’s arclength.
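
A small sketch of how these pointwise metrics can be computed from two equally sampled 3D curves; treating METE as the error at the tip sample of a single frame is an assumption of the sketch.

```python
import numpy as np

def shape_metrics(pred, gt):
    """Pointwise shape metrics between two 3D curves sampled at equal
    arclength intervals (both arrays of shape (M, 3), tip first).

    Returns MaxED (maximum Euclidean distance), METE (error at the tip
    sample, an assumption of this sketch), and MERS (mean error over the
    whole shape), all in the units of the input coordinates (e.g., mm).
    """
    d = np.linalg.norm(pred - gt, axis=1)   # per-point distances
    return d.max(), d[0], d.mean()

# tiny example with synthetic curves
gt = np.stack([np.linspace(0, 10, 50), np.zeros(50), np.zeros(50)], axis=1)
pred = gt + np.random.default_rng(0).normal(scale=0.5, size=gt.shape)
print(shape_metrics(pred, gt))
```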

The results indicate that the spherical representation consistently outperforms the Cartesian representation across all metrics. Specifically, the Maximum Euclidean Distance (MaxED) shows a lower error in the spherical representation (6.88 ± 5.23 mm) compared to the Cartesian representation (10.00 ± 4.64 mm). Similarly, the Mean Error in Tip Tracking (METE) is significantly lower in the spherical representation (3.28 ± 2.59 mm) than in the Cartesian representation (6.93 ± 3.94 mm). For the Mean Error in Robot Shape (MERS), the spherical representation also demonstrates a reduced error (4.54 ± 3.67 mm) compared to the Cartesian representation (5.33 ± 2.73 mm). Lastly, the Fréchet distance shows a smaller error for the spherical representation (6.70 ± 5.16 mm) compared to the Cartesian representation (8.95 ± 4.37 mm). These results highlight the advantage of using the spherical representation for more accurate shape prediction.

Table 1 Shape Comparison (mm).

Shape Comparison Visualization: Figure 2a showcases two 3D plots from different angles, comparing the ground-truth guidewire shape to the shape predicted by the network. The network accurately predicts the guidewire shape, even in the presence of a loop and self-obstruction in the image, and the predicted shape aligns closely with the actual configuration of the guidewire. Notably, the proximal end shows a larger error than the distal end, with discrepancies from the true guidewire shape ranging from roughly 2 mm at the distal end to about 5 mm at the proximal end. The network achieves this using only consecutive single-plane images. Subsequently, the 3D points are reprojected onto the original images, as illustrated in Figure 2b.

Figure 2. The figure illustrates the reconstruction similarity of the guidewire when reprojected onto the images. It demonstrates the network’s capability to accurately predict the guidewire shape, even in the presence of noticeable angles, highlighting the robustness of the prediction model.

Segmentation Results

We demonstrate Guide3D’s potential to advance guidewire segmentation research by evaluating the performance of three state-of-the-art network architectures: UNet (learning rate: 1×1051 \times 10^{-5}, 135 epochs), TransUnet (integrating ResNet50 and Vision Transformer (ViT-B-16), learning rate: 0.01, 199 epochs), and SwinUnet (Swin Transformer architecture, learning rate: 0.01, 299 epochs). Performance metrics included the Dice coefficient (DiceM), mean Intersection over Union (mIoU), and Jaccard index, detailed in Table 2. The results indicate that UNet achieved a DiceM of 92.25, mIoU of 36.60, and Jaccard index of 86.57. TransUnet outperformed with a DiceM of 95.06, mIoU of 41.20, and Jaccard index of 91.10. SwinUnet recorded a DiceM of 93.73, mIoU of 38.58, and Jaccard index of 88.55. These findings benchmark the dataset’s performance and suggest potential for future enhancements. Despite these promising results, the presence of loops and occlusions within the guidewire indicates that polyline prediction could significantly improve task utility.

Table 2 Segmentation Results.

Discussion and Conclusion

This paper introduces a new dataset, Guide3D, for segmentation and 3D reconstruction of flexible, curved endovascular tools. Extensive experiments demonstrate the dataset’s value; yet several limitations must be acknowledged. Firstly, our dataset lacks clinical real human data due to the complexity and regulatory challenges of acquiring such data. Our standardized platform, however, aims to enable further research, providing a stepping stone towards clinical practice.

Additionally, the dataset primarily focuses on synthetic and experimental scenarios, which may not fully capture the variability and unpredictability of real-world clinical environments. While this controlled setting aids initial algorithm development and benchmarking, further validation with clinical data is necessary to ensure the robustness and generalizability of the proposed methods.

Moreover, the guidewire’s flexibility and the presence of loops and occlusions present significant challenges for segmentation and reconstruction tasks. Our dataset includes these complexities to push the boundaries of current methodologies, but future work should explore more advanced techniques.

Our dataset accommodates both video and image-based approaches, providing a versatile resource to facilitate the translation of these technologies into clinical settings. Our objective is to bridge the disparity between research developments and clinical application by establishing a standardized framework for evaluating the efficacy of various methodologies. Our code and dataset will be made publicly available.

Guide3D A Bi-planar X-ray Dataset for 3D Shape Reconstruction (Part 3)

Utilizing the Guide3D dataset, we build a benchmark for the shape prediction task, a critical component in endovascular intervention. Accurate shape prediction of the guidewire is essential for successful navigation and intervention. Here, we introduce a novel shape prediction network designed to predict the guidewire shape from a sequence of monoplanar images. This approach leverages deep learning to learn spatio-temporal correlations from a static camera observing a dynamic scene. Unlike conventional reconstruction methods that require biplanar images, our network uses a sequence of images to extract temporal information, allowing it to map a single image IA\mathbf{I}_A to the 3D guidewire curve C(u)\mathbf{C}(\mathbf{u}). By adopting this deep learning approach, we aim to simplify the shape prediction process while maintaining high accuracy. This method has the potential to enhance endovascular navigation by providing real-time, accurate predictions of the guidewire shape, ultimately improving procedural outcomes and reducing reliance on specialized equipment.

Figure 1. Network Key Components: The figure illustrates the essential components of the proposed model. a) Spherical coordinates (r,θ,ϕ)(r, \theta, \phi) are used for predicting the guidewire shape. b) The model predicts the 3D shape of a guidewire from image sequences It\mathbf{I}_t. A Vision Transformer (ViT) extracts spatial features zt\mathbf{z}_t, which a Gated Recurrent Unit (GRU) processes to capture temporal dependencies, producing hidden states ht\mathbf{h}_t. The final hidden state drives three prediction heads: the Tip Prediction Head for the 3D tip position pR3\mathbf{p} \in \mathbb{R}^3, the Spherical Offset Prediction Head for coordinate offsets (Δϕ,Δθ)(\Delta \phi, \Delta \theta), and the Stop Prediction Head for terminal point probability S\mathbf{S}.

Spherical Coordinates Representation

Predicting 3D points directly can be challenging due to the high degree of freedom. To mitigate this, we use spherical coordinates, which offer significant advantages over Cartesian coordinates for guidewire shape prediction. Spherical coordinates, as represented in Fig. 1a, are defined by the radius rr, polar angle θ\theta, and azimuthal angle ϕ\phi. They provide a more natural representation for the position and orientation of points along the guidewire, which is typically elongated and curved.

Mathematically, a point in spherical coordinates (r,θ,ϕ)(r, \theta, \phi) can be converted to Cartesian coordinates (x,y,z)(x, y, z) using the transformations:

x=rsinθcosϕ,y=rsinθsinϕ,z=rcosθ.x = r \sin \theta \cos \phi, \quad y = r \sin \theta \sin \phi, \quad z = r \cos \theta.

This conversion simplifies the modeling of angular displacements and rotations, as spherical coordinates directly encode directional information.

Predicting angular displacements (Δθ,Δϕ)(\Delta \theta, \Delta \phi) relative to a known radius rr aligns with the physical constraints of the guidewire, facilitating more accurate and interpretable shape predictions. By predicting an initial point (tip position) and representing subsequent points as offsets in Δϕ\Delta \phi and Δθ\Delta \theta while keeping rr fixed, this method simplifies shape comparison and reduces the parameter space. This approach enhances the model’s ability to capture the guidewire’s spatial configuration and improves overall prediction performance.
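
The following sketch illustrates the spherical-to-Cartesian conversion and one possible way to rebuild a 3D polyline from a predicted tip and angular offsets at a fixed radius; the accumulation rule used here is an assumption of the sketch, not the exact decoding used by the model.

```python
import numpy as np

def spherical_to_cartesian(r, theta, phi):
    """(r, theta, phi) -> (x, y, z) using the conventions above."""
    x = r * np.sin(theta) * np.cos(phi)
    y = r * np.sin(theta) * np.sin(phi)
    z = r * np.cos(theta)
    return np.stack([x, y, z], axis=-1)

def curve_from_offsets(tip, delta_theta, delta_phi, r):
    """Rebuild a 3D polyline from a predicted tip and angular offsets.

    Each new point is displaced from the previous one by a fixed-radius
    step whose direction is accumulated from (delta_theta, delta_phi);
    this reconstruction rule is an assumption of the sketch.
    """
    points = [np.asarray(tip, dtype=float)]
    theta, phi = 0.0, 0.0
    for dt, dp in zip(delta_theta, delta_phi):
        theta += dt
        phi += dp
        points.append(points[-1] + spherical_to_cartesian(r, theta, phi))
    return np.stack(points)

# tiny example: gently curving wire starting at the origin
curve = curve_from_offsets(tip=[0, 0, 0],
                           delta_theta=np.full(20, 0.05),
                           delta_phi=np.full(20, 0.02),
                           r=1.0)
print(curve.shape)  # (21, 3)
```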

Network Architecture

The proposed model (shown in Fig. 1b) addresses the problem of predicting the 3D shape of a guidewire from a sequence of images. Each image sequence captures the guidewire from different time steps IA,t\mathbf{I}_{A,t}, and the goal is to infer the continuous 3D shape Ct(ut)\mathbf{C}_t(\mathbf{u}_t). This many-to-one prediction task is akin to generating a variable-length sequence from variable-length input sequences, a technique commonly utilized in fields such as machine translation and video analysis.

To achieve this, the input pipeline consists of a sequence of images depicting the guidewire. A Vision Transformer (ViT), pre-trained on ImageNet, is employed to extract high-dimensional spatial feature representations from these images. The ViT generates a feature map zt\mathbf{z}_t for each frame. These feature maps are then fed into a Gated Recurrent Unit (GRU) to capture the temporal dependencies across the image sequence. The GRU processes the feature maps zt\mathbf{z}_t from consecutive time steps, producing a sequence of hidden states ht\mathbf{h}_t. Formally, the GRU operation at time step tt is defined as:

ht=GRU(zt,ht1).\mathbf{h}_t = \text{GRU}(\mathbf{z}_t, \mathbf{h}_{t-1}).

The final hidden state ht\mathbf{h}_t from the GRU is used by three distinct prediction heads, each tailored for a specific aspect of the guidewire shape prediction: the Tip Prediction Head, responsible for predicting the 3D coordinates of the guidewire’s tip through a fully connected layer that maps the hidden state ht\mathbf{h}_t to a Cartesian anchoring point pR3\mathbf{p} \in \mathbb{R}^3; the Spherical Offset Prediction Head, which predicts the spherical coordinate offsets (Δϕ,Δθ)(\Delta \phi, \Delta \theta) for points along the guidewire with a fixed radius rr; and the Stop Prediction Head, which outputs the probability distribution indicating the terminal point of the guidewire by using a softmax layer to produce a probability tensor S\mathbf{S}, where each element Sj\mathbf{S}_j indicates the probability of the jj-th point being the terminal point.
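
A minimal PyTorch sketch of this architecture is given below. The `encoder` argument stands in for the ImageNet-pretrained ViT (any backbone returning one feature vector per frame works here), and the feature, hidden, and point dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GuidewireShapeNet(nn.Module):
    """Sketch of the ViT + GRU model with three prediction heads (Fig. 1b)."""

    def __init__(self, encoder, feat_dim=768, hidden_dim=512, num_points=64):
        super().__init__()
        self.encoder = encoder                                    # z_t = encoder(I_t)
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.tip_head = nn.Linear(hidden_dim, 3)                  # tip position p in R^3
        self.offset_head = nn.Linear(hidden_dim, 2 * num_points)  # (delta_phi, delta_theta)
        self.stop_head = nn.Linear(hidden_dim, num_points)        # terminal-point logits

    def forward(self, frames):                       # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        z = torch.stack([self.encoder(frames[:, t]) for t in range(T)], dim=1)
        _, h = self.gru(z)                           # final hidden state
        h = h[-1]                                    # (B, hidden_dim)
        tip = self.tip_head(h)
        offsets = self.offset_head(h).view(B, -1, 2)
        stop = self.stop_head(h).softmax(dim=-1)     # terminal-point probabilities S
        return tip, offsets, stop
```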

Loss Function

The custom loss function for training the model combines multiple components to handle the point-wise tip error, variable guidewire length (stop criteria), and tip position predictions. The overall loss function Ltotal\mathcal{L}_{\text{total}} is defined as:

Ltotal=1Ni=1N(λtipp^ipi2+λoffset((ϕ^iϕi)2+(θ^iθi)2)+λstop(silog(s^i)(1si)log(1s^i)))\mathcal{L}_{\text{total}} = \frac{1}{N} \sum_{i=1}^N \bigg( \lambda_{\text{tip}} \left \| \hat{\mathbf{p}}_i - \mathbf{p}_i \right \|^2 + \lambda_{\text{offset}} \big( (\hat{\boldsymbol{\phi}}_i - \boldsymbol{\phi}_i)^2 + (\hat{\boldsymbol{\theta}}_i - \boldsymbol{\theta}_i)^2 \big) + \lambda_{\text{stop}} \big( -\mathbf{s}_i \log (\hat{\mathbf{s}}_i) - (1 - \mathbf{s}_i) \log (1 - \hat{\mathbf{s}}_i) \big) \bigg)

where NN is the number of samples, and λtip\lambda_{\text{tip}}, λoffset\lambda_{\text{offset}}, and λstop\lambda_{\text{stop}} are weights that balance the contributions of each loss component. The tip prediction loss (Ltip\mathcal{L}_{\text{tip}}) uses mean squared error (MSE) to ensure accurate 3D tip coordinates. The spherical offset loss (Loffset\mathcal{L}_{\text{offset}}) also uses MSE to align predicted and ground truth angular offsets, capturing the guidewire’s shape. The stop prediction loss (Lstop\mathcal{L}_{\text{stop}}) employs binary cross-entropy (BCE) to accurately predict the guidewire’s endpoint.
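
A compact PyTorch sketch of this combined loss; the λ weights used here are placeholders rather than the values used in the paper.

```python
import torch.nn.functional as F

def total_loss(pred_tip, gt_tip, pred_off, gt_off, pred_stop, gt_stop,
               lambda_tip=1.0, lambda_offset=1.0, lambda_stop=1.0):
    """Sketch of the combined training loss.

    pred_off / gt_off: (B, N, 2) angular offsets (delta_phi, delta_theta);
    pred_stop: predicted terminal-point probabilities in (0, 1);
    gt_stop: binary terminal-point labels of the same shape.
    """
    l_tip = F.mse_loss(pred_tip, gt_tip)                      # MSE on the 3D tip
    l_offset = F.mse_loss(pred_off, gt_off)                   # MSE on angular offsets
    l_stop = F.binary_cross_entropy(pred_stop, gt_stop)       # BCE on the endpoint
    return lambda_tip * l_tip + lambda_offset * l_offset + lambda_stop * l_stop
```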

Training Details

The model was trained end-to-end using the loss from Equation above. The NAdam optimizer was used with an initial learning rate of 1×1041 \times 10^{-4}. Additionally, a learning rate scheduler was employed to adjust the learning rate dynamically based on the validation loss. Specifically, the ReduceLROnPlateau scheduler was configured to reduce the learning rate by a factor of 0.1 if the validation loss did not improve for 10 epochs. The model was trained for 400 epochs, with early stopping based on the validation loss to further prevent overfitting.
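
A short sketch of this training configuration using PyTorch's NAdam and ReduceLROnPlateau; the stand-in model, placeholder validation loss, and early-stopping patience budget are assumptions of the sketch.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 3)                    # stand-in for the shape-prediction network
optimizer = torch.optim.NAdam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10)

best_val, patience_left = float("inf"), 30          # early-stopping budget is an assumption
for epoch in range(400):
    val_loss = torch.rand(1).item()                 # placeholder for the real validation loss
    scheduler.step(val_loss)                        # reduce LR if the loss plateaus for 10 epochs
    if val_loss < best_val:
        best_val, patience_left = val_loss, 30
    else:
        patience_left -= 1
        if patience_left == 0:                      # early stopping on the validation loss
            break
```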

Next

In the next part, we will validate the effectiveness of Guidewire Shape Prediction dataset and methodology.

Guide3D A Bi-planar X-ray Dataset for 3D Shape Reconstruction (Part 2)

We propose the Guide3D dataset, a comprehensive resource specifically designed to advance 3D reconstruction and segmentation in endovascular navigation. This dataset addresses key limitations in the field, such as the scarcity of high-quality, publicly accessible datasets, by providing a diverse collection of real and synthetic imaging data. Guide3D includes detailed annotations for guidewire and catheter segmentation, alongside multi-view fluoroscopic data that supports accurate 3D modeling. By offering a standardized platform for algorithm development and evaluation, Guide3D aims to bridge the gap between research and clinical practice, facilitating improvements in precision, visualization, and tool tracking during endovascular procedures. Through this dataset, we seek to accelerate innovation in medical imaging, contributing to safer and more effective interventions.

Data Collection Setup

X-ray System. Our setup employed a bi-planar X-ray system equipped with 60 kW Epsilon X-ray generators and 16-inch image intensifier tubes by Thales, featuring dual focal spot Varian X-ray tubes for high-definition imaging. The system included Ralco automatic collimators for precise alignment and exposure, with calibration achieved through the use of acrylic mirrors and geometric alignment grids.

Anatomical Models. We utilized a half-body vascular phantom model from Elastrat Sarl Ltd., Switzerland, enclosed in a transparent box and integrated into a closed water circuit to simulate blood flow. Made from soft silicone and equipped with compact continuous flow pumps, it replicates human blood flow dynamics. The design is based on detailed postmortem vascular casts, ensuring anatomical accuracy reflective of human vasculature, facilitating realistic vascular simulations.

Figure 1. Dataset Overview: Guide3D contains 8,746 manually annotated frames from two views for 3D reconstruction (left), from which the reconstruction is derived (right).

Surgical Tools. To enhance our dataset, we navigated complex vascular structures using two types of guidewires commonly used in real-world endovascular surgery. The first, the Radifocus™ Guide Wire M Stiff Type (Terumo Ltd.), is made from nitinol with a polyurethane-tungsten coating. It measures 0.89 mm in diameter and 260 cm in length, with a 3 cm angled tip, designed for seeking, dissecting, and crossing lesions. The second, the Nitrex Guidewire (Nitrex Metal Inc.), also made of nitinol, features a gold-tungsten straight tip for enhanced radiopacity in fluoroscopic visualization. It has a diameter of 0.89 mm and a length of 400 cm, with a 15 cm tip, and is generally used for accessing or maintaining position during catheter exchanges. Both guidewires were selected to reflect real-world usage and to diversify the data in our dataset.

Figure 2. Materials: a) Overall setup & endovascular phantom, b) Radifocus (angled) guidewire, and c) Nitrex (straight) guidewire.

Data Acquisition, Labeling, and Statistics

Using the materials described above, we compiled a dataset of 8,746 high-resolution samples (1,024 × 1,024 pixels). This dataset includes 4,373 paired instances, both with and without a simulated blood flow medium. Specifically, it consists of 6,136 samples from the Radifocus guidewire and 2,610 from the Nitrex guidewire, providing a solid foundation for automated guidewire tracking in bi-planar scanner images. Manual annotation was carried out using the Computer Vision Annotation Tool (CVAT), where polylines were created to accurately track the dynamic path of the guidewires. The polyline representation was chosen because the guidewire's structure often results in overlapping sections, making a segmentation mask unsuitable. In contrast, a polyline effectively captures the looping nature of the guidewire, offering greater accuracy.

As shown in Table 1, the dataset includes 3,664 instances of angled guidewires with fluid and 484 without, while straight guidewires are represented by 2,472 instances with fluid and 2,126 without. This distribution reflects a variety of procedural contexts. All 8,746 images in the dataset are accompanied by manual segmentation ground truth, facilitating the development of algorithms that require segmentation maps as reference data.

Table 1. Dataset Composition Overview.

Calibration

We extract the camera parameters using a traditional undistortion and calibration method. Undistortion is first achieved with a local weighted mean (LWM) algorithm, using a perforated steel sheet with a hexagonal pattern as a framing reference, and applying a blob detection algorithm to precisely identify distortion points. This approach establishes correspondences between distorted and undistorted positions, allowing for accurate distortion correction.

Following this, a semi-automatic calibration step is performed for marker identification, and the random sampling consensus (RANSAC) method is used to ensure robustness in computing the projection matrix and deriving the intrinsic and extrinsic camera parameters. The calibration process is further refined through direct linear transformation (DLT) and non-linear optimization, utilizing multiple poses of the calibration object to optimize the overall camera setup. Figure 3 illustrates the calibration process.

Figure 3. Fluoroscopic Calibration: a) Undistortion grid application, and b) Point identification on calibration frame.

Guidewire Reconstruction

Given polyline representations of a curve in both planes, the reconstruction process begins by parameterizing these curves using B-Spline interpolation. Each curve is expressed as a function of the cumulative distance along its path. Let CA(uA)\mathbf{C}_A(\mathbf{u}_A) and CB(uB)\mathbf{C}_B(\mathbf{u}_B) represent the parameterized B-Spline curves in their respective planes, where uA\mathbf{u}_A and uB\mathbf{u}_B are the normalized arc-length parameters. The corresponding uB\mathbf{u}_B for a given uA\mathbf{u}_A is found using epipolar geometry. Once the corresponding points CA(uAi)\mathbf{C}_A(\mathbf{u}_A^i) and CB(uBi)\mathbf{C}_B(\mathbf{u}_B^i) are identified, their 3D coordinates Pi\mathbf{P}^i are computed by triangulation, resulting in a set of 3D points {Pi}i=1M\{\mathbf{P}^i\}_{i=1}^{M}, where MM is the total number of sampled points. This effectively reconstructs the original curve in 3D space.

To retrieve the fundamental matrix F\mathbf{F}, which describes the relationship between points in Image A (IA\mathbf{I}_A) and Image B (IB\mathbf{I}_B), the condition xBTFxA=0\mathbf{x}_B^T \mathbf{F} \mathbf{x}_A = 0 must hold for corresponding points xA\mathbf{x}_A in IA\mathbf{I}_A and xB\mathbf{x}_B in IB\mathbf{I}_B. Using the projection matrices PA\mathbf{P}_A and PB\mathbf{P}_B derived from the calibration process, the fundamental matrix can be calculated as follows:

F=[eB]×PBPA+\mathbf{F} = [\mathbf{e}_B]_\times \mathbf{P}_B \mathbf{P}_A^+

Here, eB\mathbf{e}_B is the epipole in Image B, defined as eB=PBCA\mathbf{e}_B = \mathbf{P}_B \mathbf{C}_A, with CA\mathbf{C}_A being the camera center of PA\mathbf{P}_A. The skew-symmetric matrix of the epipole eB\mathbf{e}_B is represented by:

[eB]×=[0eB3eB2eB30eB1eB2eB10][\mathbf{e}_B]_\times = \begin{bmatrix} 0 & -e_{B3} & e_{B2} \\ e_{B3} & 0 & -e_{B1} \\ -e_{B2} & e_{B1} & 0 \end{bmatrix}

Where eB=(eB1,eB2,eB3)T\mathbf{e}_B = (e_{B1}, e_{B2}, e_{B3})^T, and PA+\mathbf{P}_A^+ is the pseudoinverse of the projection matrix PA\mathbf{P}_A. The fundamental matrix F\mathbf{F} encapsulates the epipolar geometry between the two views, ensuring that corresponding points xA\mathbf{x}_A and xB\mathbf{x}_B lie on their respective epipolar lines.
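
A small NumPy sketch of this construction, recovering the camera centre of view A as the null vector of its projection matrix and forming F = [e_B]_x P_B P_A^+ exactly as written above:

```python
import numpy as np

def fundamental_from_projections(P_A, P_B):
    """F = [e_B]_x P_B P_A^+ for two calibrated views (3x4 projection matrices)."""
    # camera centre of view A: the right null vector of P_A (homogeneous 4-vector)
    _, _, Vt = np.linalg.svd(P_A)
    C_A = Vt[-1]
    e_B = P_B @ C_A                                  # epipole in image B
    e_cross = np.array([[0.0, -e_B[2], e_B[1]],      # skew-symmetric matrix [e_B]_x
                        [e_B[2], 0.0, -e_B[0]],
                        [-e_B[1], e_B[0], 0.0]])
    return e_cross @ P_B @ np.linalg.pinv(P_A)
```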

The matching phase begins by uniformly sampling points along the curve CA(uA)\mathbf{C}_A(u_A) at intervals ΔuA\Delta u_A. For each sampled point xA=CA(uA)x_A = \mathbf{C}_A(u_A), we project the epiline lB=FxAl_B = F x_A into Image B. We then determine the intersection of the epiline lBl_B with the curve CB(uB)\mathbf{C}_B(u_B), thereby obtaining the corresponding parameter uBu_B for each uAu_A.

Due to errors in the projection matrices PAP_A and PBP_B, there are instances where the epiline lBl_B does not intersect with any part of the curve CB\mathbf{C}_B. To address this, we fit a monotonic function fA(uA)uBf_A(u_A) \rightarrow u_B using a Piecewise Cubic Hermite Interpolating Polynomial (PCHIP), thus interpolating the missing intersections. The matching process is visualized in Fig. 4.
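
A small SciPy sketch of this monotonic fit, where missing epiline intersections are marked as NaN and filled with PCHIP; the NaN convention is an assumption of the sketch.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def match_parameters(u_A_samples, u_B_matches):
    """Fit the monotonic map f_A(u_A) -> u_B and fill missing matches.

    u_B_matches contains np.nan where the epiline l_B = F x_A did not
    intersect the curve C_B; those entries are interpolated with PCHIP.
    """
    u_A = np.asarray(u_A_samples, dtype=float)
    u_B = np.asarray(u_B_matches, dtype=float)
    ok = ~np.isnan(u_B)
    f = PchipInterpolator(u_A[ok], u_B[ok])          # shape-preserving monotone interpolant
    return f(u_A)

# tiny example: two missing intersections are filled in
u_A = np.linspace(0, 1, 6)
u_B = np.array([0.0, 0.15, np.nan, 0.55, np.nan, 1.0])
print(match_parameters(u_A, u_B))
```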

Figure 4. Point Matching Process. Sampled points from image IAI_A (CA(uA)\mathbf{C}_A(u_A)) and their corresponding epilines lAl_A on image IBI_B are matched with their counterparts CB(uB)\mathbf{C}_B(u_B). The epilines for CB(uB)\mathbf{C}_B(u_B) are then computed and displayed on image IAI_A.

Utility of Guide3D Dataset for the Research Community

Guide3D advances endovascular imaging by providing a bi-planar fluoroscopic dataset for segmentation and 3D reconstruction, serving as an open-source benchmark. It enables precise algorithm comparisons for segmentation and facilitates method development in 3D reconstruction through the use of bi-planar imagery. With video data, Guide3D supports video-based methods, leveraging temporal dimensions for dynamic analysis. This enriches the segmentation and reconstruction capabilities, while also aligning with the procedural nature of endovascular interventions. This versatility highlights Guide3D's pivotal role in advancing endovascular imaging.

Next

In the next part, we will explore Guidewire Shape Prediction methodology.

Guide3D A Bi-planar X-ray Dataset for 3D Shape Reconstruction (Part 1)

Endovascular surgical tool reconstruction is an important factor in advancing endovascular tool navigation, a crucial step in endovascular surgery. However, the lack of publicly available datasets significantly restricts the development and validation of novel machine learning approaches. Moreover, due to the need for specialized equipment such as biplanar scanners, most previous research employs monoplanar fluoroscopic technologies, capturing data from only a single view and significantly limiting reconstruction accuracy.

To bridge this gap, we introduce Guide3D, a bi-planar X-ray dataset for 3D reconstruction. The dataset is a collection of high-resolution, bi-planar, manually annotated fluoroscopic videos captured in real-world settings. Validating our dataset within a simulated environment reflective of clinical settings confirms its applicability for real-world applications. Furthermore, we propose a new benchmark for guidewire shape prediction, serving as a strong baseline for future work. The proposal not only addresses an essential need by offering a platform for advancing segmentation and 3D reconstruction techniques but also aids the development of more accurate and efficient endovascular surgery interventions.

Introduction

Minimally invasive surgery has revolutionized endovascular interventions, offering less invasive options with shorter recovery times. The success of these procedures depends on the precise navigation and manipulation of instruments such as guidewires and catheters. Typically, 2D visualization methods are used for guidance, with monoplanar fluoroscopy being the most common due to its minimal disruption to surgical workflows and relatively affordable cost. However, despite their widespread use, conventional imaging techniques have significant limitations, with one of the primary challenges being the lack of depth perception. This issue complicates the accurate visualization of surgical instruments, increasing the risk of excessive contact with arterial walls, which can compromise patient safety and the effectiveness of the procedure.

In endovascular interventions, depth perception is largely achieved through multi-view imaging systems, such as biplanar scanners, which allow shape reconstruction by combining images from multiple angles and employing epipolar geometry-based reconstruction. However, two major challenges hinder the broader adoption and effectiveness of these systems: (i) the difficulty of accurately segmenting images for successful shape reconstruction, exacerbated by the scarcity of datasets needed to evaluate segmentation methods, and (ii) the limited availability of specialized biplanar scanners in clinical settings due to their high cost. These challenges underscore the critical need for comprehensive datasets to enhance segmentation algorithm accuracy and improve guidewire reconstruction techniques, facilitating the development of more versatile imaging technologies.

Figure 1. Guide3D dataset contains 8,746 manually annotated frames from two views for 3D reconstruction.

In this paper, we introduce Guide3D, a dataset designed to advance 3D reconstruction in endovascular navigation. Guide3D provides a standardized platform for the development and evaluation of algorithms. With a comprehensive dataset that includes manual annotations for segmentation and tools for effective 3D visualization, Guide3D is intended to drive innovation and improvement in endovascular intervention. Furthermore, the inclusion of video-based biplanar fluoroscopic data allows for the exploration of temporal dynamics, such as using optical flow networks. Guide3D seeks to bridge the gap between research innovations and clinical applications, addressing key challenges in endovascular procedures.

Related Works

Endovascular Datasets.

Datasets play a crucial role in advancing endovascular navigation by providing essential resources for the development, evaluation, and enhancement of algorithms. These datasets, derived from various imaging modalities such as mono X-ray, 3D ultrasound, and 3D MRI, encompass both real and synthetic images, facilitating diverse applications in the medical field.

Mono X-ray datasets, while prevalent, often fall short in providing the necessary detail required for accurate 3D reconstruction, which is critical for effective surgical navigation. The inherent limitations of 2D imaging techniques make it challenging to fully capture the complexity of anatomical structures during procedures. In contrast, 3D imaging modalities like 3D ultrasound and 3D MRI offer more comprehensive views, enabling better depth perception and improved visualization of surgical tools and surrounding tissues.

Despite the importance of these datasets, there remains a significant gap in the availability of comprehensive, publicly accessible datasets specifically designed for tool segmentation and 3D reconstruction. This scarcity hampers progress in developing robust algorithms capable of accurately interpreting complex medical images. The lack of diverse and high-quality datasets also limits the ability of researchers to train and validate their algorithms effectively, often leading to suboptimal performance in clinical scenarios.

Furthermore, creating high-quality datasets is not merely a technical challenge; it requires collaboration among various stakeholders, including clinicians, radiologists, and data scientists. Such collaboration is essential to ensure that the datasets reflect real-world clinical conditions and include diverse patient populations. Expanding the availability of well-annotated datasets is vital for fostering innovation and advancing the field of endovascular surgery.

Figure 2. Endovascular Dataset Explanation.

Catheter and Guidewire Segmentation.

The segmentation of endovascular tools, particularly guidewires and catheters, is an evolving field that heavily relies on the availability and quality of datasets. Previous studies have often used synthetic and semi-synthetic data to address the challenges posed by the limited availability of real-world datasets. Researchers have employed manually annotated datasets from 2D X-ray and 3D MRI modalities to train segmentation models. Additionally, the effectiveness of synthetic datasets has been demonstrated in improving model efficiency.

The advent of deep learning techniques, especially U-Net architectures, has significantly enhanced the accuracy of segmentation and tracking for these surgical instruments. This advancement has led to the development of fully automated segmentation frameworks that utilize extensively annotated data and incorporate unsupervised techniques, such as optical flow. However, the absence of a public, standardized dataset for method comparison continues to impede the advancement and assessment of scientific progress in this area.

Figure 3. Interventional Microcatheters.

3D Reconstruction.

Improving the accuracy of 3D reconstruction in endovascular procedures plays a crucial role in achieving better clinical outcomes by enhancing catheter navigation through advanced visualization and precise tracking. Advances in fluoroscopic imaging technology have led to more accurate positioning of devices. Various algorithms have been developed to facilitate this process, employing techniques such as elastic grid registration and epipolar geometry for 3D reconstruction from biplane angiography. Additionally, automatic catheter detection methods utilizing triangulation and graph-search algorithms have been applied in electrophysiology studies to improve reconstruction outcomes.

Research has demonstrated the importance of accurate 3D models for navigation within both complex and single-view vascular architectures, highlighting the value of biplanar data. However, the limited availability of comprehensive, publicly accessible datasets for the development and validation of algorithms in 3D reconstruction poses a significant challenge to technological progress and clinical application. This situation underscores the critical need for specialized datasets to promote ongoing innovation in the reconstruction of endovascular tools.

Figure 4. Guidewire Calibration.

Next

In the next part, we will dive deeply into how to conduct a dataset for 3D shape reconstruction.

Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge (Part 3)

In this part, we will show the effectiveness and the ablation studies of the Light-weight Deformable Registration Network and the Adversarial Learning Algorithm with Distilling Knowledge.

Dataset

As mentioned in [1], we train our method on two types of scans: Liver CT scans and Brain MRI scans.

For Liver CT scans, we use 5 datasets:

  1. LiTS contains 131 liver segmentation scans.
  2. MSD has 70 liver tumor CT scans, 443 hepatic vessels scans, and 420 pancreatic tumor scans.
  3. BFH is a smaller dataset with 92 scans.
  4. SLIVER is a challenging dataset with 20 liver segmentation scans, annotated by three expert doctors.
  5. LSPIG (Liver Segmentation of Pigs) contains 17 pairs of CT scans from pigs, provided by the First Affiliated Hospital of Harbin Medical University.

For Brain MRI scans, we use 4 datasets:

  1. ADNI contains 66 scans.
  2. ABIDE contains 1287 scans.
  3. ADHD contains 949 scans.
  4. LPBA has 40 scans, each featuring a segmentation ground truth of 56 anatomical structures.

Baselines

We compare the LDR + ALDK method with the following recent deformable registration methods:

  • ANTs SyN and Elastix B-spline are methods that find an optimal transformation by iteratively updating the parameters of the defined alignment.
  • VoxelMorph predicts a dense deformation in an unsupervised manner using deconvolutional layers.
  • VTN is an end-to-end learning framework that uses convolutional neural networks to register 3D medical images, especially those with large displacements.
  • RCN is a recent recursive deep architecture that utilizes learnable cascades and performs progressive deformation for each warped image.

Results

Table 1 summarizes the overall performance, testing speed, and the number of parameters compared with recent state-of-the-art methods in the deformable registration task. The results clearly show that the Light-weight Deformable Registration network (LDR), accompanied by the Adversarial Learning with Distilling Knowledge (ALDK) algorithm, significantly reduces the inference time and the number of parameters during the inference phase. Moreover, the method achieves competitive accuracy compared with the most recent high-performing but expensive networks, such as VTN or VoxelMorph. We notice that this improvement is consistent across all experiments on different datasets: SLIVER, LiTS, LSPIG, and LPBA.

In particular, we observe that on the SLIVER dataset the Dice score of the best model with 3 cascades (3-cas LDR + ALDK) is 0.3% lower than the best result of 3-cas VTN + Affine, while the inference speed is ~21 times faster on a CPU and the number of parameters used during inference is ~8 times smaller. Including the benchmarking results on three other datasets, i.e., LiTS, LSPIG, and LPBA, the light-weight model only trades off an average of 0.5% in Dice score and 1.25% in Jacc score for a significant gain in speed and a massive reduction in the number of parameters. We also notice that our method is the only work that achieves an inference time of approximately 1s on a CPU. This makes the method well suited for deployment, as it does not require expensive GPU hardware for inference.

Fig-1

Table 1: COMPARISON BETWEEN THE LDR + ALDK MODEL AND RECENT APPROACHES.

Ablation Study

Effectiveness of ALDK. Table 2 summarizes the effectiveness of Adversarial Learning with Distilling Knowledge (ALDK) when integrated into the light-weight student network. Note that LDR without ALDK is trained using only the reconstruction loss in an unsupervised learning setup. From this table, we clearly see that the ALDK algorithm improves the Dice score of LDR tested on the SLIVER dataset by 3.4%, 4.0%, and 3.1% for the 1-cas, 2-cas, and 3-cas setups, respectively. Additionally, using ALDK also increases the Jacc score by 5.2%, 4.9%, and 3.9% for 1-cas LDR, 2-cas LDR, and 3-cas LDR. These results verify the stability of the adversarial learning algorithm in the inference phase, under different evaluation metrics as well as different numbers of cascades. Furthermore, Table 2 also clearly shows the effectiveness and generalization of ALDK when applied to the student network. Since the deformations extracted from the teacher are used only during training, the adversarial learning algorithm fully maintains the speed and the number of parameters of the light-weight student network during inference. All results indicate that the student network incorporated with the adversarial learning algorithm successfully achieves the performance goal, while maintaining the efficient computational cost of the light-weight setup.

Fig-2

Table 2: EFFECTIVENESS OF ALDK WHEN INTEGRATED INTO THE LIGHT-WEIGHT STUDENT NETWORK (LDR).

Accuracy vs. Complexity. Figure 1 presents experimental results on the SLIVER dataset comparing LDR + ALDK and the baseline VTN under multiple recursive cascade setups on both CPU and GPU. On the CPU (Figure 1-a), in the 1-cascade setup, the Dice score of our method is 0.2% lower than VTN while the speed is ~15 times faster. The more cascades are used, the larger the speed gap between LDR + ALDK and the baseline VTN, e.g., the CPU speed gap increases to ~21 times in the 3-cascade setup. We observe the same effect on the GPU (Figure 1-b), where our method achieves slightly lower accuracy than VTN while clearly reducing the inference time. These results indicate that LDR + ALDK can work well with the teacher network to improve accuracy while significantly reducing the inference time on both CPU and GPU compared with the baseline VTN network.

Fig-3

Figure 1: Plots of Dice score and inference speed with respect to the number of cascades of the baseline Affine + VTN and LDR + ALDK. (a) for CPU speed and (b) for GPU speed. Note that results are reported for the SLIVER dataset; bars represent the CPU speed; lines represent the Dice score. All methods use an Intel Xeon E5-2690 v4 CPU and an Nvidia GeForce GTX 1080 Ti GPU for inference.

Visualization

Figure 2 illustrates the visual comparison among 1-cas LDR, 1-cas LDR + ALDK, and the baseline 1-cas RCN. Five different moving images in a volume are selected and registered to a chosen fixed image. It is important to note that although sections of the warped segmentations may overlap less with those of the fixed image, the segmentation intersection over union is computed for the whole volume, not for individual sections. In the segmented images in Figure 2, besides the matched areas colored in white, we also mark the mismatched areas in red for readability.

From Figure 2, we can see that the segmentation results of the 1-cas LDR network without ALDK (Figure 2-a) contain many mismatched areas (denoted in red). However, when we apply ALDK to the student network, the registration results are clearly improved (Figure 2-b). Overall, the LDR + ALDK visualization results in Figure 2-b are competitive with the baseline RCN network (Figure 2-c). This visualization confirms that our framework for deformable registration can achieve comparable results with the recent RCN network.

Fig-3

Figure 2: The visualization comparison between LDR (a), LDR + ALDK (b), and the baseline RCN (c). The left images are sections of the warped images; the right images are sections of the warped segmentation (white represents the matched areas between the warped image and the fixed image, red denotes the mismatched areas). The segmentation visualization indicates that LDR + ALDK (b) significantly reduces the mismatched areas of the student network LDR (a). Best viewed in color.

Reference

[1] Tran, Minh Q., et al. "Light-weight deformable registration using adversarial learning with distilling knowledge." IEEE Transactions on Medical Imaging, 2022.

Open Source

🐱 Github: https://github.com/aioz-ai/LDR_ALDK

Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge (Part 2)

In this part, we will introduce the Architecture of Light-weight Deformable Registration Network and Adversarial Learning Algorithm with Distilling Knowledge.

The Architecture of Light-weight Deformable Registration Network

In practice, recent deformation networks follow an encoder-decoder architecture and use 3D convolution to progressively down-sample the image, and deconvolution (transposed convolution) to recover spatial resolution [1, 3]. However, this setup consumes a large number of parameters. Therefore, the resulting models are computationally expensive and time-consuming. To overcome this problem, we design a new light-weight student network as illustrated in Figure 1.

In particular, the proposed light-weight network has four convolutional layers and three deconvolutional layers. Each convolutional layer has a bank of 4×4×4 filters with strides of 2×2×2, followed by a ReLU activation function. The number of output channels of the convolutional layers starts at 16 in the first layer, doubles at each subsequent layer, and ends up at 128. Skip connections between the convolutional layers and the deconvolutional layers are added to help refine the dense prediction. The subnetwork outputs a dense flow prediction field, i.e., a 3-channel volume feature map with the same size as the input.

In comparison with the current state-of-the-art dense deformable registration network [3], the number of parameters of our proposed light-weight student network is reduced by approximately 10 times. In practice, this significant reduction may lead to an accuracy drop. Therefore, we propose a new Adversarial Learning with Distilling Knowledge algorithm to effectively transfer the teacher deformations \phi_t to our introduced student network, making it light-weight while achieving competitive performance.
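To make the architecture concrete, below is a minimal PyTorch sketch of such a light-weight 3D encoder-decoder. The layer sizes follow the description above, but the input channel count, padding, and exact skip-connection wiring are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class LDRStudent(nn.Module):
    """Sketch of the light-weight student network: four 3D convolutions
    (4x4x4 kernels, stride 2, channels 16->32->64->128) and three transposed
    convolutions with skip connections, ending in a 3-channel flow field."""

    def __init__(self, in_channels=2):  # moving + fixed image stacked on channels (assumption)
        super().__init__()
        def conv(ci, co):
            return nn.Sequential(nn.Conv3d(ci, co, 4, stride=2, padding=1),
                                 nn.ReLU(inplace=True))
        def deconv(ci, co):
            return nn.Sequential(nn.ConvTranspose3d(ci, co, 4, stride=2, padding=1),
                                 nn.ReLU(inplace=True))
        self.enc1, self.enc2 = conv(in_channels, 16), conv(16, 32)
        self.enc3, self.enc4 = conv(32, 64), conv(64, 128)
        self.dec1 = deconv(128, 64)        # 1/16 -> 1/8 resolution
        self.dec2 = deconv(64 + 64, 32)    # concatenated with the matching encoder feature
        self.dec3 = deconv(32 + 32, 16)
        self.flow = nn.ConvTranspose3d(16 + 16, 3, 4, stride=2, padding=1)  # 3-channel flow

    def forward(self, moving, fixed):
        x = torch.cat([moving, fixed], dim=1)
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d1 = self.dec1(e4)
        d2 = self.dec2(torch.cat([d1, e3], dim=1))
        d3 = self.dec3(torch.cat([d2, e2], dim=1))
        return self.flow(torch.cat([d3, e1], dim=1))  # same spatial size as the input
```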

Fig-1

Figure 1: The structure of Light-weight Deformable Registration student network. The number of channels is annotated above the layer. Curved arrows represent skip paths (layers connected by an arrow are concatenated before transposed convolution). Smaller canvas means lower spatial resolution (Source).

Adversarial Learning Algorithm with Distilling Knowledge

Our adversarial learning algorithm aims to improve the student network accuracy through the distilled teacher deformations extracted from the teacher network. The learning method comprises a deformation-based adversarial loss \mathcal{L}_{adv} and its accompanying learning strategy (Algorithm 1).

Fig-2

Figure 2: Adversarial Learning Strategy (Source).

Adversarial Loss. The loss function for the light-weight student network is a combination of the discrimination loss l_{dis} and the reconstruction loss l_{rec}. The forward and backward passes through this loss function are controlled by Algorithm 1. In particular, the deformation loss \mathcal{L}_{adv} applied at the final warped image can be written as:

\mathcal{L}_{adv} = \gamma l_{rec} + (1 - \gamma) l_{dis}

where \gamma controls the contribution between l_{rec} and l_{dis}. Note that \mathcal{L}_{adv} is only applied to the final warped image.

Discrimination Loss. In the student network, the discrimination loss is computed as follows:

l_{dis} = \left\lVert D_\theta(\phi_{s}) - D_\theta(\phi_{t}) \right\rVert_2^{2} + \lambda\left(\left\lVert \nabla_{\hat\phi_{s}} D_\theta(\hat\phi_{s}) \right\rVert_2 - 1\right)^{2}

where \lambda controls the gradient penalty regularization. The joint deformation \hat\phi_{s} is computed from the teacher deformation \phi_{t} and the predicted student deformation \phi_{s} as follows:

\hat\phi_{s} = \beta \phi_{t} + (1 - \beta) \phi_{s}

where \beta controls the effect of the teacher deformation.
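The following is a hedged PyTorch sketch of l_{dis} as written above: a squared L2 matching term between the discriminator responses on the student and teacher deformations, plus a gradient penalty computed on the joint deformation \hat\phi_s. Here D is assumed to be any discriminator module (such as the one described next), and the default values of \beta and \lambda are placeholders, not the authors' settings.

```python
import torch

def discrimination_loss(D, phi_s, phi_t, beta=0.5, lam=10.0):
    """Sketch of l_dis: matching term + gradient penalty on the joint deformation."""
    # squared L2 distance between discriminator responses on student / teacher deformations
    match = (D(phi_s) - D(phi_t)).pow(2).sum()

    # joint deformation phi_hat = beta * phi_t + (1 - beta) * phi_s
    phi_hat = (beta * phi_t + (1.0 - beta) * phi_s).detach().requires_grad_(True)
    d_hat = D(phi_hat)
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=phi_hat, create_graph=True)[0]
    penalty = (grads.flatten(1).norm(2, dim=1) - 1.0).pow(2).mean()

    return match + lam * penalty
```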

In the discrimination loss, D_\theta is the discriminator, a neural network with learnable parameters \theta. The details of D_\theta are shown in Figure 3. In particular, D_\theta consists of six 3D convolutional layers; the first layer is 128×128×128×3 and takes the c×c×c×1 deformation as input, where c equals the scaled size of the input image. The second layer is 64×64×64×16. From the second layer to the last convolutional layer, each convolutional layer has a bank of 4×4×4 filters with strides of 2×2×2, followed by a ReLU activation function, except for the last layer, which is followed by a sigmoid activation function. The number of output channels of the convolutional layers starts at 16 in the second layer, doubles at each subsequent layer, and ends up at 256.

Basically, this injects the conditional information with a matched tensor dimension and then lets the network learn useful features from the conditional input. The output of the last neural layer is the mean feature of the discriminator, denoted as M. Note that in the discrimination loss, a gradient penalty regularization is applied to avoid critic weight clipping, which may lead to undesired behavior when training adversarial networks.
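A rough sketch of such a discriminator is shown below. The channel progression follows the description above (16 doubling up to 256, sigmoid after the last layer), but the width of the first layer and the input channel count are assumptions, since the text leaves them ambiguous.

```python
import torch.nn as nn

def make_discriminator(in_channels=1):
    """Sketch of D_theta: six 3D convolutions (4x4x4, stride 2), ReLU activations,
    and a sigmoid after the last layer. First-layer width is an assumption."""
    chs = [3, 16, 32, 64, 128, 256]
    layers, prev = [], in_channels
    for i, ch in enumerate(chs):
        layers.append(nn.Conv3d(prev, ch, kernel_size=4, stride=2, padding=1))
        layers.append(nn.Sigmoid() if i == len(chs) - 1 else nn.ReLU(inplace=True))
        prev = ch
    return nn.Sequential(*layers)

# The mean feature M is then the spatial average of the last layer's output.
```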

Fig-3

Figure 3: The structure of the discriminator D_\theta used in the Discrimination Loss (l_{dis}) of our Adversarial Learning with Distilling Knowledge algorithm (Source).

Reconstruction Loss. The reconstruction loss l_{rec} is an important part of a deformation estimator. Following the VTN [3] baseline, the reconstruction loss is written as:

l_{rec}(\textbf{\textit{I}}_m^h, \textbf{\textit{I}}_f) = 1 - CorrCoef[\textbf{\textit{I}}_m^h, \textbf{\textit{I}}_f]

where

CorrCoef[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] = \frac{Cov[\textbf{\textit{I}}_1,\textbf{\textit{I}}_2]}{\sqrt{Cov[\textbf{\textit{I}}_1,\textbf{\textit{I}}_1]Cov[\textbf{\textit{I}}_2,\textbf{\textit{I}}_2]}}
Cov[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] = \frac{1}{|\omega|}\sum_{x \in \omega} \textbf{\textit{I}}_1(x)\textbf{\textit{I}}_2(x) - \frac{1}{|\omega|^{2}}\sum_{x \in \omega} \textbf{\textit{I}}_1(x)\sum_{y \in \omega}\textbf{\textit{I}}_2(y)

where CorrCoef[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] is the correlation between two images \textbf{\textit{I}}_1 and \textbf{\textit{I}}_2, and Cov[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] is the covariance between them. \omega denotes the cuboid (or grid) on which the input images are defined.
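Since the correlation coefficient above is computed over the whole volume, it can be written compactly; here is a small sketch (the eps term is an added assumption for numerical stability):

```python
import torch

def reconstruction_loss(warped, fixed, eps=1e-6):
    """l_rec = 1 - CorrCoef[I_m^h, I_f], with Cov computed as mean(xy) - mean(x)mean(y)."""
    w = warped.flatten(1)   # (batch, num_voxels)
    f = fixed.flatten(1)
    cov_wf = (w * f).mean(dim=1) - w.mean(dim=1) * f.mean(dim=1)
    cov_ww = (w * w).mean(dim=1) - w.mean(dim=1).pow(2)
    cov_ff = (f * f).mean(dim=1) - f.mean(dim=1).pow(2)
    corr = cov_wf / torch.sqrt(cov_ww * cov_ff + eps)
    return (1.0 - corr).mean()
```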

Learning Strategy. The forward and backward passes of the aforementioned \mathcal{L}_{adv} are controlled by the adversarial learning strategy described in Algorithm 1.

In our deformable registration setup, the roles of real data and attacking data are reversed compared with the traditional adversarial learning strategy. In adversarial learning, the model uses unreal (generated) images as attacking data, while image labels are ground truths. However, in our deformable registration task, the model leverages the unreal (generated) deformations from the teacher as attacking data, while the image is the ground truth that the model must reconstruct from the input information. As a consequence, the roles of images and labels are reversed in our setup. Since we want more information to be learned from real data, the generator needs to be considered more frequently. Although the knowledge in the discriminator is used as attacking data, the information it provides is meaningful because the distilled information is inherited from the high-performing teacher model. With these characteristics of both the generator and discriminator, the light-weight student network is expected to learn more effectively and efficiently.

Reference

[1] S. Zhao, Y. Dong, E. I. Chang, Y. Xu, et al., Recursive cascaded networks for unsupervised medical image registration, in ICCV, 2019.

[2] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, ArXiv, 2015.

[3] S. Zhao, T. Lau, J. Luo, I. Eric, C. Chang, and Y. Xu, Unsupervised 3d end-to-end medical image registration with volume tweening network, IEEE J-BHI, 2019.

Open Source

🐱 Github: https://github.com/aioz-ai/LDR_ALDK

Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge

Introduction: Medical image registration

Medical image registration is the process of systematically placing separate medical images in a common frame of reference so that the information they contain can be effectively integrated or compared. Applications of image registration include combining images of the same subject from different modalities, aligning temporal sequences of images to compensate for the motion of the subject between scans, aligning images from multiple subjects in cohort studies, or navigating with image guidance during interventions. Since many organs do deform substantially while being scanned, the rigid assumption can be violated as a result of scanner-induced geometrical distortions that differ between images. Therefore, performing deformable registration is an essential step in many medical procedures.

Previous Studies, Remaining Challenges, and Motivation

Recently, learning-based methods have become popular for tackling the problem of deformable registration. These methods can be split into two groups: (i) supervised methods that rely on dense ground-truth flows obtained either by traditional algorithms or by simulating intra-subject deformations. Although these works achieve state-of-the-art performance, they require a large amount of manually labeled training data, which is expensive to obtain; and (ii) unsupervised learning methods that use a similarity measurement between the moving and the fixed image to exploit a large amount of unlabeled data. These unsupervised methods achieve competitive results in comparison with supervised methods. However, their deformations are reconstructed without direct ground-truth guidance, which limits the learnable information they can leverage. Furthermore, recent unsupervised methods all share the issue of great complexity, as the network parameters increase significantly when multiple progressive cascades are taken into account. As a result, these works cannot achieve real-time performance during inference and require intensive computational resources for deployment.

In practice, there are many scenarios where medical image registration needs to be fast: consider matching preoperative and intra-operative images during surgery, interactive change detection of CT or MRI data for a radiologist, deformation compensation or 3D alignment of large histological slices for a pathologist, or processing large amounts of images from high-throughput imaging methods. Besides, in many image-guided robotic interventions, performing real-time deformable registration is an essential step to register the images and deal with organs that deform substantially. Economically, the development of a CPU-friendly solution for deformable registration will significantly reduce the instrument costs of equipping an operating theatre, as it does not require GPU or cloud-based computing servers, which are costly and consume much more power than a CPU. This will benefit patients in low- and middle-income countries, where local equipment, personnel expertise, and budget-constrained infrastructure are limited. Therefore, designing an efficient model that is fast and accurate for deformable registration is a crucial task and worth studying in order to improve a variety of surgical interventions.

Contribution

Deformable registration is a crucial step in many medical procedures such as image-guided surgery and radiation therapy. Most recent learning-based methods focus on improving accuracy by optimizing the non-linear spatial correspondence between the input images. Therefore, these methods are computationally expensive and require modern graphics cards for real-time deployment. Thus, we introduce a new Light-weight Deformable Registration network that significantly reduces the computational cost while achieving competitive accuracy (Fig. 1). In particular, we propose a new adversarial learning with distilling knowledge algorithm that successfully transfers meaningful information from the effective but expensive teacher network to the student network. We design the student network such that it is light-weight and well suited for deployment on a typical CPU. The extensive experimental results on different public datasets show that our proposed method achieves state-of-the-art accuracy while being significantly faster than recent methods. We further show that the use of our adversarial learning algorithm is essential for a time-efficient deformable registration method.

Fig-1

(a)
(b)
Figure 1: Comparison between typical deep learning-based methods for deformable registration (a) and our approach using adversarial learning with distilling knowledge for deformable registration (b). In our work, the expensive Teacher Network is used only in training; the Student Network is light-weight and inherits helpful knowledge from the Teacher Network via our Adversarial Learning algorithm. Therefore, the Student Network has high inference speed, while achieving competitive accuracy (Source).

Methodology

Method overview

We describe our method for Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge. Our method is composed of three main components: (i) a Knowledge Distillation module which extracts meaningful deformations \bm{\phi}_t from the Teacher Network; (ii) a Light-weight Deformable Registration (LDR) module which outputs a high-speed Student Network; and (iii) an Adversarial Learning with Distilling Knowledge (ALDK) algorithm which effectively transfers the teacher deformations \bm{\phi}_t to the student deformations. An overview of our proposed deformable registration method can be found in Fig.2.

Fig-2

Figure 2: An overview of our proposed Light-weight Deformable Registration (LDR) method using Adversarial Learning with Distilling Knowledge (ALDK). Firstly, by using knowledge distillation, we extract the deformations from the Teacher Network as meaningful ground-truths. Secondly, we design a light-weight student network, which has competitive speed. Finally, we employ the Adversarial Learning with Distilling Knowledge algorithm to effectively transfer the meaningful knowledge of distilled deformations from the Teacher Network to the Student Network (Source).

Since the full content would be overly long, in this part we introduce the background theory for Deformable Registration and Knowledge Distillation for Deformation. In the next part, we will introduce the architecture of the Light-weight Deformable Registration Network and the Adversarial Learning Algorithm with Distilling Knowledge. In the final part, we will present the effectiveness of the method in comparison with recent state-of-the-art approaches and a detailed analysis.

Background: Deformable Registration

We follow RCN [1] to define the deformable registration task recursively using multiple cascades. Let \textbf{\textit{I}}_m, \textbf{\textit{I}}_f denote the moving image and the fixed image respectively, both defined over the d-dimensional space \bm{\Omega}. A deformation is a mapping \bm{\phi} : \bm{\Omega} \rightarrow \bm{\Omega}. A reasonable deformation should be continuously varying and prevented from folding. The deformable registration task is to construct a flow prediction function \textbf{F} which takes \textbf{\textit{I}}_m, \textbf{\textit{I}}_f as inputs and predicts a dense deformation \bm{\phi} that aligns \textbf{\textit{I}}_m to \textbf{\textit{I}}_f using a warp operator \circ as follows:

\textbf{F}^{(n)}(\textbf{\textit{I}}^{(n-1)}_m, \textbf{\textit{I}}_f) = \phi^{(n)} \circ \textbf{F}^{(n-1)}(\phi^{(n-1)} \circ \textbf{\textit{I}}^{(n-2)}_m, \textbf{\textit{I}}_f)

where \textbf{F}^{(n-1)} is the same flow prediction function applied at the preceding cascade. Assuming n cascades in total, the final output is a composition of all predicted deformations, i.e.,

\textbf{F}(\textbf{\textit{I}}_m, \textbf{\textit{I}}_f) = \phi^{(n)} \circ ... \circ \phi^{(1)},

and the final warped image is constructed by

\textbf{\textit{I}}_{m}^{(n)} = \textbf{F}(\textbf{\textit{I}}_m, \textbf{\textit{I}}_f) \circ \textbf{\textit{I}}_m

In general, the previous equations form the hypothesis function \mathcal{F} under the learnable parameter \mathbf{W},

\mathcal{F}(\textbf{\textit{I}}_{m}, \textbf{\textit{I}}_f, \mathbf{W}) = (\mathbf{v}_{\phi}, \textbf{\textit{I}}_m^{(n)})

where \mathbf{v}_{\phi} = [\bm{\phi}^{(1)}, \bm{\phi}^{(2)}, ..., \bm{\phi}^{(k)}, ..., \bm{\phi}^{(n)}] is a vector containing the predicted deformations of all cascades. Each deformation \bm{\phi}^{(k)} can be computed as

\bm{\phi}^{(k)} = {\mathcal{F}}^{(k)}\left(\textbf{\textit{I}}_{m}^{(k-1)}, \textbf{\textit{I}}_f, \mathbf{W}_{\phi^{(k)}}\right)

To estimate and achieve a good deformation, different networks are introduced to define and optimize the learnable parameter \mathbf{W}.
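To illustrate the recursion above, here is a hedged sketch of how the cascades could be chained in PyTorch. The flow is assumed to be a displacement field already expressed in the normalized coordinates used by grid_sample, which is a simplification of how actual implementations handle the warp operator.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp operator 'o': resample a 3D volume with a dense deformation.
    flow: (N, 3, D, H, W) displacements, assumed to be in grid_sample's [-1, 1] units."""
    b = image.shape[0]
    identity = torch.eye(3, 4, device=image.device).repeat(b, 1, 1)
    grid = F.affine_grid(identity, list(image.shape), align_corners=False)
    return F.grid_sample(image, grid + flow.permute(0, 2, 3, 4, 1), align_corners=False)

def recursive_registration(cascades, moving, fixed):
    """Cascade k predicts phi^(k) from (I_m^(k-1), I_f) and updates the warped image.
    Returns v_phi (all predicted deformations) and the final warped image I_m^(n)."""
    warped, flows = moving, []
    for cascade in cascades:  # each cascade is a flow prediction network
        phi_k = cascade(warped, fixed)
        warped = warp(warped, phi_k)
        flows.append(phi_k)
    return flows, warped
```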

Knowledge Distillation for Deformation

Knowledge distillation is the process of transferring knowledge from a cumbersome model (teacher model) to a distilled model (student model). The popular way to achieve this goal is to train the student model on a transfer set using a soft target distribution produced by the teacher model.

Different from typical knowledge distillation methods that target the output softmax of neural networks as the knowledge, in the deformable registration task we leverage the teacher deformation \bm{\phi}_t as the transferred knowledge. As discussed in [2], teacher networks are usually high-performing networks with good accuracy. Therefore, our goal is to leverage the current state-of-the-art Recursive Cascaded Networks (RCN) [1] as the teacher network for extracting meaningful deformations for the student network. The RCN network contains an affine transformation and a large number of dense deformable registration sub-networks designed by VTN [3]. Although the teacher network has expensive computational costs, it is only applied during training and is not used during inference.

Reference

[1] S. Zhao, Y. Dong, E. I. Chang, Y. Xu, et al., Recursive cascaded networks for unsupervised medical image registration, in ICCV, 2019.

[2] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, ArXiv, 2015.

[3] S. Zhao, T. Lau, J. Luo, I. Eric, C. Chang, and Y. Xu, Unsupervised 3d end-to-end medical image registration with volume tweening network, IEEE J-BHI, 2019.

Open Source

🐱 Github: https://github.com/aioz-ai/LDR_ALDK

Multiple Meta-model Quantifying for Medical Visual Question Answering

Motivation

A medical Visual Question Answering (VQA) system can provide meaningful references for both doctors and patients during the treatment process. Extracting image features is one of the most important steps in a medical VQA framework, as it outputs essential information for predicting answers. Transfer learning, in which deep learning models pretrained on large-scale labeled datasets such as ImageNet are reused, is a popular way to initialize the feature extraction process. However, due to the difference in visual concepts between ImageNet images and medical images, the finetuning process is not sufficient. Recently, Model-Agnostic Meta-Learning (MAML) has been introduced to overcome the aforementioned problem by learning meta-weights that quickly adapt to visual concepts. However, MAML is heavily impacted by the meta-annotation phase for all images in the medical dataset. Different from normal images, transfer learning in medical images is more challenging due to:

  • (i) noisy labels may occur when labeling images in an unsupervised manner;
  • (ii) high-level semantic labels cause uncertainty during learning;
  • (iii) difficulty in scaling up the process to all unlabeled images in medical datasets.

Overcoming Data Limitation in Medical Visual Question Answering

What are the difficulties when dealing with Medical VQA task?

Visual Question Answering (VQA) aims to provide a correct answer to a given question such that the answer is consistent with the visual content of a given image.

In medical domain, VQA could benefit both doctors and patients. For example, doctors could use answers provided by VQA system as support materials in decision making, while patients could ask VQA questions related to their medical images for better understanding their health.

Fig-1

Figure 1: An example of Medical VQA (Source).

However, one major problem with medical VQA is the lack of large scale labeled training data which usually requires huge efforts to build.

  • The first attempt to build a dataset for medical VQA was ImageCLEF-Med. In this dataset, images were automatically collected from PubMed Central articles, and the questions and answers were automatically generated from the corresponding image captions. By this construction, the data has a high noise level, i.e., the dataset includes many images that are not useful for direct patient care and also contains questions that do not make any sense.
  • Recently, the first manually constructed dataset for the medical VQA task, VQA-RAD, was released. Unfortunately, it contains only 315 images, which prevents directly applying powerful deep learning models to the VQA problem. One may think of using transfer learning, in which deep learning models pretrained on large-scale labeled datasets such as ImageNet are finetuned on medical VQA. However, due to the difference in visual concepts between ImageNet images and medical images, finetuning with very few medical images is not sufficient.

Therefore, it is necessary to develop a new VQA framework that can improve accuracy while requiring only a small amount of labeled training data.

The motivation for our approach to overcome the data limitation of medical VQA comes from two observations:

  • Firstly, we observe that there are large-scale unlabeled medical images available. These images are from the same domain as medical VQA images. Hence, if we train an unsupervised deep learning model using these unlabeled images, the trained weights may be easier to adapt to the medical VQA problem than weights pretrained on ImageNet images.
  • Another observation is that although the labeled VQA-RAD dataset is primarily designed for VQA, with a little effort we can extract new class labels for it. The new class labels allow us to apply recent meta-learning techniques for learning meta-weights that can be quickly adapted to the VQA problem later.

Methodology

The proposed medical VQA framework is presented in Figure 2. In our framework, the image feature extraction component is initialized by pretrained weights from MAML and CDAE. After that, the VQA framework will be finetuned in an end-to-end manner on the medical VQA data. In the following sections, we detail the architectures of MAML, CDAE, and our framework.

Fig-2

Figure 2: The proposed medical VQA framework. The image feature extraction is denoted as 'Mixture of Enhanced Visual Features (MEVF)' and is marked with the red dashed box. The weights of MEVF are initialized by MAML and CDAE (Source).

Model-Agnostic Meta-Learning -- MAML

The MAML model consists of four 3×3 convolutional layers with stride 2 and ends with a mean pooling layer; each convolutional layer has 64 filters and is followed by a ReLU layer.
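Below is a minimal PyTorch sketch of such a feature extractor; the input channel count and padding are assumptions, and only the layer pattern (four 3×3 convolutions with stride 2, 64 filters each, ReLU, then mean pooling) follows the description above.

```python
import torch.nn as nn

# Four 3x3 conv layers, stride 2, 64 filters each, ReLU, then global mean pooling.
maml_encoder = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),  # mean pooling -> one 64-d feature vector per image
    nn.Flatten(),
)
```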

We create the dataset for training MAML by manually reviewing around three thousand question-answer pairs from the training set of the VQA-RAD dataset. In our annotation process, images are split into three parts based on their body part labels (head, chest, abdomen). Images from each body part are further divided into three subcategories based on the interpretation of the question-answer pairs corresponding to the images. These subcategories are: 1. normal images in which no pathology is found; 2. abnormal present images in which fluid, air, a mass, or a tumor is present; 3. abnormal organ images in which the organs are large in size or in the wrong position.

Thus, all the images are categorized into 9 classes:

| head normal | head abnormal present | head abnormal organ |
| chest normal | chest abnormal organ | chest abnormal present |
| abdominal normal | abdominal abnormal organ | abdominal abnormal present |

For every iteration of MAML training (line 3 in Alg. 1), 5 tasks are sampled per iteration. For each task, we randomly select 3 classes (from the 9 classes). For each class, we randomly select 6 images, in which 3 images are used for updating the task models and the remaining 3 images are used for updating the meta-model.
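The sampling procedure above can be sketched as follows; images_by_class is an assumed mapping from each of the 9 class names to its list of images, and the support/query split corresponds to the 3 images used for the task update and the 3 used for the meta update.

```python
import random
from collections import defaultdict

def sample_meta_tasks(images_by_class, n_tasks=5, n_classes=3, k_support=3, k_query=3):
    """Per meta-iteration: 5 tasks, each with 3 of the 9 classes and 6 images per class
    (3 support images for the task update, 3 query images for the meta update)."""
    tasks = []
    for _ in range(n_tasks):
        classes = random.sample(list(images_by_class), n_classes)
        support, query = defaultdict(list), defaultdict(list)
        for c in classes:
            picked = random.sample(images_by_class[c], k_support + k_query)
            support[c], query[c] = picked[:k_support], picked[k_support:]
        tasks.append((dict(support), dict(query)))
    return tasks
```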

Alg-1

Denoising Auto Encoder -- CDAE

The encoder maps an image x', which is a noisy version of the original image x, to a latent representation z that retains a useful amount of information. The decoder transforms z to the output y. The training algorithm aims to minimize the reconstruction error between y and the original image x as follows:

L_{rec} = \left\| x - y \right\|_2^2

In our design, the encoder is a stack of convolutional layers, each followed by a max pooling layer. The decoder is a stack of deconvolutional and convolutional layers. The noisy version x' is obtained by adding Gaussian noise to the original image x.
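A compact sketch of such a denoising auto-encoder is given below; the number of layers, channel widths, and noise level are illustrative assumptions, while the overall structure (convolution + max pooling encoder, deconvolution + convolution decoder, Gaussian-corrupted input) follows the description above.

```python
import torch
import torch.nn as nn

class CDAE(nn.Module):
    """Sketch of the denoising auto-encoder: encoder of conv + max-pool blocks,
    decoder of transposed conv + conv layers, trained to reconstruct the clean image."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 32, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x, noise_std=0.1):
        x_noisy = x + noise_std * torch.randn_like(x)  # corrupt input with Gaussian noise
        z = self.encoder(x_noisy)                      # latent representation z
        y = self.decoder(z)                            # reconstruction y
        rec_loss = torch.mean((x - y) ** 2)            # L_rec = ||x - y||_2^2 (mean form)
        return y, rec_loss
```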

To train the CDAE, we collect 11,779 unlabeled images available online, which are brain MRI images, chest X-ray images, and abdominal CT images. The dataset is split into a train set with 9,423 images and a test set with 2,356 images. We use Gaussian noise to corrupt the input images before feeding them to the encoder.

Our VQA framework

After training MAML and CDAE, we use their trained weights to initialize the MEVF image feature extraction component in the VQA framework. We then finetune the whole VQA model using the training set of VQA-RAD dataset.

To train the proposed model, we introduce a multi-task loss function to incorporate the effectiveness of the CDAE into VQA. Formally, our loss function is defined as follows:

L = \alpha_1 L_{vqa} + \alpha_2 L_{rec}

where L_{vqa} is a cross-entropy loss for VQA classification and L_{rec} stands for the reconstruction loss of the CDAE. The whole VQA model is finetuned in an end-to-end manner.
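A minimal sketch of this multi-task objective is shown below; alpha_1 and alpha_2 default to placeholder values, and the reconstruction term is written as a mean squared error, an assumed normalization of the L2 loss above.

```python
import torch.nn.functional as F

def vqa_multitask_loss(answer_logits, answer_targets, reconstruction, original,
                       alpha_1=1.0, alpha_2=1.0):
    """L = alpha_1 * L_vqa (cross-entropy) + alpha_2 * L_rec (CDAE reconstruction)."""
    l_vqa = F.cross_entropy(answer_logits, answer_targets)
    l_rec = F.mse_loss(reconstruction, original)
    return alpha_1 * l_vqa + alpha_2 * l_rec
```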

Results

Tab-1

Table 1: VQA results on VQA-RAD test set. All reference methods differ at the image feature extraction component. Other components are similar. The Stacked Attention Network (SAN) is used as the attention mechanism in all methods (Source).

Table 1 presents the VQA accuracy on both VQA-RAD open-ended and close-ended questions on the test set. The results show that, for both MAML and CDAE, pretraining followed by finetuning significantly improves performance over training from scratch using only VQA-RAD.

In addition, the results also show that our pretraining and finetuning of MAML and CDAE give better performance than finetuning VGG-16 pretrained on the ImageNet dataset. Our proposed image feature extraction MEVF, which leverages the pretrained weights of both MAML and CDAE and then finetunes them, gives the best performance. This confirms the effectiveness of the proposed MEVF for dealing with the limitation of labeled training data for medical VQA.

Tab-2

Table 2: Performance comparison on VQA-RAD test set (Source).

Table 2 presents comparative results between methods. Note that for the image feature extraction, the baselines use pretrained models (VGG or ResNet) that have been trained on ImageNet and then finetuned on the VQA-RAD dataset. For the question feature extraction, all baselines and our framework use the same pretrained models (i.e., GloVe) with finetuning on VQA-RAD. The results show that when BAN or SAN is used as the attention mechanism in our framework, it significantly outperforms the baseline BAN and SAN frameworks. Our best setting, i.e., the one with BAN as the attention mechanism, achieves state-of-the-art results and significantly outperforms the best baseline framework BAN; the improvements are 16.3% and 8.6% on open-ended and close-ended VQA, respectively.

Conclusion

In this paper, we proposed a novel medical VQA framework that leverages the meta-learning MAML and denoising auto-encoder CDAE for image feature extraction in order to overcome the limitation of labeled training data. Specifically, CDAE helps to leverage information from the large scale unlabeled images, while MAML helps to learn meta-weights that can be quickly adapted to the VQA problem. We establish new state-of-the-art results on VQA-RAD dataset for both close-ended and open-ended questions.

Open Source

🐱 Github: https://github.com/aioz-ai/MICCAI19-MedVQA

Data Augmentation for Colon Polyp Detection: A Systematic Study

Colorectal cancer (CRC)♋, also known as bowel cancer or colon cancer, develops from the colon or rectum, starting from a growth called a polyp. Detecting polyps is a common approach in screening colonoscopies to prevent CRC at an early stage. Early colon polyp detection from medical images is still an unsolved problem due to the considerable variation of polyps in shape, texture, size, color, and illumination, and the lack of publicly annotated datasets. At AIOZ, we adopt a recently proposed auto-augmentation method for polyp detection. We also conduct a systematic study on the performance of different data augmentation methods for colon polyp detection. The experimental results show that auto-augmentation achieves the best performance compared to other augmentation strategies.

Introduction

Colorectal cancer (CRC) is the third-largest cause of worldwide cancer deaths in men and the second in women, with up to 700,000 patients dying each year [1]. Detection and removal of colon polyps at an early stage will reduce mortality from CRC. There are several methods for colon screening, such as CT colonography or wireless capsule endoscopy, but the gold standard is colonoscopy [2].

The colonoscopy is performed by an experienced doctor who uses a colonoscope to screen and scan for abnormalities such as intestinal signs, symptoms, colon cancer, and polyps. Abnormal polyps can be removed, and small amounts of tissue can be detached for analysis during the colonoscopy. However, the most crucial drawback of colonoscopy is the polyp miss rate, especially for polyps smaller than 10mm. Several factors cause the miss rate. They include both subjective factors, such as bowel preparation, the specific choice of endoscope, video processor, and clinician skill, and objective factors, such as polyp appearance and camera movement conditions. For these reasons, automatic polyp detection is a potential approach to assist clinicians in improving the sensitivity of the diagnosis.

Previous research shows that automatic polyp detection using deep learning-based methods outperforms hand-crafted methods, as demonstrated by the top two results in the MICCAI 2015 challenge [3]. Beyond deep learning-based approaches and model architectures, data augmentation is also a critical factor in making significant improvements due to the lack of annotated data. Recent work [4] shows that learning an optimal augmentation policy from data, instead of hand-crafting data augmentation strategies, can generalize better across objects. Thus, studying auto augmentation for the polyp detection problem is necessary. In this research, we adapt Faster R-CNN [5] together with AutoAugment [6] to detect polyps from colonoscopy video frames. Besides, we also evaluate traditional data augmentation [7] to see the effectiveness of different augmentation strategies.

Methodology

1. Polyp Detector

Thanks to the power of deep learning, recent works [5, 12, 11] show that deep learning-based detection methods give impressive detection performance. In this work, we use the Faster R-CNN object detector [5] with a ResNet-101 [13] backbone pre-trained on the COCO dataset. Our experiments show that this architecture gives competitive performance on the polyp detection problem. The experimental setting for the detector is as follows. The network is trained using stochastic gradient descent (SGD) with 0.9 momentum; the learning rate starts at 3e-4 and is decreased to 3e-5 from iteration 900k. The number of anchor boxes per location used in our model is 12 (4 scales, i.e., 64×64, 128×128, 256×256, 512×512, and 3 aspect ratios, i.e., 1:2, 1:1, 2:1).
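The optimization schedule can be summarized with the following sketch. The detector itself (a COCO-pretrained Faster R-CNN with a ResNet-101 backbone) is not reproduced here and is represented by a placeholder parameter; only the SGD settings, learning-rate drop, and anchor configuration follow the text above.

```python
import torch

# Placeholder standing in for the Faster R-CNN parameters (assumption for illustration).
detector_params = [torch.nn.Parameter(torch.zeros(1))]

optimizer = torch.optim.SGD(detector_params, lr=3e-4, momentum=0.9)
# learning rate drops from 3e-4 to 3e-5 (x0.1) at iteration 900k
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[900_000], gamma=0.1)

# 12 anchor boxes per location: 4 scales x 3 aspect ratios
anchor_scales = (64, 128, 256, 512)
anchor_aspect_ratios = (0.5, 1.0, 2.0)  # 1:2, 1:1, 2:1
```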

2. Data Augmentation

autoaugment_figure_small

Fig. 1. Example of applying learned augmentation policies to a colonoscopy image.

Data augmentation can be split into two types: self-defined data augmentation (a.k.a. traditional augmentation) and auto augmentation [6]. In this study, we adopt an automated data augmentation approach for object detection, i.e., AutoAugment [6], which finds optimal data augmentation policies during training. In AutoAugment, an augmentation policy consists of several sub-policies, and each sub-policy consists of two operations. Each operation is an image transformation with two parameters: the probability of applying it and the magnitude of the shift. There are three types of transformations used in AutoAugment for object detection [4]:

  • Color operations: distort color channels without impacting the locations of the bounding boxes.
  • Geometric operations: geometrically distort the image, which correspondingly alters the location and size of the bounding box annotations.
  • Bounding box operations: only distort the pixel content contained within the bounding box annotations.

One of the essential conclusions in [4] is that the learned policy found on COCO can be directly applied to other detection datasets and models to improve predictive accuracy. Hence, in this study, we apply the learned policies from [4] to augment data when training the detector in Sec. 3.1. The learned policies we use for training our detector are summarized in Table 1. In Table 1, each operation is a triple describing the transformation, the probability, and the magnitude of the transformation. Due to space limitations, we refer the reader to [4] for details on the descriptions of the transformations. Fig. 1 shows augmented examples when applying the learned augmentation policy on a polyp image from the training dataset, and a minimal sketch of how a sub-policy is applied is given below.
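Here is a hedged sketch of how one learned sub-policy could be applied to a training sample. The transform callables are placeholders standing in for the actual color, geometric, and bounding-box operations of [4]; only the (transformation, probability, magnitude) structure mirrors Table 1.

```python
import random

# Placeholder transforms: each takes (image, boxes, magnitude) and returns new (image, boxes).
TRANSFORMS = {
    "Color":      lambda image, boxes, magnitude: (image, boxes),
    "TranslateX": lambda image, boxes, magnitude: (image, boxes),
}

def apply_sub_policy(image, boxes, sub_policy):
    """Apply the two operations of one sub-policy, each with its own probability."""
    for name, prob, magnitude in sub_policy:
        if random.random() < prob:
            image, boxes = TRANSFORMS[name](image, boxes, magnitude)
    return image, boxes

def augment(image, boxes, policy):
    """A policy is a list of sub-policies; one is chosen at random per sample."""
    return apply_sub_policy(image, boxes, random.choice(policy))
```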

tab-1

Table 1. Sub-policies and operations used in our experiment.

In addition to auto-augmentation, we also investigate the effect of traditional augmentation and the combination of traditional and automatic augmentation. For traditional data augmentation, we randomly apply several transformations to the image, such as rotation, mirroring, shearing, translation, and zoom. We propose different strategies to combine these data augmentation types: (1) the detector is first trained with AutoAugment and then trained with the traditional data augmentation; (2) training with the traditional augmentation first, then with auto augmentation; (3) training with AutoAugment on the original data and on the data generated by traditional augmentation. All augmentation strategies are evaluated with the same model architecture and training configuration. This allows us to explore which data augmentation method is suitable for the polyp detection problem.

Experiments

We use CVC-ClinicDB [14] for training and ETIS-Larib [15] for testing. This allows us to make a fair comparison with the MICCAI 2015 challenge results, which are reported on the same datasets. The CVC-CLINIC database contains 612 polyp image frames of 31 unique polyps from 31 different colonoscopy videos. The ETIS-LARIB dataset contains 196 high-resolution image frames of 44 different polyps.

fp_figure

Fig. 2. Examples of false-positive detections on the testing dataset. Green boxes and blue boxes are ground truths and predictions, respectively.

fn_figure

Fig. 3. Examples of false-negative detections on the testing dataset. Green boxes and blue boxes are ground truths and predictions, respectively.

Fig. 2 and Fig. 3 visualize several failed results from our model on the testing dataset, in which the blue boxes are the predicted locations and the green boxes are ground truths. The false-positive samples (Fig. 2) are caused by shortcomings in bowel preparation (i.e., leftovers of food and fluid in the colon), while the false-negative samples (Fig. 3) are caused by variations in polyp type and appearance (i.e., small polyps, flat polyps, and similarities between polyps and colon veins).

tab-3

Table 2. Comparison among traditional data augmentation (TDA), auto augmentation (AA) and their combinations.

Table 2 shows the comparative results between different augmentation strategies. The results show that the third combination method (AA-TDA-3) achieves higher Precision than AutoAugment, i.e., 75.90% and 74.51%, respectively. However, overall, auto augmentation (AA) achieves the best results because of its performance in covering the polyp miss rate (i.e., 152) with an acceptable number of false positives (i.e., 52). The competitive performance of auto augmentation (AA) confirms that the data augmentation policies learned on the COCO dataset transfer well to this problem [4].

tab-4

Table 3. Comparative results between our model and the state of the art.

Table 3 presents the comparative results between the auto augmentation in our model and other state-of-the-art results. Among the compared methods, CUMED, OUR, and UNS-UCLAN are end-to-end deep learning-based approaches. The results show that, compared to methods from the MICCAI challenge [3], auto augmentation achieves better performance on all metrics. Compared to the recent method [10], auto augmentation also achieves better performance on all metrics but FP. These results confirm the effectiveness of auto augmentation for polyp detection problems.

Conclusion

This study adopts a deep learning-based object detection method with automatic data augmentation for the polyp detection problem. Different augmentation strategies are evaluated. The experimental results show that the auto augmentation policies learned from a general object detection dataset transfer well to the polyp detection problem. Although auto augmentation achieves competitive results, it still has a high FP compared to the state of the art. This weakness can be improved by several post-processing techniques, such as false-positive learning.

Open Source

🍅 Github: https://github.com/aioz-ai/polyp-detection

🍓 Blog post: https://ai.aioz.io/blog/polyp-detection

Acknowledgements

This research was conducted by Phong Nguyen, Quang Tran, Erman Tjiputra, and Toan Do. We’d like to give special thanks to the other AIOZ AI team members for their support and feedback.

🎉 All the above contributions were incredibly enabling for this research. 🎉

Reference

[1] Hamidreza Sadeghi Gandomani, Mohammad Aghajani, et al., “Colorectal cancer in the world: incidence, mortality and risk factors,” Biomedical Research and Therapy, 2017.

[2] Florence Bénard, Alan N Barkun, et al., “Systematic review of colorectal cancer screening guidelines for average-risk adults: Summarizing the current global recommendations,” World Journal of Gastroenterology, 2018.

[3] Jorge Bernal, Nima Tajkbaksh, et al., “Comparative validation of polyp detection methods in video colonoscopy: results from the MICCAI 2015 endoscopic vision challenge,” IEEE Transactions on Medical Imaging, pp. 1231–1249, 2017.

[4] Barret Zoph, Ekin D Cubuk, et al., “Learning data augmentation strategies for object detection,” arXiv, 2019.

[5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.

[6] Ekin D Cubuk, Barret Zoph, et al., “AutoAugment: Learning augmentation strategies from data,” in CVPR, 2019.

[7] Younghak Shin, Hemin Ali Qadir, et al., “Automatic colon polyp detection using region based deep CNN and post learning approaches,” IEEE Access, 2018.

[8] Yangqing Jia, Evan Shelhamer, et al., “Caffe: Convolutional architecture for fast feature embedding,” in ACMMM, 2014.