
FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 3)

How we train our FedEFM and integrate its weights into various downstream tasks.

In this part, we dive deeply into how we train our FedEFM and integrate its weights into various downstream tasks.

Method

Notations

We describe the notations used in the methodology below.

| Notation | Description |
| --- | --- |
| $\theta_i(k)$ | Weights in silo $i$ in communication round $k$ |
| $\xi_i$ | Mini-batch data in silo $i$ |
| $\theta_{i \rightarrow j}$ | Weights in silo $i$ that are transferred to silo $j$ |
| $\hat{\theta}_{i \rightarrow j}$ | Successfully transferred weights from silo $j$ back to silo $i$ |
| $\mathcal{L}_c$ | Foundation loss function |
| $\alpha$ | Learning rate |
| $\mathcal{N}_i$ | In-neighbors of silo $i$ in the topology |
| $T$ | Temperature for distillation |
| $\vartheta \in \{0,1\}$ | Accumulation status |
| $N$ | Number of silos |

Federated Distillation

Figure 1 illustrates the algorithm used to train a foundation model within a decentralized federated learning process, effectively addressing the unseen data problem.

Specifically, in the initial round, the local model weights $\theta_i$ of each $i$-th hospital silo are trained using its respective local data $\xi_i$. In the next communication round, we first perform overseas training, where the local model weights $\theta_i$ of each $i$-th silo are transmitted to each of its $j$-th neighbor hospital silos. This process lets the local weights $\theta_i$ learn knowledge from the data $\xi_j$ of the $j$-th neighbor silo.

In a specific $(k+1)$-th communication round, each transferred weight $\theta_{i\rightarrow j}$ is optimized in the $j$-th silo using the following equation:

$$\theta_{i\rightarrow j}(k+1)= \theta_{i\rightarrow j}\left(k\right)-\alpha_{k}\nabla \mathcal{L}_{c}\left(\theta_{i\rightarrow j}\left(k\right),\xi_j\left(k\right)\right) \quad (1)$$
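As a rough illustration, the overseas update in Eq. (1) is simply a gradient step on the neighbor's data. The sketch below assumes a PyTorch model and a generic `loss_fn` standing in for $\mathcal{L}_c$; the function and argument names are illustrative, not part of the paper.

```python
import torch

def overseas_update(theta_ij, neighbor_batch, loss_fn, lr):
    """One overseas training step (Eq. 1): the visiting weights theta_{i->j}
    are optimized on a mini-batch xi_j drawn from the neighbor silo j."""
    inputs, targets = neighbor_batch
    loss = loss_fn(theta_ij(inputs), targets)             # foundation loss L_c on xi_j
    grads = torch.autograd.grad(loss, list(theta_ij.parameters()))
    with torch.no_grad():
        for p, g in zip(theta_ij.parameters(), grads):
            p -= lr * g                                   # plain SGD step of Eq. (1)
    return theta_ij
```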

Then, we perform knowledge transfer, where each learned overseas expert $\theta_{i\rightarrow j}$ from the previous step is transferred back to the $i$-th silo.

In the local silo $i$, the local weight is updated based on both the original weight $\theta_{i}$ and the transferred weights $\hat{\theta}_{i\rightarrow j}$ that are learned from the neighbor silos $j$. In particular, we aim to find regions that share similarities between the two weights using the Earth Mover's Distance $\text{EMD}(\theta_{i}, \hat{\theta}_{i\rightarrow j})$. In this way, the distance measures the contribution of the transferred weights during distillation, enabling the local silo to learn from its neighbors while avoiding divergence when weight convergence goals differ significantly. The local weights $\theta_{i}$ are then optimized using the following equation:

$$\theta_{i}(k+1) = \theta_i(k)-\alpha_{k}\sum_{j \in \mathcal{N}(i)}\text{EMD}\left( \theta_{i}, \hat{\theta}_{i\rightarrow j},k\right)\nabla \mathcal{L}^i_{\rm MD}\left({\theta}_i\left(k\right),{\hat \theta}_{i\rightarrow j}\left(k\right),\xi_i\left(k\right)\right) \quad (2)$$
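A minimal sketch of the update in Eq. (2) follows, assuming the returned experts, their EMD scores, and an `md_loss_fn` implementing $\mathcal{L}^i_{\rm MD}$ (see the Training section) are already available; all names are illustrative assumptions.

```python
import torch

def local_distillation_update(theta_i, returned_experts, emd_scores, batch, md_loss_fn, lr):
    """One local update (Eq. 2): the local weights theta_i are pulled toward each
    returned overseas expert hat(theta)_{i->j}, with every neighbor's contribution
    scaled by its EMD score."""
    total_grads = [torch.zeros_like(p) for p in theta_i.parameters()]
    for theta_hat_ij, emd in zip(returned_experts, emd_scores):
        loss = md_loss_fn(theta_i, theta_hat_ij, batch)          # L_MD^i(theta_i, hat(theta)_{i->j}, xi_i)
        grads = torch.autograd.grad(loss, list(theta_i.parameters()))
        for acc, g in zip(total_grads, grads):
            acc += emd * g                                        # EMD-weighted contribution
    with torch.no_grad():
        for p, g in zip(theta_i.parameters(), total_grads):
            p -= lr * g                                           # gradient step of Eq. (2)
    return theta_i
```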

Differentiable Earth Mover's Distance

Assume that an input sample $\xi_i$ from the $i$-th local silo passes through the foundation architecture $\theta_{i}$ to generate the dense representation $\mathbf{U} \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ denote the spatial size of the feature map and $C$ is the feature dimension. In a parallel manner, $\mathbf{V} \in \mathbb{R}^{H \times W \times C}$ denotes the dense representation obtained when $\xi_i$ passes through $\hat{\theta}_{i\rightarrow j}$.
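For instance, with a convolutional backbone (a ResNet-18 here, purely as an assumed stand-in for the actual foundation architecture), the two dense representations can be taken from the last feature map:

```python
import torch
import torchvision

# Assumed backbones standing in for theta_i and hat(theta)_{i->j}.
backbone_i = torchvision.models.resnet18(weights=None)
backbone_ij = torchvision.models.resnet18(weights=None)
feature_i = torch.nn.Sequential(*list(backbone_i.children())[:-2])    # drop pooling + classifier
feature_ij = torch.nn.Sequential(*list(backbone_ij.children())[:-2])

xi_i = torch.randn(1, 3, 224, 224)                  # one local sample (toy shape)
U = feature_i(xi_i).squeeze(0).permute(1, 2, 0)     # dense representation (H, W, C)
V = feature_ij(xi_i).squeeze(0).permute(1, 2, 0)    # dense representation (H, W, C)
```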

In the Earth Mover setting, $\mathbf{V}$ represents suppliers transporting goods to demanders $\mathbf{U}$. Then, the $\text{EMD}$ between the two feature sets $\mathbf{U} = \{u_1, u_2, \ldots, u_{HW}\}$ and $\mathbf{V} = \{v_1, v_2, \ldots, v_{HW}\}$ can be computed as:

$$\text{EMD}(\theta_{i}, \hat{\theta}_{i \rightarrow j}) = \text{EMD}(\mathbf{U}, \mathbf{V}) = \sum_{p=1}^{HW} \sum_{q=1}^{HW} (1 - c_{pq}) \tilde{x}_{pq} \quad (3)$$

where $\tilde{x}$ is taken from the optimal matching flow $\tilde{X} = \{x_1, x_2, \ldots, x_{pq}\}$ for each sample pair of the two sets $\mathbf{U}$ and $\mathbf{V}$; $c_{pq}$ is the cost per unit transported from supplier to demander and is obtained by computing the pairwise distance between embedding nodes $u_p \subset \mathbf{U}$ and $v_q \subset \mathbf{V}$.

The cost per unit $c_{pq}$ is computed as below and also plays a key role in computing the optimal matching flow:

$$c_{pq} = 1 - \frac{u_p^T v_q}{\|u_p\|\|v_q\|}$$

where nodes with similar representations tend to generate small matching costs between each other. Then, the optimal matching flow $\tilde{X}$ is obtained by optimizing $\tilde{x}$ as below:

$$\begin{aligned}
\underset{x}{\text{minimize}} \quad & \sum_{p=1}^{HW} \sum_{q=1}^{HW} c_{pq} x_{pq} \\
\text{subject to} \quad & x_{pq} > 0, \quad p = 1, \ldots, HW, \; q = 1, \ldots, HW \\
& \sum_{p=1}^{HW} x_{pq} = v_q, \quad q = 1, \ldots, HW \\
& \sum_{q=1}^{HW} x_{pq} = u_p, \quad p = 1, \ldots, HW
\end{aligned}$$

Here, EMD seeks an optimal matching $\tilde{X}$ between suppliers and demanders such that the overall matching cost is minimized. The globally optimal matching flow $\tilde{X}$ can be achieved by solving a Linear Programming (LP) problem. For the sake of completeness, we transform the above optimization into a compact matrix form:

$$\begin{aligned}
\underset{x}{\text{minimize}} \quad & c(\theta)^T x \\
\text{subject to} \quad & G(\theta)x \leq h(\theta), \\
& A(\theta)x = b(\theta).
\end{aligned}$$

Here, $x \in \mathbb{R}^{HW \times HW}$ is our optimization variable. $Ax = b$ represents the equality constraint and $Gx \leq h$ denotes the inequality constraint in our optimization problem. Accordingly, the Lagrangian of the LP problem is given by:

$$L(\theta, x, \nu, \lambda) = c^T x + \lambda^T (Gx - h) + \nu^T (Ax - b),$$

where $\nu$ denotes the dual variables on the equality constraints and $\lambda \geq 0$ denotes the dual variables on the inequality constraints. Following the KKT conditions, we obtain the optimum $(\tilde{x}, \tilde{\nu}, \tilde{\lambda})$ of the objective function by solving $g(\theta, \tilde{x}, \tilde{\nu}, \tilde{\lambda}) = 0$ with primal-dual interior point methods, where

$$g(\theta, x, \nu, \lambda) = \begin{bmatrix} \nabla_{x} L(\theta, x, \nu, \lambda) \\ \textbf{diag}(\lambda)(G(\theta)x - h(\theta)) \\ A(\theta)x - b(\theta) \end{bmatrix}.$$

Then, with the theorem below, we can derive the gradients with respect to the LP parameters.

Suppose $g(\theta, \tilde{\lambda}, \tilde{\nu}, \tilde{x}) = 0$. Then, when all derivatives exist, the partial Jacobian of $\tilde{x}$ with respect to $\theta$ at the optimal solution $(\tilde{\lambda}, \tilde{\nu}, \tilde{x})$, namely $J_{\theta}\tilde{x}$, can be obtained by satisfying:

$$J_{\theta}\tilde{x} = - \left( J_{x} g(\theta, \tilde{\lambda}, \tilde{\nu}, \tilde{x}) \right)^{-1} J_{\theta} g(\theta, \tilde{x}, \tilde{\nu}, \tilde{\lambda}).$$

Then, applying the KKT conditions, the (partial) Jacobian with respect to $\theta$ can be defined as:

$$J_{\theta} g(\theta, \tilde{\lambda}, \tilde{\nu}, \tilde{x}) = \begin{bmatrix} J_{\theta} \nabla_{x} L(\theta, \tilde{x}, \tilde{\nu}, \tilde{\lambda}) \\ \textbf{diag}(\tilde{\lambda}) J_{\theta} (G(\theta)x - h(\theta)) \\ J_{\theta} (A(\theta) \tilde{x} - b(\theta)) \end{bmatrix}$$

After obtaining the optimal $\tilde{x}$, we can derive a closed-form gradient for $\theta$, enabling efficient backpropagation without altering the optimization path.
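To make the forward computation concrete, the sketch below builds the cosine cost matrix, solves the transport LP with `scipy.optimize.linprog`, and evaluates Eq. (3). The supply/demand vectors `u`, `v` and the fixed-flow gradient shortcut mentioned in the comments are illustrative assumptions, not the paper's exact differentiable solver.

```python
import numpy as np
from scipy.optimize import linprog

def cost_matrix(U, V, eps=1e-8):
    """Pairwise cost c_pq = 1 - cosine(u_p, v_q) between the flattened
    dense representations U, V of shape (H, W, C)."""
    U = U.reshape(-1, U.shape[-1])                        # (HW, C)
    V = V.reshape(-1, V.shape[-1])
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + eps)
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + eps)
    return 1.0 - U @ V.T                                  # (HW, HW) cost matrix

def emd_forward(cost, u, v):
    """Solve the transport LP and return the EMD of Eq. (3).
    u, v are the supply/demand weights (assumed non-negative with equal totals).
    Only the forward pass is sketched; the paper differentiates through the KKT
    conditions, while a simpler approximation would treat the optimal flow as
    a constant and backpropagate through `cost` only."""
    n = cost.shape[0]
    A_eq = np.zeros((2 * n, n * n))
    for p in range(n):
        A_eq[p, p * n:(p + 1) * n] = 1.0                  # sum_q x_pq = u_p
    for q in range(n):
        A_eq[n + q, q::n] = 1.0                           # sum_p x_pq = v_q
    b_eq = np.concatenate([u, v])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    x_tilde = res.x.reshape(n, n)                         # optimal matching flow
    return float(np.sum((1.0 - cost) * x_tilde))          # EMD as in Eq. (3)

# Toy usage with uniform supplies/demands over HW = 4 positions:
U = np.random.rand(2, 2, 8)
V = np.random.rand(2, 2, 8)
c = cost_matrix(U, V)
print(emd_forward(c, u=np.full(4, 0.25), v=np.full(4, 0.25)))
```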

Figure 1. Federated Knowledge Distillation pipeline with EMD Distance

Training

The distillation loss of the $i$-th silo, $\mathcal{L}^i_{\rm MD}$, based on the student model loss, is designed as:

$$\mathcal{L}^i_{\rm MD} = \beta T^2 \sum^{\mathcal{N}(i)}_{j=1} \left( \mathcal{L}_{c}(Q^\tau_{S_i}, Q^\tau_{T_{i\rightarrow j}}) \right) + (1-\beta)\mathcal{L}_{c}(Q_{S_i},y^i_{true}) \quad (11)$$

where $Q_S$ is the standard softmax output of the local student; $y^i_{true}$ are the ground-truth labels; $\beta$ is a hyper-parameter for controlling the importance of each loss component; and $Q^\tau_{S_i}, Q^\tau_{T_{i\rightarrow j}}$ are the softened outputs of the $i$-th local student and the $j$-th overseas teachers, computed using the same temperature parameter:

$$Q^\tau_k = \frac{\exp(l_k/T)}{\sum_{k} \exp(l_k/T)}$$

where the logit $l$ is output from the pre-final layer of both the teacher and student models. Besides, the objective function computed for each $j$-th contributed transferable weight is controlled by the corresponding EMD to ensure learning convergence.
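A possible implementation of the loss in Eq. (11) is sketched below, using KL divergence for the softened term and cross-entropy for the supervised term as stand-ins for the foundation loss $\mathcal{L}_c$; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, targets, T=2.0, beta=0.5):
    """Sketch of Eq. (11) for silo i: a temperature-softened term against every
    overseas teacher plus a supervised term on the local labels."""
    soft_student = F.log_softmax(student_logits / T, dim=1)        # softened Q^tau_{S_i}
    kd_term = 0.0
    for teacher_logits in teacher_logits_list:                     # one teacher per neighbor j
        soft_teacher = F.softmax(teacher_logits / T, dim=1)        # softened Q^tau_{T_{i->j}}
        kd_term = kd_term + F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    hard_term = F.cross_entropy(student_logits, targets)           # L_c(Q_{S_i}, y_true)
    return beta * (T ** 2) * kd_term + (1.0 - beta) * hard_term
```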

When the training in all silos is completed in each communication round, the local model weights of all silos are aggregated to obtain the global weights $\Theta = \sum^{N-1}_{i = 0}\vartheta_i{\theta}_i$, which are further utilized for downstream fine-tuning.
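As a minimal sketch (with illustrative names, assuming PyTorch modules and that each $\vartheta_i$ is a plain 0/1 flag), the aggregation can be written as:

```python
import torch

def aggregate(global_template, silo_weights, statuses):
    """Sketch of the aggregation Theta = sum_i vartheta_i * theta_i, where
    vartheta_i in {0, 1} marks whether silo i's weights are accumulated."""
    agg = {k: torch.zeros_like(v) for k, v in global_template.state_dict().items()}
    for theta_i, vartheta_i in zip(silo_weights, statuses):
        for k, v in theta_i.state_dict().items():
            agg[k] += vartheta_i * v            # accumulate contributing silos only
    global_template.load_state_dict(agg)        # resulting global weights Theta
    return global_template
```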

Next

In the next part, we will validate the effectiveness of our FedEFM on our Endovascular Intervention Dataset.
