FedEFM - Federated Endovascular Foundation Model with Unseen Data (Part 3)
In this part, we dive deeply into how to train our FedEFM and integrate its weights into various downstream tasks.
Method
Notations
The notations used in the methodology are described below:
| Notation | Description |
|---|---|
| $\theta_i(k)$ | Weights in silo $i$ in communication round $k$ |
| $\xi_i$ | Mini-batch data in silo $i$ |
| $\theta_{i \rightarrow j}$ | Weights in silo $i$ that are transferred to silo $j$ |
| $\hat{\theta}_{j \rightarrow i}$ | Successfully transferred weights from silo $j$ back to silo $i$ |
| $\mathcal{L}_{\mathrm{fd}}$ | Foundation loss function |
| $\alpha$ | Learning rate |
| $\mathcal{N}(i)$ | In-neighbors of silo $i$ in the topology |
| $\tau$ | Temperature for distillation |
| $\epsilon$ | Accumulation status |
| $N$ | Number of silos |
Federated Distillation
Figure 1 demonstrates the algorithm used for training a foundation model within a decentralized federated learning process, effectively addressing the unseen data problem.
Specifically, in the initial round, the local model weights $\theta_i$ of each $i$-th hospital silo are trained using their respective local data $\xi_i$. In the next communication round, we first perform overseas training, where the local model weights $\theta_{i \rightarrow j}$ of each $i$-th silo are transmitted to each of their $j$-th neighbor hospital silos. This process lets the local weights learn knowledge from the data of their $j$-th neighbor silos.
In a specific $k$-th communication round, each transferred weight $\theta_{i \rightarrow j}$ is optimized in the $j$-th silo using the following equation:

$$\theta_{i \rightarrow j}(k+1) = \theta_{i \rightarrow j}(k) - \alpha \nabla \mathcal{L}\big(\theta_{i \rightarrow j}(k), \xi_j(k)\big)$$

where $\xi_j(k)$ is a mini-batch sampled from silo $j$'s local data and $\mathcal{L}$ is the task loss.
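To make the overseas step concrete, here is a minimal PyTorch sketch of one such update. It assumes a standard supervised task loss; all names are illustrative rather than taken from the original implementation.

```python
import torch

def overseas_training_step(theta_ij, neighbor_batch, loss_fn, lr):
    """One overseas update: silo i's weights train on silo j's data.

    theta_ij: model holding the transferred weights theta_{i->j},
              currently hosted at the j-th neighbor silo.
    neighbor_batch: mini-batch (x, y) drawn from silo j's local data xi_j.
    """
    x, y = neighbor_batch
    optimizer = torch.optim.SGD(theta_ij.parameters(), lr=lr)
    optimizer.zero_grad()
    loss = loss_fn(theta_ij(x), y)  # forward pass on neighbor data
    loss.backward()                 # gradient of the task loss
    optimizer.step()                # SGD update, as in the equation above
    return loss.item()
```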
Then, we perform knowledge transfer, where each learned overseas expert from the previous step is transferred back to its home silo; the weights that successfully return from silo $j$ to silo $i$ are denoted $\hat{\theta}_{j \rightarrow i}$.
In the local silo $i$, the local weight $\theta_i$ is updated based on both the original weight and the transferred weights $\hat{\theta}_{j \rightarrow i}$ learned from the neighbor silos $j \in \mathcal{N}(i)$. In particular, we aim to find regions that share similarities between two weights using the Earth Mover's Distance (EMD). In this way, the distance measures the contribution of the transferred weights during distillation, enabling the local silo to learn from its neighbors while avoiding divergence when weight convergence goals differ significantly. The local weights $\theta_i$ are then optimized using the following equation:

$$\theta_i(k+1) = \theta_i(k) - \alpha \nabla \mathcal{L}_{\mathrm{fd}}\big(\theta_i(k), \hat{\theta}_{j \rightarrow i}(k), \xi_i(k)\big)$$

where $\mathcal{L}_{\mathrm{fd}}$ is the EMD-weighted foundation loss defined in the Training section below.
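The following PyTorch sketch ties these pieces together into one local update. Both `emd_similarity` and `foundation_distillation_loss` are hypothetical helper names standing in for the differentiable EMD layer and the foundation loss detailed in the next sections.

```python
import torch

def local_update(student, overseas_experts, local_batch, lr, beta, tau):
    """One local distillation step at silo i (names are illustrative).

    student: model holding the local weights theta_i.
    overseas_experts: models holding the returned weights theta_{j->i}.
    """
    x, y = local_batch
    # Teacher predictions serve as targets only, so they carry no gradient.
    teacher_logits = [expert(x).detach() for expert in overseas_experts]
    # EMD between the feature maps of the student and each expert;
    # similar weights contribute more strongly during distillation.
    emd_scores = [emd_similarity(student, expert, x) for expert in overseas_experts]
    optimizer = torch.optim.SGD(student.parameters(), lr=lr)
    optimizer.zero_grad()
    loss = foundation_distillation_loss(student(x), teacher_logits,
                                        emd_scores, y, beta=beta, tau=tau)
    loss.backward()
    optimizer.step()
    return loss.item()
```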
Differentiable Earth Mover's Distance
Assume that the input sample $\xi_i$ from the $i$-th local silo passes through the foundation architecture $\theta_i$ to generate the dense representation $\mathbf{U} \in \mathbb{R}^{HW \times C}$, where $H$ and $W$ denote the spatial size of the feature map and $C$ is the feature dimension. In a parallel manner, $\mathbf{V} \in \mathbb{R}^{HW \times C}$ also denotes the dense representation when $\xi_i$ passes through $\hat{\theta}_{j \rightarrow i}$.
Under the Earth Mover circumstance, $\mathbf{U}$ represents suppliers transporting goods to demanders $\mathbf{V}$. Then, $\mathrm{EMD}$ between the two feature sets $\mathbf{U}$ and $\mathbf{V}$ can be computed as:

$$\mathrm{EMD}(\mathbf{U}, \mathbf{V}) = \sum_{p=1}^{HW} \sum_{q=1}^{HW} (1 - c_{pq})\, \tilde{x}_{pq}$$
where $\tilde{x}_{pq}$ is conducted from the optimal matching flow $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{pq}\}$ for each sample pair of the two sets $\mathbf{U}$ and $\mathbf{V}$; $c_{pq}$ is the cost per unit transported from supplier $u_p$ to demander $v_q$ and is obtained by computing the pairwise distance between the embedding nodes $u_p \subset \mathbf{U}$ and $v_q \subset \mathbf{V}$.
The cost per unit $c_{pq}$ is computed as below and also plays a pivotal role in computing the optimal matching flow:

$$c_{pq} = 1 - \frac{u_p^{\top} v_q}{\lVert u_p \rVert \, \lVert v_q \rVert}$$
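As a concrete illustration, this cosine-based cost matrix takes only a few lines of PyTorch; `emd_cost_matrix` is an assumed helper name, with `U` and `V` being the $[HW, C]$ feature sets defined above.

```python
import torch
import torch.nn.functional as F

def emd_cost_matrix(U: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Pairwise cost c_pq = 1 - cos(u_p, v_q).

    U: [HW, C] dense features from the local weights theta_i.
    V: [HW, C] dense features from the returned weights theta_{j->i}.
    Returns a [HW, HW] cost matrix; similar nodes yield small costs.
    """
    U = F.normalize(U, dim=-1)  # unit-norm embedding nodes
    V = F.normalize(V, dim=-1)
    return 1.0 - U @ V.t()
```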
where nodes with similar representations tend to generate small matching costs between each other. Then, the optimal matching flow $\tilde{X}$ is conducted by optimizing $x_{pq}$ as below:

$$\begin{aligned}
\underset{x}{\text{minimize}} \quad & \sum_{p=1}^{HW} \sum_{q=1}^{HW} c_{pq}\, x_{pq} \\
\text{subject to} \quad & x_{pq} \geq 0, \quad \sum_{q=1}^{HW} x_{pq} = s_p, \quad \sum_{p=1}^{HW} x_{pq} = d_q
\end{aligned}$$

where $s_p$ and $d_q$ denote the supply of supplier $u_p$ and the demand of demander $v_q$, respectively.
Here, EMD seeks an optimal matching $\tilde{X}$ between suppliers and demanders such that the overall matching cost is minimized. The global optimal matching flows $\tilde{X}$ can be achieved by solving a Linear Programming problem (LP). For the sake of completeness, we transform the above optimization to a compact matrix form:

$$\underset{x}{\text{minimize}} \quad c^{\top} x \quad \text{subject to} \quad Gx \leq h, \quad Ax = b$$
Here, $x$ is our optimization variable, obtained by flattening the matching flow $X$. $Ax = b$ represents the equality constraints and $Gx \leq h$ denotes the inequality constraints in our optimization problem. Accordingly, the Lagrangian of the LP problem is given by:

$$L(x, \nu, \lambda) = c^{\top} x + \lambda^{\top}(Gx - h) + \nu^{\top}(Ax - b)$$
where $\nu$ denotes the dual variables on the equality constraints and $\lambda \geq 0$ denotes the dual variables on the inequality constraints. Following the KKT conditions, we obtain the optimum $(\tilde{x}, \tilde{\nu}, \tilde{\lambda})$ of the objective function by solving $g(\tilde{x}, \tilde{\nu}, \tilde{\lambda}) = 0$ with primal-dual interior point methods, where

$$g(x, \nu, \lambda) = \begin{bmatrix} \nabla_x L(x, \nu, \lambda) \\ \mathbf{diag}(\lambda)(Gx - h) \\ Ax - b \end{bmatrix}$$
Then, with the theorem below, we can derive the gradients with respect to the LP parameters $\theta$ (here, the cost $c$).
Suppose $g(\tilde{x}, \tilde{\nu}, \tilde{\lambda}; \theta) = 0$. Then, when all derivatives exist, the partial Jacobian of $\tilde{x}$ with respect to $\theta$ at the optimal solution $(\tilde{x}, \tilde{\nu}, \tilde{\lambda})$, namely $J_{\theta} \tilde{x}$, can be obtained by satisfying:

$$J_{\theta} \tilde{x} = -\big(J_{x}\, g(\tilde{x}, \tilde{\nu}, \tilde{\lambda})\big)^{-1} J_{\theta}\, g(\tilde{x}, \tilde{\nu}, \tilde{\lambda})$$
Then, applying this theorem to the KKT conditions, the (partial) Jacobian with respect to $\theta$ can be defined as:

$$J_{\theta}\, g(\tilde{x}, \tilde{\nu}, \tilde{\lambda}) = \begin{bmatrix} J_{\theta}\, \nabla_x L(\tilde{x}, \tilde{\nu}, \tilde{\lambda}) \\ \mathbf{diag}(\tilde{\lambda})\, J_{\theta}(G\tilde{x} - h) \\ J_{\theta}(A\tilde{x} - b) \end{bmatrix}$$
After obtaining the optimal $\tilde{x}$, we can derive a closed-form gradient for $\theta$, enabling efficient backpropagation without altering the optimization path.
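Putting the pieces together, here is a minimal PyTorch sketch of an EMD layer. The forward pass solves the transportation LP with SciPy's HiGHS solver; for brevity, the backward pass uses the common practical approximation of holding the optimal flow fixed, whereas the derivation above gives the exact implicit gradient through the KKT conditions. All names are illustrative.

```python
import numpy as np
import torch
from scipy.optimize import linprog

class EMDFunction(torch.autograd.Function):
    """Differentiable EMD sketch: LP forward, approximate backward."""

    @staticmethod
    def forward(ctx, cost, supply, demand):
        p, q = cost.shape
        c = cost.detach().cpu().numpy().reshape(-1)
        # Equality constraints: row sums = supply, column sums = demand.
        A_eq = np.zeros((p + q, p * q))
        for i in range(p):
            A_eq[i, i * q:(i + 1) * q] = 1.0  # flow out of supplier i
        for j in range(q):
            A_eq[p + j, j::q] = 1.0           # flow into demander j
        b_eq = np.concatenate([supply.cpu().numpy(), demand.cpu().numpy()])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        flow = torch.as_tensor(res.x, dtype=cost.dtype).view(p, q).to(cost.device)
        ctx.save_for_backward(flow)
        # Similarity-style score: sum of (1 - c_pq) * optimal flow.
        return ((1.0 - cost) * flow).sum()

    @staticmethod
    def backward(ctx, grad_output):
        (flow,) = ctx.saved_tensors
        # d(score)/d(cost) = -flow when the flow is held fixed.
        return grad_output * (-flow), None, None

# Usage, with uniform supplies/demands over the HW nodes (an assumption):
# cost = emd_cost_matrix(U, V)                    # [HW, HW], requires grad
# s = torch.full((cost.shape[0],), 1.0 / cost.shape[0])
# d = torch.full((cost.shape[1],), 1.0 / cost.shape[1])
# score = EMDFunction.apply(cost, s, d)           # differentiable w.r.t. cost
```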
Training
The distillation loss of the $i$-th silo, built on the student model loss, is designed as:

$$\mathcal{L}_{\mathrm{fd}} = (1 - \beta)\, \mathcal{L}_{c}\big(\sigma(z_i), y\big) + \beta \sum_{j \in \mathcal{N}(i)} \mathrm{EMD}\big(\theta_i, \hat{\theta}_{j \rightarrow i}\big)\, \mathcal{L}_{d}\!\left(\sigma\!\left(\frac{z_{j \rightarrow i}}{\tau}\right), \sigma\!\left(\frac{z_i}{\tau}\right)\right)$$
where $\sigma(z_i)$ is the standard softmax output of the local student; $y$ is the ground-truth label; $\mathcal{L}_c$ and $\mathcal{L}_d$ denote the supervised and distillation loss terms, respectively; $\beta$ is a hyper-parameter controlling the importance of each loss component; and $\sigma(z_i/\tau)$, $\sigma(z_{j \rightarrow i}/\tau)$ are the softened outputs of the $i$-th local student and the $j$-th overseas teachers, computed with the same temperature parameter $\tau$:

$$\sigma\!\left(\frac{z}{\tau}\right)_k = \frac{\exp(z_k / \tau)}{\sum_{k'} \exp(z_{k'} / \tau)}$$

where the logit $z$ is output from the pre-final layer of both the teacher and student models. Besides, the objective computed for each $j$-th contributed transferable weight is controlled by the corresponding EMD value to ensure the learning convergence.
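The loss might be implemented as in the following PyTorch sketch, assuming cross-entropy for the supervised term $\mathcal{L}_c$ and KL divergence for the distillation term $\mathcal{L}_d$; the function name and default hyper-parameters are illustrative.

```python
import torch.nn.functional as F

def foundation_distillation_loss(student_logits, teacher_logits_list,
                                 emd_scores, targets, beta=0.5, tau=2.0):
    """EMD-weighted distillation loss L_fd for silo i (sketch).

    student_logits: z_i from the local student.
    teacher_logits_list: z_{j->i} from each overseas teacher j in N(i).
    emd_scores: EMD(theta_i, theta_{j->i}) weighting each teacher's term.
    """
    # Supervised term on the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    # Softened distillation term, one KL divergence per overseas teacher.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    kd = 0.0
    for z_teacher, w in zip(teacher_logits_list, emd_scores):
        p_teacher = F.softmax(z_teacher / tau, dim=-1)
        kd = kd + w * F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return (1.0 - beta) * ce + beta * kd
```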
When the training in all silos is completed in each communication round, the local model weights $\theta_i$ of all $N$ silos are aggregated to obtain the global weights $\theta$, which are further utilized for downstream fine-tuning.
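Assuming a simple uniform average of the local weights (a FedAvg-style rule; the exact aggregation scheme is an assumption here), the aggregation step can be sketched as:

```python
import torch

def aggregate(silo_state_dicts):
    """Average the N local model state dicts into the global weights theta."""
    return {
        key: torch.stack([sd[key].float() for sd in silo_state_dicts]).mean(dim=0)
        for key in silo_state_dicts[0]
    }
```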
Next
In the next part, we will validate the effectiveness of our FedEFM on our Endovascular Intervention Dataset.