Reducing Non-IID Effects in Federated Autonomous Driving with Contrastive Divergence Loss (Part 2)
In the previous post, we explored federated learning in the context of autonomous driving, its potential, and the challenges posed by non-IID data distributions. We also discussed how contrastive divergence offers a promising avenue for tackling the non-IID problem in federated learning setups. Building on that foundation, this post examines how a contrastive divergence loss can be integrated into the federated learning process, and what this integration implies for model performance and non-IID robustness in autonomous driving scenarios.
1. Overview
Motivation: The effectiveness of federated learning algorithms in autonomous driving hinges on two critical factors: the ability of each local silo to glean meaningful insights from its own data, and the synchronization among neighboring silos to mitigate the impact of the non-IID problem. Recent efforts have primarily addressed these challenges by optimizing the accumulation process and optimizers, proposing novel network topologies, or leveraging robust deep networks capable of handling the distributed nature of the data. However, as highlighted by Duan et al., the indiscriminate adoption of high-performance deep architectures and their associated optimizations from centralized learning scenarios can increase the weight variance among local silos during the accumulation process in federated setups. This variance harms model convergence and may even induce divergence, underscoring the need for nuanced approaches to ensure the efficacy of federated learning in autonomous driving contexts.
Siamese Network Approach: In our study, we propose a novel approach that directly tackles the non-IID problem within each local silo by addressing the challenges of learning optimal features and achieving synchronization separately. Our strategy implements \textit{two distinct networks within each silo}: one network extracts meaningful features from local image data, while the other minimizes the distribution gap between the current model weights and those of neighboring silos. To facilitate this, we employ a Siamese Network architecture comprising two branches. The first branch, serving as the backbone network, learns local image features for autonomous steering using a local regression loss $\mathcal{L}_{\rm lr}$, while simultaneously incorporating a positive contrastive divergence loss $\mathcal{L}^{+}_{\rm cd}$ to assimilate knowledge from neighboring silos. Meanwhile, the second branch, referred to as the sub-network, regulates divergence factors arising from the backbone's knowledge through a contrastive regularizer term $\mathcal{L}^{-}_{\rm cd}$. See Figure 1 for more detail.
In practice, the sub-network initially adopts the same weights as the backbone during the first communication round. From subsequent rounds onward, once the backbone has been accumulated using the equation below, each silo's local model is trained with the contrastive divergence loss. The sub-network produces auxiliary features with the same dimensions as the backbone's output features. Throughout training, we expect only minimal weight discrepancies between the backbone and the sub-network when the contrastive divergence loss is employed; the weights across all silos become synchronized when the gradients from the backbone and sub-network learning processes exhibit minimal disparity.
$$\theta_i(k+1) = \sum_{j \in \mathcal{N}_i \cup \{i\}} \mathbf{A}_{i,j}\,\theta_j(k) - \alpha_k \nabla \mathcal{L}_{\rm lr}\left(\theta_i(k)\right)$$

where $\mathcal{L}_{\rm lr}$ is the local regression loss for autonomous steering, $\mathcal{N}_i$ is the set of neighboring silos of silo $i$, $\mathbf{A}_{i,j}$ is the accumulation weight between silos $i$ and $j$, and $\alpha_k$ is the learning rate at round $k$.
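To make the two-branch setup concrete, here is a minimal NumPy sketch of a silo holding a backbone and a sub-network that starts as an exact copy of it and emits auxiliary features of identical dimensions. All names (`FeatureNet`, `features`) are hypothetical stand-ins, not the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

class FeatureNet:
    """Toy linear feature extractor standing in for a real deep network."""
    def __init__(self, w):
        self.w = np.array(w, dtype=float)  # the silo's model weights

    def features(self, x):
        return x @ self.w  # backbone and auxiliary features share this shape

# Communication round 0: the sub-network adopts the backbone's weights.
backbone = FeatureNet(rng.normal(size=(4, 8)))
subnet = FeatureNet(backbone.w.copy())

x = rng.normal(size=(2, 4))  # a small batch of local samples
f_b, f_s = backbone.features(x), subnet.features(x)

# Identical dimensions and, before any local training, identical values.
assert f_b.shape == f_s.shape
assert np.allclose(f_b, f_s)
```

From the second round onward, the two sets of weights drift apart and the contrastive divergence loss is what keeps that drift small.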
2. Contrastive Divergence Loss
In practice, we have noticed that the initial phases of federated learning often yield subpar accumulated models. Unlike approaches that tackle the non-IID issue by refining the accumulation step whenever silos transmit their models, we directly mitigate the impact of divergence factors during the local learning phase of each silo. Our method minimizes the discrepancy between the distribution of the weights accumulated from neighboring silos in the backbone network (representing divergence factors) and the weights of silo $i$ itself in the sub-network (comprising locally learned knowledge). Once the distributions between silos reach an acceptable level of synchronization, we reduce the influence of the sub-network and prioritize the steering angle prediction task. Inspired by the contrastive loss of the original Siamese Network, our proposed Contrastive Divergence Loss is formulated as follows:
$$\mathcal{L}_{\rm cd} = \mathcal{L}^{+}_{\rm cd} + \mathcal{L}^{-}_{\rm cd}$$

where $\mathcal{L}^{+}_{\rm cd}$ is the positive contrastive divergence term and $\mathcal{L}^{-}_{\rm cd}$ is the negative regularizer term; $\mathcal{L}_{\rm KL}$ is the Kullback-Leibler divergence loss function:
$$\mathcal{L}_{\rm KL}\left(\hat{y}, y\right) = \sum \hat{y}\,\log\frac{\hat{y}}{y}$$

where $\hat{y}$ is the predicted representation and $y$ is the dynamic soft label.
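As an illustration of the loss above, the following sketch computes the KL divergence between a softmax "predicted representation" and a "dynamic soft label"; the concrete logits are invented for the example:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_loss(y_hat, y, eps=1e-12):
    """KL(y_hat || y) between a predicted representation and a soft label."""
    y_hat = np.clip(y_hat, eps, 1.0)
    y = np.clip(y, eps, 1.0)
    return float(np.sum(y_hat * np.log(y_hat / y)))

y_hat = softmax(np.array([2.0, 1.0, 0.1]))  # predicted representation
y = softmax(np.array([1.8, 1.1, 0.2]))      # dynamic soft label (assumed)

loss = kl_loss(y_hat, y)
assert loss >= 0.0                  # KL divergence is non-negative
assert kl_loss(y_hat, y_hat) == 0.0  # and zero for identical distributions
```

Note that KL is asymmetric, which is exactly why the positive term and the negative regularizer below differ by swapping the roles of the two networks.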
Consider $\mathcal{L}^{+}_{\rm cd}$ in the equation above as a Bayesian statistical inference task: our goal is to estimate the model parameters $\theta^{b}$ by minimizing the Kullback-Leibler divergence between the measured regression probability distribution $Q^{0}$ of the observed local silo and the accumulated model distribution $Q^{\theta^{b}}_{\infty}$. Hence, we can assume that the model distribution has the form $p(x;\theta^{b}) = e^{-E(x;\theta^{b})}/Z(\theta^{b})$, where $Z(\theta^{b})$ is the normalization term. However, evaluating the normalization term is not trivial, which risks getting stuck in a local minimum. Inspired by Hinton, we use samples obtained through a Markov Chain Monte Carlo (MCMC) procedure with a specific initialization strategy to deal with this problem. As inferred from the equation above, the gradient of $\mathcal{L}^{+}_{\rm cd}$ can be expressed under the SGD algorithm in a local silo by setting:
$$\frac{\partial \mathcal{L}^{+}_{\rm cd}}{\partial \theta^{b}} = \frac{\partial}{\partial \theta^{b}}\left(\mathcal{L}_{\rm KL}\big(Q^{0}\,\|\,Q^{\theta^{b}}_{\infty}\big) - \mathcal{L}_{\rm KL}\big(Q^{k}_{\theta^{s}}\,\|\,Q^{\theta^{b}}_{\infty}\big)\right)$$

where $Q^{k}_{\theta^{s}}$ is the measured probability distribution on the samples obtained by initializing the chain at $Q^{0}$ and running the Markov chain forward for a defined number of steps $k$.
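The CD-k recipe above (initialize the chain at the data, run it forward $k$ steps, and difference the positive and negative phases) can be demonstrated on a toy one-parameter energy model $E(x;\theta) = \tfrac{1}{2}(x-\theta)^2$. The AR(1) transition kernel and all hyperparameters below are illustrative choices for this sketch, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def dE_dtheta(x, theta):
    # Energy E(x; theta) = 0.5 * (x - theta)^2, so dE/dtheta = -(x - theta)
    return -(x - theta)

def mcmc_step(x, theta, a=0.3):
    # AR(1) transition kernel whose stationary distribution is N(theta, 1)
    noise = rng.normal(scale=np.sqrt(1.0 - (1.0 - a) ** 2), size=x.shape)
    return (1.0 - a) * x + a * theta + noise

def cd_k_gradient(data, theta, k=1):
    """CD-k estimate: initialize the chain at the data, run k steps forward."""
    x_k = data.copy()
    for _ in range(k):
        x_k = mcmc_step(x_k, theta)
    # positive phase (data) minus negative phase (k-step samples)
    return dE_dtheta(data, theta).mean() - dE_dtheta(x_k, theta).mean()

data = rng.normal(loc=2.0, size=5000)  # stand-in for one silo's observations
theta = 0.0
for _ in range(200):
    theta -= 0.1 * cd_k_gradient(data, theta, k=1)

assert abs(theta - data.mean()) < 0.2  # theta moves to the data distribution
```

The point of the sketch is that the intractable normalization term never has to be evaluated: only the energy gradient on data samples and on short-chain samples is needed.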
Consider the regularizer $\mathcal{L}^{-}_{\rm cd}$ in the equation above as a Bayesian statistical inference task; we can calculate its gradient as in the equation above, but with the roles of $\theta^{b}$ and $\theta^{s}$ inverted:

$$\frac{\partial \mathcal{L}^{-}_{\rm cd}}{\partial \theta^{s}} = \frac{\partial}{\partial \theta^{s}}\left(\mathcal{L}_{\rm KL}\big(Q^{0}\,\|\,Q^{\theta^{s}}_{\infty}\big) - \mathcal{L}_{\rm KL}\big(Q^{k}_{\theta^{b}}\,\|\,Q^{\theta^{s}}_{\infty}\big)\right)$$
We note the key difference: while the weights $\theta^{b}$ of the backbone are updated by the accumulation process in the equation above, the weights $\theta^{s}$ of the sub-network are not. This leads to different convergence behaviors of the contrastive divergence in $\theta^{b}$ and $\theta^{s}$. The negative regularizer term $\mathcal{L}^{-}_{\rm cd}$ will converge to the stationary state provided the energy function is bounded:

$$\lim_{k \to \infty} Q^{k}_{\theta^{s}} = \lim_{k \to \infty} \mathcal{K}^{k} Q^{0} = Q^{\theta^{s}}_{\infty}$$

for any initialization $Q^{0}$. Note that $\mathcal{K}$ is the transition kernel of the Markov chain.
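The convergence claim rests on a standard Markov-chain property: repeatedly applying a well-behaved transition kernel drives any initial distribution to the same stationary distribution. A small sketch with an assumed 3-state kernel:

```python
import numpy as np

# A row-stochastic transition kernel K over three states (values assumed).
K = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

q = np.array([1.0, 0.0, 0.0])  # Q^0: start the chain in a single state
for _ in range(200):           # Q^{k+1} = Q^k K
    q = q @ K

q_other = np.array([0.0, 0.0, 1.0])  # a completely different initialization
for _ in range(200):
    q_other = q_other @ K

# Both chains reach the same stationary distribution Q_infinity.
assert np.allclose(q, q_other, atol=1e-8)
assert np.allclose(q, q @ K, atol=1e-10)  # fixed point of the kernel
```

In the loss above, the role of `K` is played by the MCMC transition kernel induced by the sub-network's (fixed within a round) weights.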
Note that the negative regularizer term is only used in training models on local silos. Thus, it does not contribute to the accumulation process of federated training.
3. Total Training Loss
Local Regression Loss. We use the mean squared error (MSE) to compute the loss for predicting the steering angle in each local silo. Note that we only use features from the backbone for predicting steering angles.
$$\mathcal{L}_{\rm lr} = \mathrm{MSE}\left(\xi_i, \hat{\xi}_i\right)$$

where $\hat{\xi}_i$ is the ground-truth steering angle of the data sample collected from silo $i$ and $\xi_i$ is the angle predicted from the backbone features.
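A minimal sketch of this loss; the steering-angle values are made up for illustration:

```python
import numpy as np

def local_regression_loss(pred_angles, gt_angles):
    """Mean squared error between predicted and ground-truth steering angles."""
    pred_angles = np.asarray(pred_angles, dtype=float)
    gt_angles = np.asarray(gt_angles, dtype=float)
    return float(np.mean((pred_angles - gt_angles) ** 2))

# Steering angles (e.g. in radians) from one silo's batch, illustrative values.
pred = [0.10, -0.05, 0.30]
gt = [0.12, -0.02, 0.25]

loss = local_regression_loss(pred, gt)
assert loss > 0.0
assert local_regression_loss(gt, gt) == 0.0  # perfect prediction gives zero
```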
Local Silo Loss. The local silo loss, computed in each communication round at each silo before applying the accumulation process, is described as:

$$\mathcal{L} = \mathcal{L}_{\rm lr} + \mathcal{L}_{\rm cd}$$
In practice, we observe that the contrastive divergence loss handling the non-IID problem and the local regression loss predicting the steering angle are equally important and indispensable.
Combining all losses together, at each iteration $h$, the update in the backbone network is defined as:
$$\theta^{b}_i(h+1) = \theta^{b}_i(h) - \alpha_h \nabla\left(\mathcal{L}_{\rm lr} + \mathcal{L}^{+}_{\rm cd}\right)\!\left(\theta^{b}_i(h)\right)$$

where $\alpha_h$ is the local learning rate and $m$ is the number of local updates, with $0 \le h < m$.
In parallel, the update in the sub-network at each iteration $h$ is described as:

$$\theta^{s}_i(h+1) = \theta^{s}_i(h) - \alpha_h \nabla \mathcal{L}^{-}_{\rm cd}\!\left(\theta^{s}_i(h)\right)$$
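The two parallel update rules can be sketched with scalar stand-in weights and assumed surrogate gradients: a quadratic regression term pulling the backbone toward a local optimum, and divergence terms pulling the two branches toward each other. This is a toy dynamics illustration under those assumptions, not the actual model:

```python
# Scalar stand-ins for the backbone (theta_b) and sub-network (theta_s) weights.
theta_b, theta_s = 1.0, 0.0
alpha, m = 0.1, 100   # learning rate and number of local updates (assumed)
w_local = 2.0         # hypothetical local optimum of the regression loss

for h in range(m):
    # Surrogate gradients: grad(L_lr + L_cd+) for the backbone and
    # grad(L_cd-) for the sub-network (illustrative quadratic stand-ins).
    g_backbone = (theta_b - w_local) + (theta_b - theta_s)
    g_subnet = theta_s - theta_b
    theta_b, theta_s = theta_b - alpha * g_backbone, theta_s - alpha * g_subnet

# The branches end the round close to each other and near the local optimum,
# i.e. the gradients of the two learning processes show minimal disparity.
assert abs(theta_b - theta_s) < 0.2
assert abs(theta_b - w_local) < 0.2
```

The coupling term is what produces the synchronization condition described earlier: when both surrogate gradients vanish, the two branches agree and the backbone sits at its local optimum.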
Next
In the next post, we will evaluate the effectiveness of the Contrastive Divergence Loss in dealing with the non-IID problem in federated autonomous driving.