Reducing Non-IID Effects in Federated Autonomous Driving with Contrastive Divergence Loss (Part 2)
In the previous post, we explored federated learning in the context of autonomous driving, its potential, and the challenges posed by non-IID data distributions. We also discussed how contrastive divergence offers a promising avenue for tackling the non-IID problem in federated learning setups. Building on that foundation, this post examines how a contrastive divergence loss can be integrated into the federated learning process, and what this integration implies for model performance and non-IID robustness in autonomous driving scenarios.
1. Overview
Motivation: The effectiveness of federated learning algorithms in autonomous driving hinges on two critical factors: the ability of each local silo to glean meaningful insights from its own data, and the synchronization among neighboring silos to mitigate the impact of the non-IID problem. Recent efforts have primarily addressed these challenges by optimizing the accumulation process and optimizers, proposing novel network topologies, or leveraging robust deep networks capable of handling the distributed nature of the data. However, as highlighted by Duan et al., the indiscriminate adoption of high-performance deep architectures and their associated optimizations from centralized learning scenarios can increase the weight variance among local silos during the accumulation process in federated setups. This variance harms model convergence and may even induce divergence, underscoring the need for nuanced approaches to ensure the efficacy of federated learning in autonomous driving contexts.
Siamese Network Approach: In our study, we propose a novel approach that directly tackles the non-IID problem within each local silo by addressing the challenges of learning optimal features and achieving synchronization separately. Our strategy implements \textit{two distinct networks within each silo}: one network extracts meaningful features from local image data, while the other minimizes the distribution gap between the current model weights and those of neighboring silos. To facilitate this, we employ a Siamese Network architecture comprising two branches. The first branch, serving as the backbone network, learns local image features for autonomous steering using a local regression loss $\mathcal{L}_{\rm lr}$, while simultaneously incorporating a positive contrastive divergence loss $\mathcal{L}^{+}_{\rm cd}$ to assimilate knowledge from neighboring silos. Meanwhile, the second branch, referred to as the sub-network, regulates divergence factors arising from the backbone's knowledge through a contrastive regularizer term $\mathcal{L}^{-}_{\rm cd}$. See Figure 1 for more detail.
In practice, the sub-network initially adopts the same weights as the backbone during the first communication round. From subsequent rounds onward, once the backbone has been accumulated using the equation below, each silo's local model is trained with the contrastive divergence loss. The sub-network produces auxiliary features with the same dimensions as the backbone's output features. Throughout training, we expect only minimal weight discrepancies between the backbone and the sub-network when the contrastive divergence loss is employed; the weights across all silos become synchronized when the gradients from the backbone and sub-network learning processes exhibit minimal disparity.
$$\theta_i(k+1) = \sum_{j \in \mathcal{N}_i \cup \{i\}} \mathbf{A}_{i,j}\,\theta_j(k) - \alpha_k \nabla \mathcal{L}_{\rm lr}\left(\theta_i(k)\right)$$

where $\mathcal{L}_{\rm lr}$ is the local regression loss for autonomous steering, $\mathcal{N}_i$ is the set of neighboring silos of silo $i$, $\mathbf{A}_{i,j}$ is the accumulation weight between silos $i$ and $j$, and $\alpha_k$ is the learning rate at round $k$.
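To make the two-branch setup concrete, here is a minimal NumPy sketch of a silo holding a backbone and a sub-network that starts as an exact copy of it and emits auxiliary features of identical dimensions. All names (`FeatureNet`, `features`) are hypothetical stand-ins, not the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

class FeatureNet:
    """Toy linear feature extractor standing in for a real deep network."""
    def __init__(self, w):
        self.w = np.array(w, dtype=float)  # the silo's model weights

    def features(self, x):
        return x @ self.w  # backbone and auxiliary features share this shape

# Communication round 0: the sub-network adopts the backbone's weights.
backbone = FeatureNet(rng.normal(size=(4, 8)))
subnet = FeatureNet(backbone.w.copy())

x = rng.normal(size=(2, 4))  # a small batch of local samples
f_b, f_s = backbone.features(x), subnet.features(x)

# Identical dimensions and, before any local training, identical values.
assert f_b.shape == f_s.shape
assert np.allclose(f_b, f_s)
```

From the second round onward, the two sets of weights drift apart and the contrastive divergence loss is what keeps that drift small.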
2. Contrastive Divergence Loss
In practice, we have noticed that the initial phases of federated learning often yield subpar accumulated models. Unlike approaches that tackle the non-IID issue by refining the accumulation step whenever silos transmit their models, we directly mitigate the impact of divergence factors during the local learning phase of each silo. Our method minimizes the discrepancy between the distribution of the weights accumulated from neighboring silos in the backbone network (representing divergence factors) and the weights of silo $i$ itself in the sub-network (comprising locally learned knowledge). Once the distributions between silos reach an acceptable level of synchronization, we reduce the influence of the sub-network and prioritize the steering angle prediction task. Inspired by the contrastive loss of the original Siamese Network, our proposed Contrastive Divergence Loss is formulated as follows:
$$\mathcal{L}_{\rm cd} = \mathcal{L}^{+}_{\rm cd} + \mathcal{L}^{-}_{\rm cd}$$

where $\mathcal{L}^{+}_{\rm cd}$ is the positive contrastive divergence term and $\mathcal{L}^{-}_{\rm cd}$ is the negative regularizer term; $\mathcal{L}_{\rm KL}$ is the Kullback-Leibler divergence loss function:
$$\mathcal{L}_{\rm KL}\left(\hat{y}, y\right) = \sum \hat{y}\,\log\frac{\hat{y}}{y}$$

where $\hat{y}$ is the predicted representation and $y$ is the dynamic soft label.
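As an illustration of the loss above, the following sketch computes the KL divergence between a softmax "predicted representation" and a "dynamic soft label"; the concrete logits are invented for the example:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_loss(y_hat, y, eps=1e-12):
    """KL(y_hat || y) between a predicted representation and a soft label."""
    y_hat = np.clip(y_hat, eps, 1.0)
    y = np.clip(y, eps, 1.0)
    return float(np.sum(y_hat * np.log(y_hat / y)))

y_hat = softmax(np.array([2.0, 1.0, 0.1]))  # predicted representation
y = softmax(np.array([1.8, 1.1, 0.2]))      # dynamic soft label (assumed)

loss = kl_loss(y_hat, y)
assert loss >= 0.0                  # KL divergence is non-negative
assert kl_loss(y_hat, y_hat) == 0.0  # and zero for identical distributions
```

Note that KL is asymmetric, which is exactly why the positive term and the negative regularizer below differ by swapping the roles of the two networks.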
Consider $\mathcal{L}^{+}_{\rm cd}$ in the equation above as a Bayesian statistical inference task: our goal is to estimate the model parameters $\theta^{b}$ by minimizing the Kullback-Leibler divergence between the measured regression probability distribution $Q^{0}$ of the observed local silo and the accumulated model distribution $Q^{\theta^{b}}_{\infty}$. Hence, we can assume that the model distribution has the form $p(x;\theta^{b}) = e^{-E(x;\theta^{b})}/Z(\theta^{b})$, where $Z(\theta^{b})$ is the normalization term. However, evaluating the normalization term is not trivial, which risks getting stuck in a local minimum. Inspired by Hinton, we use samples obtained through a Markov Chain Monte Carlo (MCMC) procedure with a specific initialization strategy to deal with this problem. As inferred from the equation above, the gradient of $\mathcal{L}^{+}_{\rm cd}$ can be expressed under the SGD algorithm in a local silo by setting:
$$\frac{\partial \mathcal{L}^{+}_{\rm cd}}{\partial \theta^{b}} = \frac{\partial}{\partial \theta^{b}}\left(\mathcal{L}_{\rm KL}\big(Q^{0}\,\|\,Q^{\theta^{b}}_{\infty}\big) - \mathcal{L}_{\rm KL}\big(Q^{k}_{\theta^{s}}\,\|\,Q^{\theta^{b}}_{\infty}\big)\right)$$

where $Q^{k}_{\theta^{s}}$ is the measured probability distribution on the samples obtained by initializing the chain at $Q^{0}$ and running the Markov chain forward for a defined number of steps $k$.
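The CD-k recipe above (initialize the chain at the data, run it forward $k$ steps, and difference the positive and negative phases) can be demonstrated on a toy one-parameter energy model $E(x;\theta) = \tfrac{1}{2}(x-\theta)^2$. The AR(1) transition kernel and all hyperparameters below are illustrative choices for this sketch, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def dE_dtheta(x, theta):
    # Energy E(x; theta) = 0.5 * (x - theta)^2, so dE/dtheta = -(x - theta)
    return -(x - theta)

def mcmc_step(x, theta, a=0.3):
    # AR(1) transition kernel whose stationary distribution is N(theta, 1)
    noise = rng.normal(scale=np.sqrt(1.0 - (1.0 - a) ** 2), size=x.shape)
    return (1.0 - a) * x + a * theta + noise

def cd_k_gradient(data, theta, k=1):
    """CD-k estimate: initialize the chain at the data, run k steps forward."""
    x_k = data.copy()
    for _ in range(k):
        x_k = mcmc_step(x_k, theta)
    # positive phase (data) minus negative phase (k-step samples)
    return dE_dtheta(data, theta).mean() - dE_dtheta(x_k, theta).mean()

data = rng.normal(loc=2.0, size=5000)  # stand-in for one silo's observations
theta = 0.0
for _ in range(200):
    theta -= 0.1 * cd_k_gradient(data, theta, k=1)

assert abs(theta - data.mean()) < 0.2  # theta moves to the data distribution
```

The point of the sketch is that the intractable normalization term never has to be evaluated: only the energy gradient on data samples and on short-chain samples is needed.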
Consider the regularizer $\mathcal{L}^{-}_{\rm cd}$ in the equation above as a Bayesian statistical inference task; we can calculate its gradient as in the equation above, but with the roles of $\theta^{b}$ and $\theta^{s}$ inverted:

$$\frac{\partial \mathcal{L}^{-}_{\rm cd}}{\partial \theta^{s}} = \frac{\partial}{\partial \theta^{s}}\left(\mathcal{L}_{\rm KL}\big(Q^{0}\,\|\,Q^{\theta^{s}}_{\infty}\big) - \mathcal{L}_{\rm KL}\big(Q^{k}_{\theta^{b}}\,\|\,Q^{\theta^{s}}_{\infty}\big)\right)$$
We note the key difference: while the weights $\theta^{b}$ of the backbone are updated by the accumulation process in the equation above, the weights $\theta^{s}$ of the sub-network are not. This leads to different convergence behaviors of the contrastive divergence in $\theta^{b}$ and $\theta^{s}$. The negative regularizer term $\mathcal{L}^{-}_{\rm cd}$ will converge to the stationary state provided the energy function is bounded:

$$\lim_{k \to \infty} Q^{k}_{\theta^{s}} = \lim_{k \to \infty} \mathcal{K}^{k} Q^{0} = Q^{\theta^{s}}_{\infty}$$

for any initialization $Q^{0}$. Note that $\mathcal{K}$ is the transition kernel of the Markov chain.
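The convergence claim rests on a standard Markov-chain property: repeatedly applying a well-behaved transition kernel drives any initial distribution to the same stationary distribution. A small sketch with an assumed 3-state kernel:

```python
import numpy as np

# A row-stochastic transition kernel K over three states (values assumed).
K = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

q = np.array([1.0, 0.0, 0.0])  # Q^0: start the chain in a single state
for _ in range(200):           # Q^{k+1} = Q^k K
    q = q @ K

q_other = np.array([0.0, 0.0, 1.0])  # a completely different initialization
for _ in range(200):
    q_other = q_other @ K

# Both chains reach the same stationary distribution Q_infinity.
assert np.allclose(q, q_other, atol=1e-8)
assert np.allclose(q, q @ K, atol=1e-10)  # fixed point of the kernel
```

In the loss above, the role of `K` is played by the MCMC transition kernel induced by the sub-network's (fixed within a round) weights.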
Note that the negative regularizer term is only used in training models on local silos. Thus, it does not contribute to the accumulation process of federated training.
3. Total Training Loss
Local Regression Loss. We use the mean squared error (MSE) to compute the loss for predicting the steering angle in each local silo. Note that we only use features from the backbone for predicting steering angles.
$$\mathcal{L}_{\rm lr} = \mathrm{MSE}\left(\xi_i, \hat{\xi}_i\right)$$

where $\hat{\xi}_i$ is the ground-truth steering angle of the data sample collected from silo $i$ and $\xi_i$ is the angle predicted from the backbone features.
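A minimal sketch of this loss; the steering-angle values are made up for illustration:

```python
import numpy as np

def local_regression_loss(pred_angles, gt_angles):
    """Mean squared error between predicted and ground-truth steering angles."""
    pred_angles = np.asarray(pred_angles, dtype=float)
    gt_angles = np.asarray(gt_angles, dtype=float)
    return float(np.mean((pred_angles - gt_angles) ** 2))

# Steering angles (e.g. in radians) from one silo's batch, illustrative values.
pred = [0.10, -0.05, 0.30]
gt = [0.12, -0.02, 0.25]

loss = local_regression_loss(pred, gt)
assert loss > 0.0
assert local_regression_loss(gt, gt) == 0.0  # perfect prediction gives zero
```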
Local Silo Loss. The local silo loss, computed in each communication round at each silo before applying the accumulation process, is described as:

$$\mathcal{L} = \mathcal{L}_{\rm lr} + \mathcal{L}_{\rm cd}$$
In practice, we observe that the contrastive divergence loss handling the non-IID problem and the local regression loss predicting the steering angle are equally important and indispensable.
Combining all losses together, at each iteration $h$, the update in the backbone network is defined as:
$$\theta^{b}_i(h+1) = \theta^{b}_i(h) - \alpha_h \nabla\left(\mathcal{L}_{\rm lr} + \mathcal{L}^{+}_{\rm cd}\right)\!\left(\theta^{b}_i(h)\right)$$

where $\alpha_h$ is the local learning rate and $m$ is the number of local updates, with $0 \le h < m$.
In parallel, the update in the sub-network at each iteration $h$ is described as:

$$\theta^{s}_i(h+1) = \theta^{s}_i(h) - \alpha_h \nabla \mathcal{L}^{-}_{\rm cd}\!\left(\theta^{s}_i(h)\right)$$
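The two parallel update rules can be sketched with scalar stand-in weights and assumed surrogate gradients: a quadratic regression term pulling the backbone toward a local optimum, and divergence terms pulling the two branches toward each other. This is a toy dynamics illustration under those assumptions, not the actual model:

```python
# Scalar stand-ins for the backbone (theta_b) and sub-network (theta_s) weights.
theta_b, theta_s = 1.0, 0.0
alpha, m = 0.1, 100   # learning rate and number of local updates (assumed)
w_local = 2.0         # hypothetical local optimum of the regression loss

for h in range(m):
    # Surrogate gradients: grad(L_lr + L_cd+) for the backbone and
    # grad(L_cd-) for the sub-network (illustrative quadratic stand-ins).
    g_backbone = (theta_b - w_local) + (theta_b - theta_s)
    g_subnet = theta_s - theta_b
    theta_b, theta_s = theta_b - alpha * g_backbone, theta_s - alpha * g_subnet

# The branches end the round close to each other and near the local optimum,
# i.e. the gradients of the two learning processes show minimal disparity.
assert abs(theta_b - theta_s) < 0.2
assert abs(theta_b - w_local) < 0.2
```

The coupling term is what produces the synchronization condition described earlier: when both surrogate gradients vanish, the two branches agree and the backbone sits at its local optimum.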
Next
In the next post, we will evaluate the effectiveness of the Contrastive Divergence Loss in dealing with the non-IID problem in federated autonomous driving.