where \({{{\psi }}}^{\left({{\rm{common}}}{\prime} \right)}\) is the new common branch after merging; ψ(common) and ψ(redundant) represent the common and redundant branches, respectively. Then we expand the important branches to adapt to the client’s computing capabilities:

$${{{\psi }}}^{\left({{\rm{important}}}{\prime} \right)}\leftarrow {{{\psi }}}^{({{\rm{important}}})}-({{{\psi }}}^{(1)}+\cdots+{{{\psi }}}^{(n)}),$$

(15)

where \({{{\psi }}}^{\left({{\rm{important}}}{\prime} \right)}\) denotes the parameters of the important branch after expansion. To keep the output unchanged, we subtract the parameters of the new branches \({\{{{{\psi }}}^{(i)}\}}_{i=1}^{n}\) from the parameters of the original important branch ψ(important). The parameters of the new branches are generated randomly.
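
As a concrete illustration, the following minimal NumPy sketch mirrors the expansion in Eq. (15); the kernel shape and the number of new branches n are illustrative assumptions, not values taken from the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original important branch: a toy 3x3 convolution kernel (shape is illustrative).
psi_important = rng.standard_normal((16, 16, 3, 3))

# Randomly generated parameters for the n new branches.
n = 2
psi_new = [rng.standard_normal(psi_important.shape) for _ in range(n)]

# Eq. (15): subtract the new branches from the original important branch,
# so the sum over all parallel branches stays the same.
psi_important_expanded = psi_important - sum(psi_new)

# Merging the expanded branch with the new branches recovers the original
# parameters, so the layer output is preserved.
merged = psi_important_expanded + sum(psi_new)
assert np.allclose(merged, psi_important)
```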

To ensure the stability and effectiveness of training, we adopt a fixed local reparameterization strategy for transformer-based models throughout the training process, unless computational resources change. Transformer architectures are particularly sensitive to parameter variations, and frequent changes to the reparameterization strategy may adversely affect their convergence and performance.

Lossless Knowledge Transfer

This section introduces the re-parameterization techniques for CNNs and transformers, demonstrating how these methods enable flexible structural transformations while preserving model outputs.

Re-parameterization for CNN

As indicated by prior works55,56, 2D convolutions hold the property of additivity:

$${{{\boldsymbol{I}}}} \circledast {{{{\boldsymbol{F}}}}}^{(1)}+{{{\boldsymbol{I}}}} \circledast {{{{\boldsymbol{F}}}}}^{(2)}={{{\boldsymbol{I}}}} \circledast \left({{{{\boldsymbol{F}}}}}^{(1)}+{{{{\boldsymbol{F}}}}}^{(2)}\right),$$

(16)

where I is the input and F(1) and F(2) are convolution kernels. The equation holds even for kernels of different sizes, since the smaller kernel can be zero-padded to the larger size before summation. Widely used CNN operations such as average pooling and batch normalization can likewise be converted into equivalent convolutions. This additivity property ensures that a single convolution can be equivalently transformed into multi-branch operations, and vice versa. These equivalent transformations guarantee lossless knowledge transfer, since the model outputs do not change when the network structure is adjusted.
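
The additivity in Eq. (16) can be checked directly; the short PyTorch sketch below uses arbitrary toy shapes and zero-pads a 1 × 1 kernel to 3 × 3 so that the two kernels can be summed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 8, 32, 32)        # input I
k3 = torch.randn(16, 8, 3, 3)        # kernel F(1), 3x3
k1 = torch.randn(16, 8, 1, 1)        # kernel F(2), 1x1

# Zero-pad the 1x1 kernel to 3x3 so the two kernels have the same spatial size.
k1_padded = F.pad(k1, [1, 1, 1, 1])

# Left-hand side of Eq. (16): run the branches separately and add the outputs.
lhs = F.conv2d(x, k3, padding=1) + F.conv2d(x, k1_padded, padding=1)

# Right-hand side: a single convolution with the summed kernel.
rhs = F.conv2d(x, k3 + k1_padded, padding=1)

print(torch.allclose(lhs, rhs, atol=1e-5))  # True up to floating-point error
```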

Re-parameterization for Transformer

The re-parameterization technique can also be adapted to transformer architectures. In the case of transformers, the linear layers hold the property of additivity, similar to the convolution operation in CNNs:

$${{{\boldsymbol{X}}}}\cdot {{{{\boldsymbol{W}}}}}^{(1)}+{{{\boldsymbol{X}}}}\cdot {{{{\boldsymbol{W}}}}}^{(2)}={{{\boldsymbol{X}}}}\cdot \left({{{{\boldsymbol{W}}}}}^{(1)}+{{{{\boldsymbol{W}}}}}^{(2)}\right),$$

(17)

where X is the input feature matrix, and W(1) and W(2) are the weight matrices of two linear layers. This equation implies that multiple parallel linear layers can be equivalently merged into a single linear layer, and vice versa.
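
A minimal PyTorch sketch of this merging is shown below; the layer sizes are arbitrary, and the branch biases are summed as well, which the bias-free form of Eq. (17) extends to directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 64)                              # input feature matrix X
branch1, branch2 = nn.Linear(64, 64), nn.Linear(64, 64)

# Merge the two parallel linear branches into a single equivalent layer.
merged = nn.Linear(64, 64)
with torch.no_grad():
    merged.weight.copy_(branch1.weight + branch2.weight)
    merged.bias.copy_(branch1.bias + branch2.bias)

print(torch.allclose(branch1(x) + branch2(x), merged(x), atol=1e-6))  # True
```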

Moreover, operations commonly used in transformer architectures, such as layer normalization and residual connections, can also be transformed into equivalent forms compatible with the re-parameterization technique. Specifically, layer normalization can be expressed as an affine transformation, which can be absorbed into the parameters of linear layers. Residual connections, being additive in nature, align naturally with the additivity property of linear operations.
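
As a sketch of the layer-normalization case, assuming the common LayerNorm → Linear ordering, the learnable affine parameters γ and β can be absorbed into the following linear layer, while the data-dependent normalization itself is kept:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
x = torch.randn(4, d)

ln = nn.LayerNorm(d)                    # learnable affine: weight (gamma), bias (beta)
linear = nn.Linear(d, d)
with torch.no_grad():                   # give the affine parameters non-trivial values
    ln.weight.uniform_(0.5, 1.5)
    ln.bias.uniform_(-0.1, 0.1)

# Absorb gamma and beta into the linear layer:
#   Linear(gamma * z + beta) = (W * gamma) z + (W beta + b)
ln_plain = nn.LayerNorm(d, elementwise_affine=False)   # normalization only
linear_fused = nn.Linear(d, d)
with torch.no_grad():
    linear_fused.weight.copy_(linear.weight * ln.weight)            # column i scaled by gamma_i
    linear_fused.bias.copy_(linear.weight @ ln.bias + linear.bias)

print(torch.allclose(linear(ln(x)), linear_fused(ln_plain(x)), atol=1e-5))  # True
```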

The re-parameterization process for transformers involves expanding a single linear layer into multiple parallel linear layers, optionally combined with normalization layers (e.g., batch normalization or layer normalization). These parallel branches can then be merged back into a single equivalent linear layer without loss of information. This guarantees lossless knowledge transfer, as the outputs of the model remain unchanged before and after the structural reconfiguration:

$${{{\boldsymbol{X}}}}\cdot {{{{\boldsymbol{W}}}}}_{{{\rm{merged}}}}={{{\boldsymbol{X}}}}\cdot ({{{{\boldsymbol{W}}}}}^{(1)}+{{{{\boldsymbol{W}}}}}^{(2)}),$$

(18)

where Wmerged is the weight matrix of the merged linear layer.
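
The expand-and-merge round trip described above can be sketched as follows, under illustrative assumptions: one plain linear branch and one Linear → BatchNorm branch evaluated with its running statistics. The batch normalization is first folded into its linear layer, and the two branches are then summed into one equivalent layer as in Eq. (18).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
x = torch.randn(8, d)

branch_a = nn.Linear(d, d)                                 # plain linear branch
branch_b = nn.Sequential(nn.Linear(d, d), nn.BatchNorm1d(d))
branch_b.eval()                                            # use running statistics
with torch.no_grad():                                      # give BN non-trivial statistics
    branch_b[1].running_mean.uniform_(-1.0, 1.0)
    branch_b[1].running_var.uniform_(0.5, 2.0)
    branch_b[1].weight.uniform_(0.5, 1.5)
    branch_b[1].bias.uniform_(-0.5, 0.5)

with torch.no_grad():
    lin, bn = branch_b[0], branch_b[1]
    # Fold BN into the preceding linear layer:
    #   BN(Wx + b) = s * (Wx + b - mu) + beta, with s = gamma / sqrt(var + eps)
    s = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    w_b = lin.weight * s[:, None]                          # scale each output row by s
    b_b = s * (lin.bias - bn.running_mean) + bn.bias

    # Merge the two parallel branches into a single linear layer (Eq. (18)).
    merged = nn.Linear(d, d)
    merged.weight.copy_(branch_a.weight + w_b)
    merged.bias.copy_(branch_a.bias + b_b)

print(torch.allclose(branch_a(x) + branch_b(x), merged(x), atol=1e-5))  # True
```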

It is worth noting that, according to Eqs. (16) and (17), the re-parameterization process in DynamicFL ensures mathematical equivalence between the original and transformed model structures, guaranteeing that the input-output mapping remains unchanged. As a result, when evaluating the model on the same test set, its accuracy remains identical, thereby ensuring fairness across institutions with different computational resources.

Convergence Analysis

In this subsection, we present the convergence analysis of DynamicFL. We consider the following standard assumptions commonly made in FL analysis57,58,59:

Assumption 1

(L-smoothness and σ-uniformly bounded gradient variance).

(a) F is L-smooth, i.e., \(F(u)\le F(x)+\left\langle \nabla F(x),u-x\right\rangle+\frac{1}{2}L{\left\Vert u-x\right\Vert }^{2}\) for any \(u,x\in {{\mathbb{R}}}^{d}\).

(b) There exists a constant Gmax > 0 such that \({\mathbb{E}}\left({\left\Vert \nabla {F}^{(i)}({{{\bf{x}}}})\right\Vert }^{2}\right)\le {G}_{max}^{2},\quad \forall i\in (N),\forall {{{\bf{x}}}}\in {{\mathbb{R}}}^{d}\), where F(i)(x) is an unbiased stochastic gradient of f(i) at x.

(c) The stochastic gradients have σ2-bounded variance, i.e., \({{\mathbb{E}}}_{\xi \sim {{{{\mathcal{S}}}}}_{i}}{\left\Vert \nabla {F}_{i}({{{\bf{x}}}})-\nabla {f}_{i}({{{\bf{x}}}})\right\Vert }^{2}\le {\sigma }^{2},\quad \forall i\in (N),\forall {{{\bf{x}}}}\in {{\mathbb{R}}}^{d}.\)

Lemma 2

For any reparameterization of a convolutional layer l that can be represented as a summation of N convolutional branches with weights \({W}_{l,n}\), the gradient of the merged kernel is the gradient of the original convolution scaled by \({{{{\mathcal{G}}}}}_{l}={\sum }_{n=1}^{N}{M}_{l,n}\) (Eq. (19)); the reparameterization can therefore be seen as a spatial gradient scaling applied to the original convolution. Here we assume \({{{{\mathcal{G}}}}}_{l}\le {{{\mathcal{G}}}}\). In FL, the analogous relation holds for the local model \({\zeta }_{i,k}\) of client i, where k denotes the current local epoch, and the reparameterized structure of each client's local model leaves its outputs unchanged (Eqs. (20)-(22)).

Theorem 3

The sequence generated by our method with stepsize η ≤ 1/L satisfies

$$\frac{1}{T}{\sum }_{t=1}^{T}{\mathbb{E}}\left({\left\Vert \nabla f\left({w}_{t}^{(g)}\right)\right\Vert }^{2}\right)\le \frac{2}{\eta T}\left({\mathbb{E}}\left(f\left({w}_{1}^{(g)}\right)\right)-f\left({w}_{T}^{(g)}\right)\right)+4{\eta }^{2}{{{{\mathcal{G}}}}}^{2}{L}^{2}{G}^{2}{K}^{2}+\frac{L}{N}\eta {\sigma }^{2}.$$

(23)

Corollary 4

When the function f is lower bounded by f* with \(f({w}_{1}^{(g)})-{f}^{*}\le \Delta\) and the number of rounds T is large enough, setting the stepsize \(\eta=\frac{\sqrt{N}}{L\sqrt{T}}\) yields

$$\frac{1}{TK}\sum _{t=1}^{T}\sum \limits_{k=1}^{K}{\mathbb{E}}\left({\left\Vert \nabla f\left({w}_{t}^{(g)}\right)\right\Vert }^{2}\right)=O\left(\frac{2L\Delta+{\sigma }^{2}}{\sqrt{NT}}+\frac{N}{T}\right).$$

(24)

Since the first term dominates for large T, our algorithm shares the same convergence rate, \(O(1/\sqrt{NT})\), as vanilla FedAvg.
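
To see where the rate in Eq. (24) comes from, one can substitute \(\eta=\frac{\sqrt{N}}{L\sqrt{T}}\) into the right-hand side of Eq. (23), use \(f({w}_{1}^{(g)})-{f}^{*}\le \Delta\), and treat \({{{\mathcal{G}}}}\), G and K as constants:

$$\frac{2\Delta }{\eta T}=\frac{2L\Delta }{\sqrt{NT}},\qquad 4{\eta }^{2}{{{{\mathcal{G}}}}}^{2}{L}^{2}{G}^{2}{K}^{2}=\frac{4{{{{\mathcal{G}}}}}^{2}{G}^{2}{K}^{2}N}{T}=O\left(\frac{N}{T}\right),\qquad \frac{L}{N}\eta {\sigma }^{2}=\frac{{\sigma }^{2}}{\sqrt{NT}},$$

so the three terms are of order \(\frac{L\Delta }{\sqrt{NT}}\), \(\frac{N}{T}\) and \(\frac{{\sigma }^{2}}{\sqrt{NT}}\), respectively, which together give the bound in Eq. (24).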

Statistics & Reproducibility

This study is based on a publicly available dataset. No statistical method was used to predetermine sample size. No data were excluded from the analyses. Since the dataset is pre-collected and publicly available, no randomization or blinding was applicable. The machine learning models were trained using standard procedures, and all experiments were conducted with fixed hyperparameters unless otherwise specified.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


