A framework for self-supervised federated domain adaptation

Unsupervised federated domain adaptation uses the knowledge from several distributed labelled source domains to complete the learning on an unlabelled target domain. Some of the existing methods have limited effectiveness and involve frequent communication. This paper proposes a framework to solve the distributed multi-source domain adaptation problem, referred to as self-supervised federated domain adaptation (SFDA). Specifically, a multi-domain model generalization balance (MDMGB) is proposed to aggregate the models from multiple source domains in each round of communication. A weighting strategy based on centroid similarity is also designed for SFDA. SFDA conducts self-supervised training on the target domain to tackle domain shift. Compared with the classical federated adversarial domain adaptation algorithm, SFDA not only reduces communication cost and strengthens privacy protection but also improves the accuracy of the model.

performance by establishing transferable features from multiple labelled source domains to unlabelled target domains. Many UMDA methods combine the data from the target domain with the data from each source domain to form several source-target domain pairs, and then establish transferable characteristics by narrowing the distance between the domains [8]. This paper focuses on the UMDA scenario.
In addition to the adaptation problem, data privacy protection is also receiving increasing attention. To protect privacy, raw data cannot be shared between domains during the training of the model. Federated learning [9] is used to work under this limitation. Federated learning can be considered a kind of distributed machine learning, but it differs from traditional distributed machine learning in that it has very strict requirements on privacy and efficiency, and the data of each node are not independent and identically distributed (non-IID), which causes obvious domain shifts between nodes in a specific task. Inspired by federated learning and unsupervised domain adaptation, Peng et al. [10] proposed unsupervised federated domain adaptation (UFDA) and introduced federated adversarial domain adaptation (FADA) as a method to solve it. However, FADA has poor accuracy and high communication costs, and it is prone to privacy leakage. Its adversarial training requires each source domain to exchange and update model parameters with the target domain after each minibatch, which consumes a large amount of communication resources and increases the risk of privacy leakage. In contrast, the multi-domain model generalization balance (MDMGB) in this paper reduces the impact of these two defects and improves accuracy. At the same time, the performance of the model is further enhanced with information maximization and pseudo-labelling. We refer to the whole process, including MDMGB, as self-supervised federated domain adaptation (SFDA).
Our main contributions are summarized as follows.
1. We propose an architecture that efficiently and effectively transfers knowledge learned from multiple source domains to the target domain.
2. We develop a weighting strategy based on the centroid similarity between the source and target domains. The proposed strategy does not require the sharing of raw data between domains.
3. We evaluate our approach empirically on two benchmark datasets; compared with existing methods, the performance of our method is significantly improved.
Section 2 presents the related work. Section 3 describes the details of multi-domain model generalization balance and self-supervised federated domain adaptation. Section 4 presents the experiments and analyses the results. Finally, Sect. 5 concludes this work and discusses future work.

Unsupervised multi-source domain adaptation
UMDA aims to transfer knowledge learned from multiple labelled source domains to a single unlabelled target domain. Many unsupervised multi-source domain adaptation methods are based on the theory of learning from different domains proposed by Ben-David et al. [11]. At present, there are two mainstream strategies to learn transferable features. One is the moment matching method represented by the maximum mean discrepancy (MMD) and correlation alignment (CORAL) [12,13,14,15,16,17,18], and the other is the adversarial training method [17,19,20,21,22].
In the first kind of method, MMD [12] and CORAL [14] are designed to match first-order (mean) and second-order (covariance) statistics of different distributions, respectively, while HoMM [15] can perform moment tensor matching of any order. The maximum mean discrepancy is often used to measure the distance between two distributions and is a commonly used loss function in transfer learning. MMD is an effective measure that can compare different distributions without first estimating the density function. In domain adaptation, the original MMD [12] is defined as

$$\mathrm{MMD}(\mathcal{F}, p, q) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \right),$$

where p and q represent the probability distributions of the source domain and target domain, respectively, sup stands for supremum and E stands for expectation. F is the set of functions in the RKHS (reproducing kernel Hilbert space) whose norm is less than or equal to 1. However, the above form cannot be calculated directly and requires the use of the kernel trick. The most commonly used kernel function is the Gaussian kernel function,

$$k(u, v) = \exp\left( -\frac{\lVert u - v \rVert^2}{2\sigma^2} \right),$$

where $\sigma$ is the kernel bandwidth. By replacing the expectation with the mean value of each minibatch, the following computable form can be obtained:

$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} k(x_i^s, x_j^s) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} k(x_i^s, x_j^t) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} k(x_i^t, x_j^t),$$

where $x_i^s$ and $x_j^t$ are the source and target samples in the minibatch and $n_s$ and $n_t$ are their numbers.
Among the second kind of methods, the representative algorithm is DANN (domain adversarial neural network), which for the first time introduces the idea of adversarial training [23] into the field of transfer learning and learns features with category discrimination and domain invariance through the joint optimization of a feature extractor, a label predictor and a domain classifier. The feature extractor and domain classifier are equivalent to the generator and discriminator in a generative adversarial network, respectively, and they form an adversarial relationship that promotes the emergence of domain-invariant features.
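To make the minibatch form of MMD with a Gaussian kernel concrete, a minimal NumPy sketch is given here. It assumes the common bandwidth parameterization exp(-||u - v||^2 / (2 sigma^2)) and the biased estimator; the function names and the default `sigma` are illustrative choices, not part of any cited method.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel matrix between rows of a (n, d) and b (m, d)."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased minibatch estimate of the squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, sigma).mean()
    k_yy = gaussian_kernel(y, y, sigma).mean()
    k_xy = gaussian_kernel(x, y, sigma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(50, 3))
    print(mmd2(source, source))        # identical batches
    print(mmd2(source, source + 3.0))  # shifted batch gives a larger value
```

The estimate is zero for identical batches and grows as the two batches move apart, which is the property that makes it usable as a distribution-matching loss.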

Domain generalization
Unlike domain adaptation, domain generalization [8,24,25] cannot use any sample of the target domain, but it still has to capture transferable information across domains. To complete the classification task without the target domain available at the time of training, labelled data from several related classification tasks can be used. Meta-learning involves generalization to a new task [13], but because meta-learning is more concerned with how quickly the model converges when the labelled data for the new task are acquired (in small amounts), it assumes that the labelled data for the new task are accessible. When a new task appears, a meta-learning model needs to be retrained, but a domain generalization model does not.

Federated learning
Federated learning (FL) is a machine learning setup that aims to train a high-quality centralized model while the training data remain distributed over a large number of clients, each with an unreliable and relatively slow network connection. Many effective methods [26] have been proposed to optimize communication, so this paper focuses on how to learn a centralized model with high performance. The federated optimization problem has four key properties: non-IID: the data on each client are specific to the user; unbalanced: the amount of data generated by each client varies greatly; massively distributed: the number of clients participating in an optimization is much larger than the average number of examples per client; and limited communication: network connections are unstable, insecure or expensive. The third property has received less attention because most of the data are typically concentrated in a very small number of nodes, with the remaining nodes contributing almost nothing to the results. The federated averaging (FedAvg) algorithm [9] is a basic algorithm that aggregates model updates from different clients in each round of communication. Its key point is that all client models must start from the same random initialization, after which the server distributes the aggregated model to the clients in each round of communication. However, FedAvg weights the client models by their amount of data, and the amount of data does not correlate significantly with differences in the domain distribution. Therefore, it cannot be simply applied to unsupervised multi-source domain adaptation.
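The data-amount weighting used by FedAvg can be sketched in a few lines. This is a simplified illustration that assumes each client model is a dictionary of NumPy arrays with identical keys; the names `fedavg`, `client_params` and `client_sizes` are hypothetical.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Average client parameters, weighting each client by its number of samples."""
    total = float(sum(client_sizes))
    return {
        name: sum((n / total) * params[name]
                  for params, n in zip(client_params, client_sizes))
        for name in client_params[0]
    }

if __name__ == "__main__":
    clients = [{"w": np.array([0.0, 0.0])}, {"w": np.array([4.0, 4.0])}]
    sizes = [1, 3]  # the second client holds three times as much data
    print(fedavg(clients, sizes)["w"])
```

Because the weights depend only on the sample counts, a source domain that is far from the target but large in size dominates the aggregate, which is the limitation noted above.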

Federated domain adaptation
There has been very little discussion of distributed UMDA, although some excellent work exists, the most advanced and representative of which is the recently proposed KD3A (knowledge distillation-based decentralized domain adaptation) [27]. However, its computational efficiency is relatively low because each iteration trains on the source domains in the early stage and on the target domain in the late stage. While either party is training, the other party has to sit idle for a long time and wait for the result. The concept of unsupervised federated domain adaptation (UFDA) was first proposed in FADA [10]. UFDA studies how to transfer knowledge learned from decentralized nodes to a new node with a different data distribution.
UFDA presents three challenges. First, the data are stored locally on the clients and cannot be shared, which makes mainstream domain adaptation approaches unfeasible because they require access to both labelled source domain data and unlabelled target domain data. Second, the model parameters of each node are trained separately and converge at different rates, and their contribution to the target model depends on how close each source domain is to the target domain. Finally, the knowledge learned from each source domain is highly entangled, which is likely to result in negative transfer. SHOT [28] proposed a self-supervised method for domain adaptation in which a single source domain and the target domain are separated (the two do not share data). It produces as many result models as there are source domains, which increases the storage overhead. In addition, every model is involved in the calculation of the results.
To solve the problem of multi-source model aggregation in federated domain adaptation, this paper proposes a multi-domain model generalization balance algorithm (MDMGB). The weighting strategy of this method abandons the traditional weighting method, which depends on the quantity of data, and measures the tightness of the source domain and target domain by the similarity of their centroids. MDMGB can calculate weights without sharing data between domains. In addition, to improve the performance of domain adaptation under federated learning constraints, we propose a self-supervised federated domain adaptation framework (SFDA). MDMGB is used for the aggregation of the source models in each communication. SFDA completely separates the training of the source domain model from the training of the target domain model and trains the target model in a self-supervised way.

Methods
Let $D_S$ and $D_T$ represent the source and target domains, respectively. In unsupervised multi-source domain adaptation, there are a total of K source domains $\{D_S^k\}_{k=1}^{K}$ and a target domain with $N_T$ unlabelled samples, i.e. $D_T = \{x_i^T\}_{i=1}^{N_T}$. The goal of UMDA is to learn a model h that minimizes the task-specific loss $\epsilon_{D_T}(h)$ on the target domain $D_T$. The general UMDA model learns transferable features by minimizing the H-divergence. In the context of distributed UMDA, this paper does not need to access the data of each source domain, but studies the domain adaptation problem under the federated learning framework. Every domain, whether it is the target domain or a source domain, is treated as a client in federated learning. If conditions permit, the client of the target domain can also act as the server.

Centroid alignment strategy
The centroid of each class in the domain is obtained by the following formula [29]:

$$c_k^{(0)} = \frac{\sum_{x \in X} \delta_k(f(x))\, g(x)}{\sum_{x \in X} \delta_k(f(x))},$$

where X denotes the samples of the domain, $f(x) = h(g(x))$ is the output vector of each sample (one entry per class), g represents the feature extractor and h represents the classifier. $\delta_k(a) = \exp(a_k) / \sum_i \exp(a_i)$ refers to the kth element in the softmax output of the vector a. These centroids represent the distribution of the different classes in the domain stably and reliably. Then, the nearest-centroid classifier is used to obtain the pseudo-label of each sample:

$$\hat{y} = \arg\min_k D_f\big(g(x), c_k^{(0)}\big),$$

where $D_f(a, b)$ measures the cosine distance between a and b. The specific form is as follows:

$$D_f(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert\, \lVert b \rVert}.$$

The centroids are updated iteratively through the following formula:

$$c_k^{(1)} = \frac{\sum_{x \in X} \mathbb{1}(\hat{y} = k)\, g(x)}{\sum_{x \in X} \mathbb{1}(\hat{y} = k)}, \qquad \hat{y} = \arg\min_k D_f\big(g(x), c_k^{(1)}\big).$$

However, one update is usually good enough. It is worth mentioning that in practical calculations, a constant nonzero component (such as 1) is appended to each feature vector to avoid division by zero.

Fig. 1 MDMGB algorithm framework. The figure shows the structure of the MDMGB algorithm and the processes between its parts.
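The centroid computation and nearest-centroid pseudo-labelling described in this subsection can be sketched as follows. This is a minimal NumPy illustration, assuming that the features g(x) and the classifier outputs h(g(x)) are already available as arrays; the function names are hypothetical, and keeping the previous centroid for a class with no assigned samples is our own choice for the sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cosine_distance(a, b):
    """Pairwise cosine distance between rows of a (n, d) and rows of b (c, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def centroid_pseudo_labels(features, logits, n_updates=1):
    # Append a constant 1 to every feature vector to avoid division by zero.
    feats = np.hstack([features, np.ones((features.shape[0], 1))])
    probs = softmax(logits)                                    # shape (N, C)
    # Initial centroids: softmax-weighted mean of the features per class.
    centroids = probs.T @ feats / probs.sum(axis=0)[:, None]
    labels = cosine_distance(feats, centroids).argmin(axis=1)
    for _ in range(n_updates):
        onehot = np.eye(probs.shape[1])[labels]                # shape (N, C)
        counts = onehot.sum(axis=0)[:, None]
        hard = onehot.T @ feats / np.maximum(counts, 1e-8)
        # Assumption: keep the old centroid if no sample was assigned to a class.
        centroids = np.where(counts > 0, hard, centroids)
        labels = cosine_distance(feats, centroids).argmin(axis=1)
    return centroids, labels

if __name__ == "__main__":
    feats = np.array([[5.0, 0.0], [6.0, 0.0], [0.0, 5.0], [0.0, 6.0]])
    logits = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]])
    centroids, labels = centroid_pseudo_labels(feats, logits)
    print(labels)
    print(centroids)
```

Only the per-class centroids, not the samples themselves, would need to leave a client in this scheme.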

The algorithm of MDMGB
Figure 1 shows the structure of the MDMGB algorithm. The weight of each source domain is determined according to its distance (tightness) from the target domain. Since the weight calculation requires the target domain, MDMGB omits steps 2 and 3 when the target domain is not yet available. For UMDA, there are many methods to calculate the weight of a source domain based on its distance from the target domain, but they do not work in the UFDA setting because they need access to both source domain and target domain data. In contrast to existing distributed UMDA approaches, MDMGB aggregates the models rather than the original gradients. In addition, MDMGB does not use the amount of data on each source domain when weighting the source domains; the reasons are given in Appendix A. It should be emphasized that MDMGB is built on the premise that each source domain has a sufficiently large amount of data, corresponding to the critical few nodes in federated learning, and that the data volumes of the nodes do not differ by orders of magnitude.
The following is the weighting strategy of the MDMGB algorithm. Assume that we now have the uploaded source models $w_1, w_2, \ldots, w_K$ from K source domains. A preliminary model is obtained by averaging them directly:

$$w_0 = \frac{1}{K} \sum_{k=1}^{K} w_k.$$

Next, $w_0$ is distributed to the target domain and each source domain to compute and return the centroid for each category, following the centroid alignment strategy. At this point, assuming that the number of categories is L, the centroids of the kth source domain are $c_0^{S_k}, c_1^{S_k}, \ldots, c_{L-1}^{S_k}$ and the centroids of the target domain are $c_0^T, c_1^T, \ldots, c_{L-1}^T$. The size of a set of centroids uploaded by each domain depends only on the number of categories and the size of the feature vector, both of which are small relative to the model, so this process has almost no impact on communication efficiency. Then, the cosine similarity between each centroid of each source domain and the corresponding centroid of the target domain is calculated and summed:

$$\mathrm{sum}_k = \sum_{l=0}^{L-1} \cos\big(c_l^{S_k}, c_l^T\big).$$

Since the cosine similarity ranges over [−1, 1], $\mathrm{sum}_k$ is then adjusted so that the final sum value is non-negative. Finally, this sum value is used to calculate the weights. At this point, the weights obtained can be used to re-aggregate the source models.
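A possible reading of this weighting strategy is sketched below in NumPy. Two steps are assumptions made only for illustration, since their exact form is not recoverable from the text: the sum of similarities is made non-negative by adding the number of categories L, and the weights are obtained by normalizing the shifted sums so that they add up to 1. All function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mdmgb_weights(source_centroids, target_centroids):
    """source_centroids: list of K arrays of shape (L, D); target_centroids: (L, D)."""
    num_classes = target_centroids.shape[0]
    sums = np.array([
        sum(cosine_similarity(cs[l], target_centroids[l]) for l in range(num_classes))
        for cs in source_centroids
    ])
    shifted = sums + num_classes      # ASSUMPTION: shift by L so every sum is >= 0
    total = shifted.sum()
    if total == 0.0:                  # degenerate case: fall back to uniform weights
        return np.full(len(source_centroids), 1.0 / len(source_centroids))
    return shifted / total            # ASSUMPTION: normalize the sums into weights

def aggregate(models, weights):
    """Weighted average of source models stored as dicts of arrays."""
    return {name: sum(w * m[name] for w, m in zip(weights, models))
            for name in models[0]}

if __name__ == "__main__":
    target = np.array([[1.0, 0.0], [0.0, 1.0]])
    close_source = target.copy()
    far_source = np.array([[0.0, 1.0], [1.0, 0.0]])
    weights = mdmgb_weights([close_source, far_source], target)
    print(weights)
    print(aggregate([{"w": np.array([3.0])}, {"w": np.array([0.0])}], weights))
```

Under these assumptions, a source domain whose class centroids point in the same directions as the target centroids receives a larger share of the aggregated model, independently of how much data it holds.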

Training the source model
First, the model (the same model distributed by the server) is trained separately on each client (source domain) in a federated learning manner. During the training process, the model parameters are communicated once every several iterations. Label smoothing (LS) is used to improve the model's generalization ability as each source domain client trains the model locally. The standard cross-entropy loss function is

$$L_{src}(f_s; X_s, Y_s) = -\mathbb{E}_{(x_s, y_s) \in X_s \times Y_s} \sum_{k=1}^{K} q_k \log \delta_k(f_s(x_s)),$$

where $f_s(x_s)$ is the output of the source domain sample $x_s$ on the source model, and q is a K-dimensional vector whose entries are all 0 except for a 1 at the position corresponding to the correct label. To improve the discriminating ability of the model, LS is applied to source model learning. With the label smoothing technique, the loss function becomes

$$L_{src}^{ls}(f_s; X_s, Y_s) = -\mathbb{E}_{(x_s, y_s) \in X_s \times Y_s} \sum_{k=1}^{K} q_k^{ls} \log \delta_k(f_s(x_s)),$$

where $q_k^{ls} = (1 - \alpha) q_k + \alpha / K$, K represents the number of categories and α is the smoothing factor, which is usually set to 0.1.
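The effect of label smoothing on the cross-entropy loss can be checked with a small NumPy sketch. It assumes the usual smoothed target (1 - alpha) * q + alpha / K; the function names are illustrative.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def label_smoothing_loss(logits, labels, alpha=0.1):
    """Cross-entropy with smoothed one-hot targets; alpha=0 gives the standard loss."""
    num_classes = logits.shape[1]
    q = np.eye(num_classes)[labels]                    # one-hot targets
    q_ls = (1.0 - alpha) * q + alpha / num_classes     # smoothed targets
    return float(-(q_ls * log_softmax(logits)).sum(axis=1).mean())

if __name__ == "__main__":
    logits = np.array([[4.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
    labels = np.array([0, 1])
    print(label_smoothing_loss(logits, labels, alpha=0.0))  # standard cross-entropy
    print(label_smoothing_loss(logits, labels, alpha=0.1))  # smoothed targets
```

With smoothed targets, very confident predictions are penalized slightly, which discourages the source model from overfitting its own domain.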
During each round of communication, the client sends its own model to the server for aggregation (applying MDMGB). Finally, the server distributes the updated model to each client. This process is performed several times until the model converges. If the target domain is still unknown at this time, the unified model obtained can be considered the result of domain generalization.
The domain can be understood as a kind of interference information [30,31]. An image in a digit recognition dataset is made up of numbers and interfering information (background colour, light intensity, line thickness, etc.). The difference between the images in different domains is caused by the interference information, while the basic characteristics of the numbers themselves remain unchanged. In addition, the learned model represents the knowledge expressed in each source domain for a specific task, and this knowledge contains interference information and the corresponding basic characteristics.
In distributed training, it is very important to set the number of communication rounds r. Since the models in different source domains converge at different rates, the models need to be aggregated r times in each iteration. Accordingly, each iteration can be divided equally into r stages, and the models are aggregated after each stage. For example, r = 0.2 means aggregating once every 5 iterations, and r = 1 means aggregating once per iteration. It has been shown that better performance requires more communication during training (in general, more rounds are better). However, an unrestricted increase in the number of communication rounds not only yields smaller and smaller improvements in accuracy but also increases the communication cost proportionally [9,16,19,29].
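The meaning of r can be illustrated with a short sketch that lists the points, measured in iterations, at which the models would be aggregated. This is only one way to realize the schedule; the function name and the use of `Fraction` to avoid floating-point drift are our own choices.

```python
from fractions import Fraction

def aggregation_points(num_iterations, r):
    """Return the times (in iterations) at which aggregation happens for a given r."""
    step = 1 / Fraction(r).limit_denominator(1000)  # iterations between aggregations
    points = []
    t = step
    while t <= num_iterations:
        points.append(float(t))
        t += step
    return points

if __name__ == "__main__":
    print(aggregation_points(10, 0.2))  # once every 5 iterations
    print(aggregation_points(3, 1))     # once per iteration
    print(aggregation_points(2, 2))     # twice per iteration
```

The number of returned points grows linearly with r, which mirrors the proportional growth in communication cost.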

Training the classifier of the target model
First, the target model is initialized with the trained source model; the feature extractor is then frozen (its parameters are not updated during training), and the classifier of the target model is trained. It is worth noting that SHOT fixes the classifier when training the target model and only trains the feature extractor. In contrast, we fix the feature extractor and train the classifier of the target model on the target domain.
The loss function of target model training is composed of two parts: information maximization (IM) loss and pseudo-labelling (PL) cross-entropy loss.
(1) Information maximization. The ideal probability vector output should be close to a one-hot encoding, but the probability vectors should differ from each other for samples of different classes. Therefore, in this paper, information maximization loss is applied to maximize the probability of the label with the maximum probability (individual certainty of predicted results), while keeping the number of predicted labels of each category roughly the same (global diversity of predicted results). IM loss is composed of $L_{ent}$ and $L_{div}$:

$$L_{ent}(f_t; X_t) = -\mathbb{E}_{x_t \in X_t} \sum_{k=1}^{K} \delta_k(f_t(x_t)) \log \delta_k(f_t(x_t)),$$

$$L_{div}(f_t; X_t) = \sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k,$$

where $\hat{p} = \mathbb{E}_{x_t \in X_t}[\delta(f_t(x_t))]$ is the element-wise mean of the probability vectors over the whole target domain. $L_{ent}(f_t; X_t)$ makes the prediction result of each sample more certain; $L_{div}(f_t; X_t)$ promotes the diversity of the probability vector outputs.
(2) Pseudo-labelling. Using IM loss alone can lead the model in the wrong direction. This negative effect comes from inaccurate network output. For example, a target domain sample from the first class that passes through the network with a probability vector of [0.34, 0.36, 0.1, 0.1, 0.1] might be forced towards an expected output of [0.0, 1.0, 0.0, 0.0, 0.0]. To mitigate this effect, pseudo-labels should be applied to each unlabelled sample to better supervise the training of the classifier. In this paper, a self-supervised pseudo-labelling strategy is applied to reduce the impact of such adverse factors.
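The two parts of the IM loss can be sketched in NumPy as follows. The sketch assumes the entropy form for the certainty term and the negative entropy of the mean prediction for the diversity term; the small constant `eps` and the function names are illustrative additions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def information_maximization(logits, eps=1e-8):
    """Return (L_ent, L_div) for a batch of target-domain logits."""
    p = softmax(logits)
    # L_ent: mean entropy of the individual predictions (low when each is confident).
    l_ent = float(-(p * np.log(p + eps)).sum(axis=1).mean())
    # L_div: negative entropy of the mean prediction (low when classes are balanced).
    p_mean = p.mean(axis=0)
    l_div = float((p_mean * np.log(p_mean + eps)).sum())
    return l_ent, l_div

if __name__ == "__main__":
    balanced = np.array([[20.0, 0.0], [0.0, 20.0]])   # confident and diverse
    collapsed = np.array([[20.0, 0.0], [20.0, 0.0]])  # confident but one class only
    print(information_maximization(balanced))
    print(information_maximization(collapsed))
```

Both batches are equally confident, so the certainty term cannot tell them apart; only the diversity term penalizes the batch that collapses onto a single class.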
First, the centroid of each category on the target domain is calculated as described in Subsection 3.1, and then pseudo-labels are generated on the basis of the centroids $c^T$:

$$\hat{y}_t = \arg\min_k D_f\big(g_t(x_t), c_k^T\big).$$

Because $\hat{y}_t$ is generated from centroids that are themselves obtained in an unsupervised manner, $\hat{y}_t$ is called a self-supervised pseudo-label.
In summary, given the source model $f_s(x) = h_s(g_s(x))$ and the pseudo-labels above, this paper fixes the feature extractor $g_t = g_s$ to learn the classifier $h_t$, and the total loss function on the target domain is as follows:

$$L(h_t) = L_{ent}(f_t; X_t) + L_{div}(f_t; X_t) - \beta\, \mathbb{E}_{(x_t, \hat{y}_t) \in X_t \times \hat{Y}_t} \sum_{k=1}^{K} \mathbb{1}[k = \hat{y}_t] \log \delta_k(f_t(x_t)),$$

where β is greater than or equal to 0 and controls the weight of the pseudo-label cross-entropy loss.
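How the pseudo-label term interacts with the IM loss can be seen in a small NumPy sketch of a combined target loss. It assumes the total loss is the sum of the two IM terms and β times the pseudo-label cross-entropy; the default `beta=0.3` is a hypothetical value chosen for illustration, and the function names are ours.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def target_loss(logits, pseudo_labels, beta=0.3, eps=1e-8):
    """IM loss plus beta-weighted cross-entropy against the pseudo-labels."""
    log_p = log_softmax(logits)
    p = np.exp(log_p)
    l_ent = -(p * log_p).sum(axis=1).mean()
    p_mean = p.mean(axis=0)
    l_div = (p_mean * np.log(p_mean + eps)).sum()
    rows = np.arange(len(pseudo_labels))
    l_pl = -log_p[rows, pseudo_labels].mean()
    return float(l_ent + l_div + beta * l_pl)

if __name__ == "__main__":
    # The ambiguous sample from the text: true class 0, network slightly prefers class 1.
    logits = np.log(np.array([[0.34, 0.36, 0.1, 0.1, 0.1]]))
    print(target_loss(logits, np.array([0])))  # pseudo-label from centroids: class 0
    print(target_loss(logits, np.array([1])))  # label the raw output would suggest
```

With β = 0 the two calls return the same value, so the loss alone cannot correct the sample; with β > 0 the pseudo-label supplies a training signal towards the class chosen by the centroids.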
In the end, only one result model is generated, and the training of the source model and target model is completely separated. After the decoupling of the two-step operation, the efficiency is substantially improved.
From the perspective of the federated optimization problem, the source domains involved here correspond to a small number of nodes that play a key role in federated learning. Moreover, the distribution differences among several source domains also fully reflect the property of non-IID in federated optimization. Both MDMGB and full SFDA can be extended to other network architectures very directly. Algorithm 1 provides the complete pseudo-code of the SFDA training process.

Results and discussion
This section first introduces the datasets and settings for the experiments. Then, we compare SFDA with some baselines in terms of accuracy. Subsection 4.4 analyses the effects of label smoothing and pseudo-labelling. To some extent, Subsection 4.5 reflects the feasibility of SFDA in actual scenarios. Subsection 4.6 illustrates the advantages of SFDA in communication efficiency.

Datasets and settings
In this paper, SFDA was evaluated on the DigitFive and DomainNet datasets. DigitFive is a widely used benchmark dataset. It contains five digit recognition datasets, namely MNIST, MNIST-M, SYN, SVHN and USPS. DomainNet is by far the largest multi-source domain adaptation dataset, with a total of 6 domains (Clipart, Infograph, Painting, Quickdraw, Real and Sketch), 345 categories and approximately 600,000 images. Figure 3 shows some sample data (images) from DigitFive and DomainNet. Following the previous settings [13], this paper uses a three-layer CNN as the backbone network on DigitFive and a pre-trained ResNet101 on DomainNet. In the experiments, we set each domain (dataset) as the target domain in turn and all the remaining domains as the source domains, and then calculate the single accuracy and the average accuracy.

Baselines
In this paper, the advantages of SFDA are highlighted by comparison with the following methods: DAN [18], multilayer adaptation with multi-kernel MMD; M³SDA [32], which dynamically aligns moments of feature distributions; DANN, based on adversarial training; and FADA, an advanced distributed UMDA method. This paper also designed two baseline experiments without domain adaptation, namely oracle and source-only. Oracle conducts supervised learning directly on the target domain. Source-only simply combines the source domains into a hybrid domain to train a single model in a supervised learning manner.

Performance on DigitFive
Table 1 shows that SFDA exceeds FADA in both single and average accuracy. The improvements in accuracy range from approximately 6 percentage points to approximately 23 percentage points. Except for the result when MNIST-M is used as the target domain, SFDA also shows great improvement in single accuracy and average accuracy compared with M³SDA. When MNIST-M is the target domain, the difference in accuracy between the two is less than 0.5%. In general, the performance of distributed UMDA is not as good as that of common multi-source domain adaptation methods because source domain and target domain data cannot be accessed at the same time and because communication is limited. However, the accuracy of SFDA not only exceeds that of the distributed unsupervised multi-source domain adaptation method FADA to a large extent but also generally exceeds that of the common multi-source domain adaptation method M³SDA.

Influence of label smoothing and pseudo-labelling
To verify the effectiveness of the label smoothing technique and the pseudo-labelling technique in this paper, we designed a comparative experiment, and the results are shown in Table 2. According to the results in Table 2, the average accuracy increased from 88.80% to 89.03% after the application of pseudo-labelling. Without pseudo-labelling, the accuracy on SVHN is significantly reduced; as mentioned earlier, models trained with IM loss alone can go in the wrong direction. The accuracy of the model without LS was 89.03%, while the accuracy of the model with label smoothing was 89.17%. This is because the standard cross-entropy loss function makes the source model overfit the source domain, which reduces the generalization ability of the model. When the practice of SHOT [29] is followed, that is, when only the feature extractor is trained on the target domain, the performance of the target model degrades severely. The learning rate in the experiment is the best value obtained over multiple experiments; that is, under the current setting, this value of the learning rate brings the highest accuracy. However, this article does not exhaust all possible values.

Performance on DomainNet
To fully verify the generalization performance of the model, we selected the large DomainNet dataset to conduct experiments and simulated six scenes: Clipart, Infograph, Painting, Quickdraw, Real and Sketch. The experimental results are shown in Table 3. Each dataset takes its turn as the target domain; the models are compared vertically and the results horizontally. Table 3 shows that, similarly, SFDA significantly exceeds FADA in terms of average accuracy. SFDA performs better on 4 of 6 tasks. For the other two tasks, with Quickdraw and Real as the target domains, SFDA is only about 1% behind FADA in accuracy. The experimental results on DomainNet provide a reference for the practical application of the algorithm in this paper because its data volume and number of categories are very large and the images themselves are sufficiently complex.

Comparison of communication cost
Table 4 shows how the number of communications changes with the total number of epochs when batch_size = 64 and MNIST-M is the target domain. The number of communications required by SFDA is significantly smaller: FADA requires around three orders of magnitude more communications than SFDA.
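The gap in communication counts follows from simple arithmetic, which the sketch below makes explicit. It assumes that FADA communicates once per minibatch and that SFDA communicates r times per epoch; the dataset size of 25,000 samples, r = 0.2 and the function names are hypothetical values for illustration and are not taken from Table 4.

```python
import math

def fada_communications(num_samples, batch_size, epochs):
    """One exchange per minibatch, per source domain (assumed behaviour of FADA)."""
    return epochs * math.ceil(num_samples / batch_size)

def sfda_communications(epochs, r):
    """r aggregations per epoch; r < 1 means one aggregation every 1/r epochs."""
    return int(math.floor(epochs * r + 1e-9))

if __name__ == "__main__":
    epochs, batch_size, num_samples = 10, 64, 25000  # hypothetical setting
    fada = fada_communications(num_samples, batch_size, epochs)
    sfda = sfda_communications(epochs, r=0.2)
    print(fada, sfda, fada / sfda)
```

Under these assumed numbers the ratio is on the order of a thousand, which is consistent in scale with the gap described above.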

Conclusions and future work
This paper proposes a distributed UMDA approach under a federated learning framework. A unified source model is obtained by aggregating the models learned in multiple source domains (a label smoothing technique is used for the training of the models in each source domain), and the target model is initialized with this model. Then, the classifier of the target model is trained in a self-supervised way by using information maximization and pseudo-labelling. SFDA takes into account both data privacy protection and communication efficiency while achieving high accuracy. Compared with the existing algorithms, both the communication efficiency and the accuracy are greatly improved. The effect of self-supervised learning on the target domain depends largely on the quality of the source model; in future work, we hope to design a more effective weighting strategy to address this dependence.

Appendix A
If each source domain model can be adequately trained, it can represent its own source domain almost perfectly, which makes the size of the data volume meaningless in the weighting process. However, when the source models are aggregated in the early stages of training, they may not be fully trained due to the small number of iterations. Therefore, this part explains the defects of using data volume from another aspect. Suppose that the closeness of each source domain to the target domain has been quantified as $\{c_1, c_2, \ldots, c_K\}$, where a larger value means a closer relationship between the domains. For particular data volumes of the source domains, if the weight of a source domain is calculated in a way that also depends on its data volume, the size of the data volume can exactly offset the effect of the tightness.

Figure 2
Figure 2 shows the complete algorithm framework. The algorithm flow in this paper is mainly divided into two steps: training the source model and training the target model classifier. The source model is composed of a feature learning module and a classifier module (hypothesis). SFDA leaves the feature extractor fixed and uses the classifier module as the initializer of the target domain learning.

Fig. 2
Fig. 2 SFDA algorithm framework. The figure shows the function of each part and the processes between the parts.

Fig. 3
Fig. 3 Examples of images in (a) DigitFive and (b) DomainNet. DigitFive contains five digit recognition datasets. DomainNet is by far the largest multi-source domain adaptation dataset, with a total of 6 domains (Clipart, Infograph, Painting, Quickdraw, Real, Sketch).

Table 1
Accuracy (%) of unsupervised multi-source domain adaptation on DigitFive

Table 2
Influence of label smoothing and pseudo-labelling on the accuracy (%) in DigitFive

Table 3
Accuracy (%) of unsupervised multi-source domain adaptation on DomainNet

Table 4
Comparison of time consumption between FADA and SFDA