In this section, we introduce the procedure of the proposed TDCAF. First, we utilize unlabelled samples to train a siamese network that predicts similarity scores for sample pairs. Then, we extract deep convolutional activation-based features from the trained siamese network. Finally, we apply the proposed vertical pooling strategy for the final feature representation.
The siamese network
Figure 1 briefly illustrates the architecture of the siamese network. Given a sample pair as input, the siamese network predicts the similarity score of the two samples. The siamese network typically consists of two CNN models, one connection function, and one fully connected (FC) layer. We train the siamese network on sample pairs with weight sharing. Here, the sample pairs consist of similar pairs and dissimilar pairs: when two samples from the source domain and the target domain belong to the same class, we define them as a similar pair; when they belong to different classes, we define them as a dissimilar pair. Training the siamese network on such pairs forces samples from different domains into the same feature space, so that the network can learn domain-invariant characteristics. The CNN model can be CaffeNet [17], VGG19 [27], or ResNet-50 [28], where we change the number of output units in the final FC layer according to the number of classes when fine-tuning the networks.
It should be noted that we train the siamese network with weight sharing, which means the two CNN models use the same trainable parameters. The weight-sharing strategy is an important principle of siamese networks because it reduces the total number of trainable parameters. Furthermore, weight sharing leads to more efficient training and a more effective model, especially when similar local structures appear in the input feature space.
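As a concrete illustration of weight sharing, the following minimal PyTorch-style sketch reuses a single backbone instance for both inputs, so the two branches share every trainable parameter by construction. The ResNet-50 backbone and the 1024-dim branch output follow the choices mentioned above; the input sizes and random tensors are placeholders, not the actual training data.

```python
import torch
import torch.nn as nn
from torchvision import models

class SharedBackbone(nn.Module):
    """One CNN instance reused for both inputs, so the two branches
    share every trainable parameter by construction."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        cnn = models.resnet50()                 # CaffeNet / VGG19 are alternative backbones
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)  # 1024-dim branch output
        self.cnn = cnn

    def forward(self, x1, x2):
        f1 = self.cnn(x1)   # the same weights ...
        f2 = self.cnn(x2)   # ... are applied to both samples of the pair
        return f1, f2

backbone = SharedBackbone()
f1, f2 = backbone(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224))
print(f1.shape, f2.shape)   # torch.Size([4, 1024]) for both branches
```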
The connection function is used to connect the output vectors of the two CNN models. In our model, we define the connection function as
$$\begin{array}{@{}rcl@{}} f = (f_{1} - f_{2})^{2} \end{array} $$
(1)
where f1 and f2 are the output vectors of the two CNN models, respectively, and both are 1024-dim vectors. The square in Eq. (1) is applied element-wise, so f, the output of the connection function, is also a 1024-dim vector.
As shown in Fig. 1, we then take f as the input of the FC layer, and the resulting vector x can be expressed as
$$\begin{array}{@{}rcl@{}} x = \theta \circ f \end{array} $$
(2)
where ∘ denotes the convolution operation, and θ denotes the 1024-dim parameter vector of the FC layer.
Since this is a binary classification problem, we use the final layer to convert x into a 2-dim vector (z1, z2), which is then fed into the softmax function to obtain the predicted probability that the input sample pair belongs to the same class. The softmax function is formulated as
$$\begin{array}{@{}rcl@{}} \hat{p_{i}} = \frac{e^{z_{i}}}{\sum_{k=1}^{2}{e^{z_{k}}}} \end{array} $$
(3)
where \(\hat {p_{i}}\) is the predicted probability, and \(\hat {p_{1}} + \hat {p_{2}} = 1\).
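The head of the network, Eqs. (1)-(3), can be illustrated numerically as follows. The text only states that the branch outputs, θ, and f are all 1024-dim; the sketch therefore treats ∘ as an element-wise product and the final layer as a 2×1024 linear map, both of which are assumptions for illustration rather than the exact operations of the model. The random vectors stand in for real branch outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.normal(size=1024)          # branch outputs (placeholders)
f2 = rng.normal(size=1024)

f = (f1 - f2) ** 2                  # Eq. (1): element-wise squared difference
theta = rng.normal(size=1024)       # FC parameters (1024-dim, as in the text)
x = theta * f                       # Eq. (2): '∘' read here as an element-wise product
                                    # (assumption; only the matching dimensions are given)

W = rng.normal(size=(2, 1024))      # illustrative final layer mapping x to (z1, z2)
z = W @ x
p_hat = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # Eq. (3), numerically stable
print(p_hat, p_hat.sum())           # the two probabilities sum to 1
```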
Finally, we use the cross-entropy loss for this binary classification task
$$\begin{array}{@{}rcl@{}} Loss = \sum_{i=1}^{2}{-p_{i}\ \text{log}(\hat{p_{i}})} \end{array} $$
(4)
where pi is the true probability. For a similar pair, p1 = 1 and p2 = 0, while for a dissimilar pair, p1 = 0 and p2 = 1.
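Eq. (4) with these one-hot true probabilities can be illustrated with a short numeric example; the predicted probabilities below are arbitrary values chosen for the sketch.

```python
import numpy as np

def pair_loss(p_hat, similar):
    """Eq. (4) with one-hot true probabilities:
    similar pair -> (p1, p2) = (1, 0); dissimilar pair -> (0, 1)."""
    p = np.array([1.0, 0.0]) if similar else np.array([0.0, 1.0])
    return -np.sum(p * np.log(p_hat))

p_hat = np.array([0.8, 0.2])                 # arbitrary predicted probabilities
print(pair_loss(p_hat, similar=True))        # -log(0.8) ~= 0.223
print(pair_loss(p_hat, similar=False))       # -log(0.2) ~= 1.609
```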
In the forward propagation, substituting Eq. (3) into Eq. (4) gives
$$\begin{array}{@{}rcl@{}} Loss = -p_{1}\ \text{log}\frac{e^{z_{1}}}{\sum_{k=1}^{2}{e^{z_{k}}}} - p_{2}\ \text{log}\frac{e^{z_{2}}}{\sum_{k=1}^{2}{e^{z_{k}}}} \end{array} $$
(5)
For a similar pair, i.e., p1 = 1 and p2 = 0, Eq. (5) can be rewritten as
$$\begin{array}{@{}rcl@{}} Loss = -\text{log}\frac{e^{z_{1}}}{\sum_{k=1}^{2}{e^{z_{k}}}} \end{array} $$
(6)
and for a dissimilar pair, i.e., p1 = 0 and p2 = 1, Eq. (5) is reformulated as
$$\begin{array}{@{}rcl@{}} Loss = -\text{log}\frac{e^{z_{2}}}{\sum_{k=1}^{2}{e^{z_{k}}}} \end{array} $$
(7)
We adopt mini-batch stochastic gradient descent (SGD) [29] and the error backpropagation (BP) algorithm to train the siamese network. In the backpropagation, we take the derivatives of Eqs. (6) and (7) with respect to z1 and z2, respectively, and obtain
$$\begin{array}{@{}rcl@{}} Loss' = \frac{e^{z_{1}}}{\sum_{k=1}^{2}{e^{z_{k}}}} - 1 \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} Loss' = \frac{e^{z_{2}}}{\sum_{k=1}^{2}{e^{z_{k}}}} - 1 \end{array} $$
(9)
Generally, a CNN contains a large number of trainable parameters, so an effective model requires a large amount of training data. Training a CNN model with insufficient training samples leads to overfitting. To address this problem, we train the siamese network by fine-tuning a pre-trained CNN model.
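In practice, fine-tuning amounts to loading publicly available pre-trained weights and replacing the final FC layer so that its output dimension matches the number of classes. The snippet below is a hypothetical setup using torchvision's ImageNet weights for a ResNet-50 backbone; the class count is a placeholder.

```python
import torch.nn as nn
from torchvision import models

# Hypothetical fine-tuning setup: start from ImageNet pre-trained weights
# (torchvision >= 0.13 syntax) and replace the final FC layer so that its
# output dimension matches the number of classes in the task at hand.
num_classes = 10                                    # placeholder class count
cnn = models.resnet50(weights="IMAGENET1K_V1")      # pre-trained backbone
cnn.fc = nn.Linear(cnn.fc.in_features, num_classes)
# All layers remain trainable; only the new head starts from random weights.
```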
Transfer deep convolutional activation-based features
The convolutional layers are the main components of a CNN model and capture local image characteristics [30, 31]. Hence, we extract deep convolutional activation-based features from a certain convolutional layer to represent images. Suppose that a certain convolutional layer of the CNN model produces N feature maps. As shown in Fig. 2a, different feature maps tend to have different activations for the same image, meaning that these feature maps describe different patterns. Hence, to obtain complete features, all feature maps of a convolutional layer should be considered for the image representation. Traditional methods aggregate all convolutional activations of one feature map into a single activation value, as shown in Fig. 2a, chosen as either the maximum or the average value of that feature map. The activation values of all feature maps are then concatenated into an N-dim feature vector. However, such a feature vector is insensitive to variations in the spatial distribution of the activations.
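For comparison, the traditional per-map aggregation described above can be sketched as follows; the layer shape (N = 512 feature maps of size 7×7) and the random activations are illustrative only.

```python
import numpy as np

# Traditional aggregation: each of the N feature maps collapses to a single
# activation (its maximum or its average), giving an N-dim descriptor that no
# longer encodes where in the H x W grid the activations occurred.
N, H, W = 512, 7, 7                       # illustrative layer shape
maps = np.random.rand(N, H, W)            # placeholder convolutional activations

max_descriptor = maps.max(axis=(1, 2))    # N-dim descriptor via per-map max
avg_descriptor = maps.mean(axis=(1, 2))   # N-dim descriptor via per-map average
print(max_descriptor.shape)               # (512,)
```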
To address this problem, we propose the vertical pooling strategy, which uses either a sum operation or a max operation. In the sum operation, the deep convolutional activations at the same position of all feature maps are added, resulting in the CASM of size H×W. The CASM is then straightened into an (H×W)-dim TDCAF, which therefore obtains complete information and preserves the spatial information of images. To describe the sum operation more clearly, let fn(a,b) be the convolutional activation at position (a,b) of the n-th feature map fn; the sum-operation feature Fs(a,b) at this position is defined as
$$\begin{array}{@{}rcl@{}} F_{s}(a,b)=\sum_{n=1}^{N}{f^{n}(a, b)} ~(a\in H, b\in W) \end{array} $$
(10)
Then an image can be represented as Fs = {Fs(1,1), Fs(1,2), …, Fs(H,W)}. The process is shown in Fig. 2b.
Similarly, in the max operation, we preserve the maximum convolutional activation at the same position across all feature maps; the resulting activation is salient and more robust to local transformations. The max-operation feature Fm(a,b) is formulated as
$$\begin{array}{@{}rcl@{}} F_{m}{(a,b)}=\underset{1\leq n \leq N}{\text{max}}\ {f^{n}{(a, b)}} \end{array} $$
(11)
and an image can be represented as Fm = {Fm(1,1), Fm(1,2), …, Fm(H,W)}.