### 4.1 Experimental setup

Our method was implemented in Keras [36], a high-level neural network API, and CEKA [37], an open-source software package for machine learning in crowdsourcing that contains many existing benchmarks for label aggregation. One real-world dataset was used in our experiments. *Div400* is an image retrieval dataset [38] created to support evaluation in various areas of social media photo retrieval, such as re-ranking, crowdsourcing, and relevance feedback. It gathers 15,871 Flickr photos together with relevance judgments collected from crowds, and contains 160 query texts, each corresponding to fewer than 120 images.

Some query texts have few responses, and for some of them all responses belong to the same category; we therefore remove the query texts whose responses are all positive or all negative. In addition, the data set is unbalanced: negative examples are far fewer than positive examples. We use data augmentation to generate additional images for each query whose negative examples amount to less than 10% of its positive examples. Because deep learning is data-driven and needs a large amount of data to train an appropriate model, data augmentation artificially enlarges the data set using only information already in the training data, which helps to avoid overfitting. Common data augmentation operations include random cropping, flipping, rotation, and resizing. After these steps, we obtain 16,536 photos for training and 1,170 photos for testing.

For each query text and image pair, we simulate 5 workers, each of whom provides a label indicating whether the pair is relevant or irrelevant; 1 and 0 denote relevant and irrelevant, respectively. We assign the workers different sensitivities *μ* and specificities *γ* to model their different accuracies on the two categories, because workers show different bias tendencies towards positive and negative labels: when the ground truth of an example agrees with a worker's bias, the worker is more likely to provide the true label, and vice versa. For instance, if worker *j* prefers to provide positive labels, *j* labels positive examples more accurately. When the true label of a pair is 1, worker *j* provides the correct label with probability *μ*_{j}; when the true label is 0, worker *j* provides the correct label with probability *γ*_{j}. The parameter values for the 5 workers are *μ*=[0.6,0.9,0.5,0.9,0.9] and *γ*=[0.3,0.2,0.5,0.8,0.1].

To obtain image features, we use a VGG-16 model pre-trained on the ImageNet dataset. For word embedding, we use word2vec trained on the English Wikipedia dump, and we use Keras to build a standard LSTM layer that learns the text features. The outputs of the LSTM and VGG-16 are both reshaped to size (*n*,384). The similarity matrix is set as \(\textbf {M}_{r}\in \mathbb {R}^{384\times 384}\). The output layer has dimension 2, representing whether the result is relevant or irrelevant.
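The worker simulation described above can be illustrated with a minimal sketch. It assumes each worker labels every pair independently; the function name `simulate_labels` and the fixed random seed are illustrative choices, not part of the original setup.

```python
import random

# Sensitivities and specificities of the five simulated workers (values from the text).
MU = [0.6, 0.9, 0.5, 0.9, 0.9]     # P(worker label = 1 | true label = 1)
GAMMA = [0.3, 0.2, 0.5, 0.8, 0.1]  # P(worker label = 0 | true label = 0)


def simulate_labels(truths, mu=MU, gamma=GAMMA, seed=0):
    """Return one row of noisy worker labels per example (hypothetical helper)."""
    rng = random.Random(seed)  # assumption: a fixed seed, only for reproducibility
    rows = []
    for y in truths:
        row = []
        for m, g in zip(mu, gamma):
            # A worker is correct with probability mu_j on positives
            # and gamma_j on negatives.
            p_correct = m if y == 1 else g
            correct = rng.random() < p_correct
            row.append(y if correct else 1 - y)
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(simulate_labels([1, 0, 1, 0]))
```

With these parameters, the fifth worker (*μ*=0.9, *γ*=0.1) answers "relevant" about 90% of the time regardless of the true label, which is the kind of bias towards positive labels that the text describes.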

We compare our method with five common ground truth inference methods: MV, GTIC [22], Opt-D&S [21], DS [17], and GLAD [20].

The purpose of ground truth inference algorithms is to minimize the empirical risk over the *I* examples, so that the integrated label \(\hat {y}_{i}\) has a good chance of matching the ground truth *y*_{i}:

$$ \mathcal{R}_{emp}=\frac{1}{I}\sum_{i=1}^{I}{\mathbb{I}\left(\hat{y}_{i}\neq{y_{i}}\right)} $$

(4)

where \(\mathbb {I}\) denotes an indicator function whose output is 1 if its argument is true and 0 otherwise. As the simplest yet widely used method, MV follows a simple principle: if more than half of the workers provide the same label *c*_{k}, then the integrated label is *c*_{k}. DS is an EM-based method. It assumes that each worker has an individual confusion matrix \(T^{(j)}=\left \{\pi _{kt}^{(j)}\right \}\) describing their labeling behavior, where each element \(\pi _{kt}^{(j)}\) denotes the probability that worker *w*_{j} provides label *c*_{t} to an example whose true label is *c*_{k}. The purpose of DS is to estimate the label of each example and the confusion matrix of each worker simultaneously. DS first initializes the confusion matrices and the prior probabilities of each category. Usually, all the parameters are assumed to follow a uniform distribution, so \(\pi _{kt}^{(j)}=\frac {1}{K}\) and \(P(c_{k})=\frac {1}{K}\), where 1≤*k*≤*K*, 1≤*t*≤*K*, and 1≤*j*≤*J*. It then alternates between two steps. In the E-step, it estimates the probability that each sample *e*_{i} belongs to each category *c*_{k}, given the label *l*_{ij} that each worker *w*_{j} assigned to *e*_{i}, that is

$$ P\left(\hat{y}_{i}=c_{k}|\mathcal{D}\right)=\frac{P\left(c_{k}\right)\prod_{j=1}^{J}\prod_{t=1}^{K}\left(\pi_{kt}^{(j)}\right)^{\mathbb{I}\left(l_{ij}=c_{t}\right)}} {\sum_{q=1}^{K}P\left(c_{q}\right)\prod_{j=1}^{J}\prod_{t=1}^{K}\left(\pi_{qt}^{(j)}\right)^{\mathbb{I}\left(l_{ij}=c_{t}\right)}} $$

(5)

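After that, in the M-step, we recalculate each confusion matrix from the labels *l*_{ij} that worker *w*_{j} assigned to the examples *e*_{i}, as well as the prior probability of each category, that is

$$ \hat{\pi}_{kt}^{(j)}=\frac{\sum_{i=1}^{I}\mathbb{I}\left(\hat{y}_{i}=c_{k}\right)\mathbb{I}\left(l_{ij}=c_{t}\right)} {\sum_{t'=1}^{K}\sum_{i=1}^{I}\mathbb{I}\left(\hat{y}_{i}=c_{k}\right)\mathbb{I}\left(l_{ij}=c_{t'}\right)} $$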

(6)

$$ \hat{P}\left(c_{k}\right)=\frac{1}{I}\sum_{i=1}^{I}\mathbb{I}\left(\hat{y}_{i}=c_{k}\right) $$

(7)
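The empirical risk, MV, and the DS iteration can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the CEKA implementation: classes are 0-based integers, every worker labels every example, the integrated labels are initialized from MV rather than from uniform parameters (a uniform start makes the first E-step uninformative), and a small constant `eps` smooths the counts to avoid division by zero. All function names are hypothetical.

```python
def majority_vote(labels, K=2):
    """MV: pick the most frequent worker label (ties go to the lower class index)."""
    out = []
    for row in labels:
        counts = [row.count(k) for k in range(K)]
        out.append(counts.index(max(counts)))
    return out


def empirical_risk(y_hat, y):
    """Eq. (4): fraction of examples whose integrated label differs from the truth."""
    return sum(a != b for a, b in zip(y_hat, y)) / len(y)


def dawid_skene(labels, K=2, n_iter=20, eps=1e-6):
    """Hard-assignment DS sketch; labels[i][j] is the label worker j gave example i."""
    I, J = len(labels), len(labels[0])
    y_hat = majority_vote(labels, K)  # assumption: MV initialization
    pi, prior = None, None
    for _ in range(n_iter):
        # M-step, Eq. (7): class priors from the current integrated labels.
        prior = [(y_hat.count(k) + eps) / (I + K * eps) for k in range(K)]
        # M-step, Eq. (6): count how often worker j answers t when y_hat is k.
        pi = [[[eps] * K for _ in range(K)] for _ in range(J)]
        for i, row in enumerate(labels):
            for j, t in enumerate(row):
                pi[j][y_hat[i]][t] += 1.0
        for j in range(J):
            for k in range(K):
                total = sum(pi[j][k])
                pi[j][k] = [v / total for v in pi[j][k]]
        # E-step, Eq. (5): the denominator is omitted because it does not
        # change which class attains the maximum.
        new_y_hat = []
        for row in labels:
            scores = []
            for k in range(K):
                p = prior[k]
                for j, t in enumerate(row):
                    p *= pi[j][k][t]
                scores.append(p)
            new_y_hat.append(scores.index(max(scores)))
        if new_y_hat == y_hat:  # converged: integrated labels are stable
            break
        y_hat = new_y_hat
    return y_hat, pi, prior


if __name__ == "__main__":
    truth = [1, 0, 1, 0, 1, 0]
    # Two reliable workers and one worker who always answers incorrectly.
    noisy = [[y, y, 1 - y] for y in truth]
    y_hat, pi, prior = dawid_skene(noisy)
    print(y_hat, empirical_risk(y_hat, truth))
```

Because the sketch starts from the MV labels, the M-step runs before the first E-step; the order within each iteration is otherwise the same as in the description.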

Afterwards, the E-step and M-step are repeated until all the estimates converge. GLAD is also an EM-based method. Unlike DS, GLAD considers not only the ability of workers but also the difficulty of individual samples. GLAD models two parameters, *α*_{j} and *β*_{i}, which describe the expertise of worker *w*_{j} and the difficulty of sample *e*_{i}, respectively, where *α*_{j}∈(−*∞*,+*∞*) and 1/*β*_{i}∈[0,+*∞*). Worker *w*_{j} is more likely to provide a correct label when *α*_{j} is higher; *α*_{j}=0 means that worker *w*_{j} is guessing at random. Similarly, sample *e*_{i} is harder to label correctly when 1/*β*_{i} is higher; 1/*β*_{i}=0 means that sample *e*_{i} is so easy that anyone can label it correctly. Accordingly, GLAD defines the probability that the label *l*_{ij} given by worker *w*_{j} to sample *e*_{i} is correct as

$$ P\left(l_{ij}=y_{i}|\alpha_{j},\beta_{i}\right)=\frac{1}{1+e^{-\alpha_{j}\beta_{i}}} $$

(8)

Just like DS, after initialization, GLAD iterates between two steps. In the E-step, we obtain the posterior probabilities of all *y*_{i}:

$$ P\left(y_{i}|\mathbf{l}_{i},\alpha,\beta\right)\varpropto P\left(y_{i}\right)\prod_{j=1}^{J}P\left(l_{ij}|y_{i},\alpha_{j},\beta_{i}\right) $$

(9)
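For binary labels, Eqs. (8) and (9) can be evaluated directly. The following is a minimal sketch, assuming a wrong label is given with probability one minus the value of Eq. (8); the function names and the uniform default prior are illustrative assumptions.

```python
import math


def p_correct(alpha_j, beta_i):
    """Eq. (8): probability that worker j labels sample i correctly."""
    return 1.0 / (1.0 + math.exp(-alpha_j * beta_i))


def glad_posterior(row, alpha, beta_i, prior=(0.5, 0.5)):
    """Eq. (9) for binary labels: normalized posterior over y_i in {0, 1}.

    row[j] is the label worker j gave this sample, alpha[j] is that worker's expertise.
    """
    scores = []
    for c in (0, 1):
        s = prior[c]
        for label, a in zip(row, alpha):
            p = p_correct(a, beta_i)
            # The worker is right if the label equals the candidate class c.
            s *= p if label == c else 1.0 - p
        scores.append(s)
    z = sum(scores)
    return [s / z for s in scores]


if __name__ == "__main__":
    # One expert (alpha = 2) says 1, one random guesser (alpha = 0) says 0.
    print(glad_posterior([1, 0], alpha=[2.0, 0.0], beta_i=1.0))
```

In the usage example, the random guesser contributes a factor of 0.5 to both classes, so the posterior for class 1 equals the expert's probability of being correct.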

In the M-step, we can update the values of *α* and *β* by maximizing a standard auxiliary function \(\mathcal {Q}\):

$$ \begin{aligned} \mathcal{Q}\left(\alpha,\beta\right)&=E\left[\ln P\left(\mathbf{l},y|\alpha,\beta\right)\right] \\ &=\sum_{i}E\left[\ln P(y_{i})\right]+\sum_{ij}E\left[\ln P\left(l_{ij}|y_{i},\alpha_{j},\beta_{i}\right)\right] \end{aligned} $$

(10)

Because EM-based algorithms are not guaranteed to converge to the global optimum, Opt-D&S uses a spectral method to initialize the confusion matrices in the DS method. It first divides the workers into three disjoint groups and calculates the average confusion matrix of each group separately. Then, the initial confusion matrix of each worker is set to the average value. In the second step, the DS algorithm runs from these initial values. Meanwhile, to solve multi-class inference problems, GTIC [22] was proposed based on Bayesian statistics. GTIC first generates *K*+1 features for each example *e*_{i} as \(\boldsymbol {\alpha }^{i}=\left \{\alpha _{1}^{i},...,\alpha _{K+1}^{i}\right \}\). Secondly, a clustering algorithm is run on the dataset to divide the examples into *K* clusters; the number of examples in cluster *n* is denoted **N**_{n}, where *n*=1,...,*K*. Finally, for each cluster, we calculate a vector with *K* elements:

$$ \nu_{k}^{n}=\sum_{i}^{\mathbf{N}_{n}}{\alpha_{k}^{i}},~\text{where}~1\leqslant n \leqslant K $$

(11)

The index *k* of the maximum element \(\nu _{k}^{n}\) in the vector *ν*^{n} then maps cluster *n* to class *k*, and all the examples belonging to cluster *n* are assigned to class *k*.
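The cluster-to-class mapping of Eq. (11) can be sketched as follows. The sketch takes the features and the cluster assignments as given, because neither the feature construction nor the clustering algorithm is specified here. It also assumes that only the first *K* of the *K*+1 features enter the sum, since *ν*^{n} is described as having *K* elements. The function name and the 0-based indices are illustrative.

```python
def map_clusters_to_classes(features, cluster_ids, K):
    """Map each cluster to a class via Eq. (11) (hypothetical helper).

    features[i] holds the feature vector of example i; cluster_ids[i] is the
    0-based cluster of example i.
    """
    # nu[n][k] sums feature k over all examples assigned to cluster n.
    nu = [[0.0] * K for _ in range(K)]
    for alpha, n in zip(features, cluster_ids):
        for k in range(K):  # assumption: only the first K features are summed
            nu[n][k] += alpha[k]
    # Each cluster takes the class with the largest summed feature.
    cluster_to_class = [row.index(max(row)) for row in nu]
    labels = [cluster_to_class[n] for n in cluster_ids]
    return cluster_to_class, labels


if __name__ == "__main__":
    feats = [[4, 1, 0], [5, 0, 0], [1, 4, 0], [0, 5, 0]]
    print(map_clusters_to_classes(feats, [1, 1, 0, 0], K=2))
```

Note that nothing in this mapping prevents two clusters from selecting the same class; how such collisions are resolved is not covered by the sketch.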

### 4.2 Experimental results

We first discuss the improvement brought by the similarity matrix. We build a base model that directly concatenates the two representations in the merge step; we then introduce the similarity matrix into the model and compare the accuracy of the two models [39, 40].

Figure 6 shows the comparison results of the two models. Across the different configurations, the model using the similarity matrix performs better than the base model: its accuracy is always higher than 62%, and even higher than 70% when the batch size changes. In contrast, the accuracy of the model using concatenation is lower than 62%. This demonstrates the effectiveness of the similarity matrix and shows that simply concatenating the two representations to fuse the features cannot represent the data well in our case.

We next investigate the effectiveness of our proposed method for label aggregation. Figure 7 shows the comparison results. In our proposed method, we obtain the inferred labels after the model is trained and the crowdsourcing layer is removed.

As Fig. 7 shows, our method outperforms the other five inference methods, with an accuracy of 93.6%. The accuracy improves because we not only use the noisy labels to infer the ground truth but also take the features of the data into consideration. MV remains a robust method: its accuracy is 84.4%, the same as that of Opt-D&S and DS. GLAD also performs well, with an accuracy of 90.9%.

In our third experiment, we show the prediction ability of our method. For comparison, we use MV, GTIC, Opt-D&S, DS, and GLAD, respectively, to infer the integrated labels of the training data, and then use these labels to train a base deep learning model without the crowdsourcing layer. We also train the model on the ground truth [41, 42]. The comparison results are shown in Fig. 8.

As shown in Fig. 8, the accuracy of our method is 73.9%, which is higher than that of the other models. The model trained on the ground truth also performs well, with an accuracy of 71.2%. The results demonstrate the effectiveness of using crowd information.