Problem definition
First, we formally define two concepts, namely neighbor users and neighbor web services.
Definition 1
(Neighbor Users) Given a user u, its neighbor users N(u) are defined as the users that have QoS experiences similar to those of user u on a set of commonly invoked web services.
Definition 2
(Neighbor Web Services) Given a web service i, its neighbor web services N(i) are defined as the web services that deliver QoS experiences similar to those of service i to a group of users.
Given a user u, a web service i, a set of users U={u1,u2,⋯,um}, and a set of web services I={i1,i2,⋯,in}, the method proposed in this paper mainly consists of two major operations: (1) to identify N(u) from U and N(i) from I, and (2) to predict the quality of web service i for user u based on N(u) and N(i).
Neighbor selection
Given a user-service matrix R with m users and n services, each element Rij in R records the value of a client-side QoS attribute of service j observed by user i; Rij is empty if user i has not invoked service j. The similarity between users can be calculated from the available QoS values in R, and Pearson’s correlation coefficient (PCC) and vector space similarity (VSS) are the two most common similarity measures for this purpose. As reported in [29] and [30], PCC usually achieves higher accuracy than VSS because it accounts for differences in users’ value scales, which is why we adopt PCC as the similarity measure. Based on the QoS values that users u and v observed on their commonly invoked web services, their similarity is calculated with PCC as in Eq. (1):
$$ \text{PCC}(u,v)= \frac{\sum_{j\in J}(R_{u,j}- \overline{R}_{u})(R_{v,j}-\overline{R}_{v})}{\sqrt{\sum_{j\in J}(R_{u,j}- \overline{R}_{u})^{2}}\sqrt{\sum_{j\in J}(R_{v,j}-\overline{R}_{v})^{2}}} $$
(1)
where J represents the set of services invoked by both users u and v, and Ru,j is the QoS value of service j observed by user u. \(\overline {R}_{u}\) and \(\overline {R}_{v}\) are the average QoS values of the services evaluated by users u and v, respectively. PCC(u,v) ranges over [−1,1], and a larger PCC value indicates greater similarity between the two users.
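As a concrete illustration, the following NumPy sketch computes Eq. (1) over two rows of R. The function name and the NaN encoding of missing invocations are our own conventions for this sketch, not part of the paper.

```python
import numpy as np

def pcc_users(R, u, v):
    # Illustrative sketch of Eq. (1); R is the m x n user-service QoS
    # matrix with np.nan marking services a user has not invoked.
    mask = ~np.isnan(R[u]) & ~np.isnan(R[v])   # J: co-invoked services
    ru, rv = R[u, mask], R[v, mask]
    if ru.size == 0:
        return 0.0                             # no co-invocations: treat as dissimilar
    ru_bar, rv_bar = ru.mean(), rv.mean()      # averages over the co-invoked set J
    num = np.sum((ru - ru_bar) * (rv - rv_bar))
    den = np.sqrt(np.sum((ru - ru_bar) ** 2)) * np.sqrt(np.sum((rv - rv_bar) ** 2))
    return num / den if den != 0 else 0.0
```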
By calculating the similarity between the current user and all other users, a set of Top-k similar users can be identified from the PCC values. However, a user may have only a limited number of truly similar users. The plain Top-k algorithm ignores this and may still include dissimilar users with negative PCC values, which greatly harms prediction accuracy. In our method, we therefore exclude dissimilar users, i.e., those with negative PCC values. The set of similar users of user u is then identified by the following equation:
$$ N(u)=\{v | v\in \text{Top}-k(u), \text{PCC}(u,v)>0, u \neq v\} $$
(2)
where Top-k(u) is the set of Top-k users most similar to the current user u, and PCC(u,v) is the similarity between users u and v, calculated by Eq. (1). Note that the Top-k relationship is asymmetric: the fact that user v is a Top-k neighbor of user u does not imply that user u is also a Top-k neighbor of user v.
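Eq. (2) then amounts to ranking candidates by PCC and truncating; a minimal sketch, reusing pcc_users from above:

```python
def top_k_neighbors(R, u, k):
    # Eq. (2): rank all other users by PCC and keep at most k with PCC > 0.
    sims = [(v, pcc_users(R, u, v)) for v in range(R.shape[0]) if v != u]
    sims.sort(key=lambda t: t[1], reverse=True)
    return [v for v, s in sims[:k] if s > 0]
```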
Similarly, based on the QoS values observed by the users who invoked both services, the similarity between services i and j can be calculated with PCC, as in Eq. (3):
$$ \text{PCC}(i,j)= \frac{\sum_{v\in V}(R_{v,i}- \overline{R}_{i})(R_{v,j}-\overline{R}_{j})}{\sqrt{\sum_{v\in V}(R_{v,i}- \overline{R}_{i})^{2}}\sqrt{\sum_{v\in V}(R_{v,j}-\overline{R}_{j})^{2}}} $$
(3)
where V represents the set of users who invoked both services i and j, and Rv,j is the QoS value of service j observed by user v. \(\overline {R}_{i}\) and \(\overline {R}_{j}\) are the average QoS values of services i and j, respectively.
After calculating the similarity between the current service and all other services, a set of Top-k similar services can be identified from the PCC values. The set of similar services of service i is identified by the following equation:
$$ N(i)=\{j | j\in \text{Top}-k(i), \text{PCC}(i,j)>0, i \neq j\} $$
(4)
where Top-k(i) represents the set of Top-k services most similar to the current service i, and PCC(i,j) is the similarity between services i and j, calculated by Eq. (3). As before, the Top-k relationship is asymmetric.
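Since Eqs. (3) and (4) are Eqs. (1) and (2) with the roles of users and services swapped, the service side can reuse the same routines on the transposed matrix; a brief sketch:

```python
def top_k_neighbor_services(R, i, k):
    # Columns of R become rows of R.T, so service similarity (Eq. (3)) and
    # neighbor selection (Eq. (4)) reduce to the user-side routines on R.T.
    return top_k_neighbors(R.T, i, k)
```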
NDL model
Figure 1 shows the overall framework of the method in this paper, and it mainly has the following three functional modules:
1) User information learning module: The user QoS vector and the user neighbor QoS matrix are generated from the historical QoS records of the target user and of its Top-k neighbor users, and are fed into an MLP and a CNN, respectively, for nonlinear learning, yielding the user feature vector and the user neighbor feature vector. These two vectors are then combined and passed through a multi-layer perceptron (MLP) for further nonlinear learning, producing the vector p.
2) Service information learning module: The service QoS vector and the service neighbor QoS matrix are generated from the historical QoS records of the target service and of its Top-k neighbor services, and are fed into an MLP and a CNN, respectively, for nonlinear learning, yielding the service feature vector and the service neighbor feature vector. These two vectors are then combined and passed through an MLP for further nonlinear learning, producing the vector q.
3) Prediction module: The predicted QoS value is obtained through the inner product of the vector p obtained by the user information learning module and the vector q obtained by the service information learning module.
In the following sections, we will detail the implementation procedure of the above three functional modules.
User information learning module
Here, we assume that the target user is u. Following Section 3.2, the neighbors N(u) of user u are obtained from the user-service matrix R. Then, the QoS vector Vu of user u and the QoS matrix Mu of the Top-k neighbors of user u are extracted from R. Vu and Mu are the inputs of the MLP and the CNN, respectively, as shown on the left side of Fig. 1. More formally, the forward propagation process is defined as follows:
$$ x_{u}=f_{n}\left(W_{n}^{T}\left(f_{n-1}\left(\cdots f_{1}\left(W_{1}^{T}V_{u}+b_{1}\right)\cdots\right)+b_{n-1}\right)+b_{n}\right) $$
(6)
$$ {a^{u}_{n}}={f^{u}_{n}}\left(f_{n-1}\left(\cdots f_{1}\left(\text{convolution+maxpooling}(M_{u})\right)\right)\right) $$
(7)
$$ v_{u}=\text{flatten}\left({a_{n}^{u}}\right) $$
(8)
It can be seen from the above formulas that the user QoS vector Vu is fed into the MLP for nonlinear learning to obtain the vector xu, and the user neighbor QoS matrix Mu is fed into the CNN for nonlinear learning to obtain the vector vu, where Wk, bk, and fk represent the weight matrix, the bias vector, and the activation function of the kth perceptron layer, respectively. \(\text{convolution+maxpooling}\) denotes the convolution and max-pooling operations, \({f_{n}^{u}}\) is the activation function, and the flatten function flattens its input tensor into a one-dimensional vector. For the activation functions of the MLP and the CNN, the Sigmoid, hyperbolic tangent (Tanh), or rectifier (ReLU) functions can be freely selected.
The Sigmoid function restricts each neuron's output to (0,1), which may limit the performance of our model: it is well known that learning stalls when a neuron's output saturates near 0 or 1. Moreover, this paper aims to predict the QoS values of web services, which is essentially a regression problem, while the Sigmoid function is better suited to binary classification. The Tanh function, although a better and widely used choice, only partially relieves this problem, since it can be regarded as a rescaled version of Sigmoid (tanh(x/2)=2σ(x)−1) and, like Sigmoid, is better suited to binary classification. We therefore choose the ReLU function, which is non-saturating; in addition, it encourages sparse activations, is well suited to sparse data, and makes the model less prone to overfitting.
Then, xu and vu are combined into the input of the MLP. More formally, the forward propagation process is defined as follows:
$$ {x= \left[ \begin{array}{c} x_{u}\\ v_{u} \end{array} \right ]} $$
(9)
$$ a_{n}=f_{n}\left(W_{n}^{T}\left(f_{n-1}\left(\cdots f_{1}\left(W_{1}^{T}x+b_{1}\right)\cdots\right)+b_{n-1}\right)+b_{n}\right) $$
(10)
where Wk, bk, and fk represent the weight matrix, the bias vector, and the activation function of the kth perceptron layer, respectively. We use the ReLU function as the activation function here. The output an of this fusion MLP is the vector p.
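To make the data flow of Eqs. (6)–(10) concrete, the following PyTorch sketch shows one plausible realization of this module; the layer widths, channel counts, and kernel sizes are illustrative assumptions, not values fixed by the paper.

```python
import torch
import torch.nn as nn

class QoSBranch(nn.Module):
    """One plausible realization of the user information learning module:
    an MLP over the user QoS vector V_u (Eq. (6)), a CNN with max-pooling
    over the neighbor QoS matrix M_u (Eqs. (7)-(8)), and a fusion MLP over
    their concatenation (Eqs. (9)-(10)). Widths, channel counts, and kernel
    sizes are illustrative assumptions, not values from the paper."""

    def __init__(self, in_dim, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(                    # V_u -> x_u, Eq. (6)
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, dim), nn.ReLU(),
        )
        self.cnn = nn.Sequential(                    # M_u -> a_n^u, Eq. (7)
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool2d((4, 4)),            # fixed-size output regardless of k
        )
        self.fusion = nn.Sequential(                 # [x_u; v_u] -> p, Eqs. (9)-(10)
            nn.Linear(dim + 16 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, dim), nn.ReLU(),
        )

    def forward(self, V, M):
        x = self.mlp(V)                              # Eq. (6)
        a = self.cnn(M.unsqueeze(1))                 # Eq. (7): add a channel axis
        v = torch.flatten(a, start_dim=1)            # Eq. (8)
        return self.fusion(torch.cat([x, v], dim=1)) # Eqs. (9)-(10)
```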
Service information learning module
Similar to Section 3.3.1, here we assume that the target service is i. Following Section 3.2, the neighbors N(i) of service i are obtained from the user-service matrix R. Then, the QoS vector Vi of service i and the QoS matrix Mi of the Top-k neighbors of service i are extracted from R. Vi and Mi are the inputs of the MLP and the CNN, respectively, as shown on the right side of Fig. 1. More formally, the forward propagation process is defined as follows:
$$ x_{i}=f_{n}\left(W_{n}^{T}\left(f_{n-1}\left(\cdots f_{1}\left(W_{1}^{T}V_{i}+b_{1}\right)\cdots\right)+b_{n-1}\right)+b_{n}\right) $$
(13)
$$ {a^{i}_{n}}={f^{i}_{n}}\left(f_{n-1}\left(\cdots f_{1}\left(\text{convolution+maxpooling}(M_{i})\right)\right)\right) $$
(14)
$$ v_{i}=\text{flatten}\left({a_{n}^{i}}\right) $$
(15)
It can be seen from the above formulas that the service QoS vector Vi is fed into the MLP for nonlinear learning to obtain the vector xi, and the service neighbor QoS matrix Mi is fed into the CNN for nonlinear learning to obtain the vector vi, where Wk, bk, and fk represent the weight matrix, the bias vector, and the activation function of the kth perceptron layer, respectively. As before, \(\text{convolution+maxpooling}\) denotes the convolution and max-pooling operations, \({f_{n}^{i}}\) is the activation function, and the flatten function flattens its input tensor into a one-dimensional vector. We use the ReLU function as the activation function of both the MLP and the CNN.
Then, we combine xi and vi into the input of the MLP. More formally, the forward propagation process is defined as follows:
$$ {x= \left[ \begin{array}{c} x_{i}\\ v_{i} \end{array} \right ]} $$
(16)
$$ a_{n}=f_{n}\left(W_{n}^{T}\left(f_{n-1}\left(\cdots f_{1}\left(W_{1}^{T}x+b_{1}\right)\cdots\right)+b_{n-1}\right)+b_{n}\right) $$
(17)
where Wk, bk, and fk represent the weight matrix, the bias vector, and the activation function of the kth perceptron layer, respectively. We use the ReLU function as the activation function here. The output an of this fusion MLP is the vector q.
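Because the service branch mirrors the user branch, the same class from the sketch above can be instantiated a second time with independent weights; only the input length changes (Vu has one entry per service, Vi one per user). A usage sketch with placeholder sizes:

```python
m, n = 100, 200                          # placeholder counts: m users, n services
user_branch = QoSBranch(in_dim=n)        # consumes V_u (length n) and M_u (k x n)
service_branch = QoSBranch(in_dim=m)     # consumes V_i (length m) and M_i (k x m)
```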
As for the design of the network structure, a common solution is to follow the tower pattern, where the bottom layer is the widest and each successive layer has fewer neurons. The premise is that higher layers with fewer hidden units can learn more abstract representations of the data. We empirically adopt the tower structure, halving the layer size for each successive higher layer.
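A tower-shaped MLP of this kind can be generated mechanically; the sketch below halves the width at each layer, with the depth and starting width as assumed parameters.

```python
def tower_mlp(in_dim, depth=3):
    # Tower pattern: each successive layer has half the neurons of the one below.
    layers, width = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(width, max(width // 2, 1)), nn.ReLU()]
        width = max(width // 2, 1)
    return nn.Sequential(*layers)
```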
Prediction module
The vector p and the vector q are obtained according to Sections 3.3.1 and 3.3.2, and the final predicted QoS value is obtained as their inner product:

$$ \hat{R}_{u,i}=p^{T}q $$
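Continuing the sketch above, the prediction is a single inner product per (user, service) pair:

```python
p = user_branch(V_u, M_u)        # (batch, dim) user representation
q = service_branch(V_i, M_i)     # (batch, dim) service representation
r_hat = (p * q).sum(dim=1)       # inner product p^T q -> predicted QoS value
```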
Model learning
Another important problem of the QoS prediction model is to define an appropriate objective function for model optimization based on observed data and unobserved feedback.
The general objective function is as follows:
$$ L=\sum_{y \in Y}l(y,p(u,i)) $$
(20)
where l(·) represents the loss function, y is a nonzero (observed) element of the training user-service matrix R, and p(u,i) is the QoS value predicted for user u and service i.
For recommendation systems, the loss function is the most important part of the objective function. Many existing models adopt the squared loss:
$$ L=\sum_{y \in Y}\left(y-p(u,i)\right)^{2} $$
(21)
To train the NDL model from scratch, we use adaptive moment estimation (Adam), which adapts the learning rate of each parameter, performing smaller updates for frequently updated parameters and larger updates for infrequently updated ones. Adam converges faster than plain SGD and alleviates the difficulty of tuning the learning rate.
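A minimal training sketch under the same assumptions as the code above; here `loader` is assumed to yield batches assembled from the nonzero entries of R, and the learning rate and epoch count are placeholders:

```python
params = list(user_branch.parameters()) + list(service_branch.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)    # Adam adapts per-parameter step sizes

for epoch in range(50):
    for V_u, M_u, V_i, M_i, y in loader:         # observed (user, service, QoS) samples
        p = user_branch(V_u, M_u)
        q = service_branch(V_i, M_i)
        y_hat = (p * q).sum(dim=1)
        loss = ((y - y_hat) ** 2).mean()         # squared loss of Eq. (21)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```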
The total implementation procedure of NDL is presented in Algorithm 1.