Overview
This paper proposes a recommendation model based on deep learning that can process multi-source heterogeneous data: scores, reviews, and social information.
For scores, the traditional matrix decomposition method suffers from data sparsity and low accuracy. This paper adopts a neural network to transform scores into user/item representations. For reviews, the traditional topic model cannot accurately represent the characteristics of the text. This paper utilizes the Distributed Bag of Words version of Paragraph Vector (PV-DBOW) algorithm to learn the feature representations of reviews. PV-DBOW assumes that the words in a document are independent and unordered, and uses the document vector representation to predict the words with higher accuracy. For social network data, this paper takes into account the impact of users’ friends on users’ selections, introduces a user trust model, and integrates the social relationship information into the pairwise learning method, which improves the accuracy of the recommendation results.
Recommendation process
Due to the heterogeneity of different data, the traditional hybrid recommendation model usually fuses data at the algorithm level [30], i.e., it makes the final recommendation by combining the results of algorithms based on different data sources. With the development of deep learning, multi-source heterogeneous data such as scores and reviews can be accurately represented through deep networks, which makes it possible to fully fuse multi-source heterogeneous data at the data source level [30]. The multi-source heterogeneous data recommendation model proposed in this paper combines ratings, reviews, and social network information to make more accurate recommendations. It has the advantages of high accuracy and strong scalability.
The recommendation process is shown in Fig. 1. The score is a user’s overall evaluation of an item and reflects the user’s satisfaction with the item. A multi-layer fully connected neural network is used to directly learn the feature vector representations of the user and the item. Reviews reflect users’ evaluations of items in detail and contain rich information about users and items. The PV-DBOW algorithm is used to learn the feature representation of each review paragraph and thus obtain the feature vector representations of the user and the item. The social network reflects friendships between users. The preferences of a user’s friends will indirectly affect the user’s choices, so the social network can be used to improve the prediction accuracy of a user’s potential purchase behavior. The Bayesian Personalized Ranking (BPR) model is used to rank items based on the nonlinear characteristics of users and items, which further improves the accuracy of the recommendation results.
Recommendation model
The recommendation model for multi-source heterogeneous data consists of four steps. Firstly, construct the user and item triplet optimization model. Secondly, extract social relations from the social network, and fuse the social relation data, review data, and scores together. Thirdly, obtain the feature representations of users and items through deep learning. Finally, acquire a top-N recommendation list from the feature representations of users and items. The model is described in detail below.
User trust model
Social networks reflect the friendships between users. In real life, users are more likely to choose items that their friends buy or like. Thus, a user’s behavior and preferences can be predicted more precisely based on the user’s direct and indirect friend relationships.
The trust-based recommendation model assumes that users have preferences similar to those of their trusted users. In general, direct and indirect friends affect a user’s decisions to different degrees, and indirect friends have less impact on the user’s decisions than direct friends. According to the six degrees of separation concept [35, 36], the similarity between users is defined as in (1):
$$ s(a, b)=\left\{\begin{array}{cc} 0.2 \times\left(6-l_{a b}\right) & \text{if}~ l_{a b}<6 \\ {0.1} & \text{otherwise} \end{array}\right. $$
(1)
Here, a and b represent any two users; lab represents the distance between user a and user b, where a direct friend has distance 1 and indirect friends have distance 2, 3, 4, ⋯; and s(a,b) represents the similarity between the two users. Figure 2 shows the distance values between users.
The similarity between users can thus be calculated from the distances between them. We call a direct friend a first-degree friend, an indirect friend with distance 2 a second-degree friend, and so forth. We consider indirect friends up to distance 6, so we name the model the 6 Degree Model. Algorithm 1 gives the model’s pseudocode and shows how to calculate the similarities between users; a compact sketch of the same idea is given below.
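As an illustration only (not the paper’s Algorithm 1 verbatim), the following Python sketch computes friend distances by breadth-first search over a hypothetical friendship dictionary and converts them into similarities with (1); the `friends` input format is an assumption.

```python
# Minimal sketch of the 6 Degree Model: BFS over a friendship graph,
# then similarities from Eq. (1). The `friends` dict format is assumed.
from collections import deque

def user_similarities(friends, source, max_degree=6):
    """friends: dict mapping each user to a set of direct friends.
    Returns {user: s(source, user)} for users within `max_degree` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        a = queue.popleft()
        if dist[a] >= max_degree:          # do not expand beyond 6 degrees
            continue
        for b in friends.get(a, ()):
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    # Eq. (1): s = 0.2 * (6 - l_ab) if l_ab < 6, else 0.1
    return {b: (0.2 * (6 - l) if l < 6 else 0.1)
            for b, l in dist.items() if b != source}

# Example: a direct friend gets similarity 1.0, a friend-of-a-friend 0.8
print(user_similarities({"u1": {"u2"}, "u2": {"u1", "u3"}}, "u1"))
# {'u2': 1.0, 'u3': 0.8}
```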
After obtaining the similarities between users from the social network, the influence of different friends on a user’s selections can be obtained; this influence can then be fed, together with the other types of data, into a unified joint representation learning framework.
Improved BPR model
BPR is a pairwise learning model [37]. A triplet (u,i,j) is constructed based on the user’s preferences. A triplet can represent three cases:
User u purchases item i but does not purchase item j. It means user u prefers item i over item j.
User u purchases neither item i nor item j. It means the user’s preferences cannot be determined.
User u purchases both item i and item j. It means the user’s preferences cannot be distinguished.
Compared with pointwise learning, the BPR model has two advantages. First, it considers both the items purchased by the user and the items not purchased during learning, and the items not purchased are exactly the ones to be ranked in the future. Second, the model can achieve good results even when only a small amount of data is selected for recommendation.
BPR is a ranking algorithm based on matrix decomposition. Compared with algorithms such as FunkSVD, it optimizes the ranking of each user’s own item preferences rather than a global score, so its results are more accurate. Figure 3 shows the triplet generation process. Plus (+) means user u prefers item i over item j. Minus (−) means user u prefers item j over item i. Question mark (?) means that the user’s preference cannot be determined.
However, the triplets constructed by the standard BPR model are randomly sampled [38], and the effect of social relationships on the sampling process is not considered. In real life, users prefer items that their friends have selected, so the similarity between users and their friends can be applied to the sampling of the BPR model. By considering friends’ influences on the user and adding social relation constraints to the sampling process, the triplets reflect the user’s preferences more precisely, and thereby the recommendation accuracy can be improved.
According to the user’s purchase records and the friendships reflected by the social network, for each user u, an item purchased by the user is denoted i, an item that the user has not purchased is denoted j, and an item purchased by the user’s direct or indirect friends is denoted p. The set of all items in the system is denoted D, the set of items purchased by user u is denoted Du, and the set of items purchased by the user’s direct and indirect friends is denoted Dp. The items the user most strongly prefers are those in Du, followed by those in Dp∖Du, because the user is likely to purchase items bought by direct or indirect friends but not yet by the user, given the friends’ influence on the user’s preferences. Finally, the items the user is least likely to purchase are those in D∖(Du∪Dp). Constructing user–item triplets as a training set based on this social network information, the training set T can be expressed as follows, where a user–item triplet (u,i,j) indicates that user u has a greater preference for item i than for item j: item i is purchased by the user or by the user’s direct or indirect friends, and item j is an item purchased by neither the user nor his/her friends. In this way, user–item triplets based on social relations are constructed (a sampling sketch is given after (2)).
$$ {T}:=\left\{({u}, {i}, j) | i \in\left({D}_{u} \cup {D}_{p}\right), j \in {D} \backslash\left({D}_{u} \cup {D}_{p}\right)\right\} $$
(2)
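As a hedged illustration of how the training set in (2) could be sampled, the sketch below draws triplets (u,i,j) with i taken from Du∪Dp and j from D∖(Du∪Dp); the input dictionaries and the uniform random choices are assumptions, not the paper’s exact procedure.

```python
# Sketch of social-aware triplet sampling for the improved BPR model (Eq. (2)).
# `all_items` (D), `purchased` (D_u), and `friend_purchased` (D_p) are assumed inputs.
import random

def sample_triplets(all_items, purchased, friend_purchased, n_samples, seed=0):
    rng = random.Random(seed)
    users = list(purchased)
    triplets = []
    for _ in range(n_samples):
        u = rng.choice(users)
        positives = purchased[u] | friend_purchased.get(u, set())   # D_u ∪ D_p
        negatives = all_items - positives                           # D \ (D_u ∪ D_p)
        if not positives or not negatives:
            continue
        i = rng.choice(sorted(positives))    # preferred item (user or friends bought it)
        j = rng.choice(sorted(negatives))    # item assumed not preferred
        triplets.append((u, i, j))
    return triplets
```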
According to the Bayesian formula, the following posterior probability must be maximized to find the list of recommended items. In (3), (u,i,j) represents a constructed triplet reflecting the user’s preference, and θ represents the parameters of the model. The model parameters are adjusted so that the constructed triplets (u,i,j) have the highest probability of occurrence.
$$ {p}({\theta} |({u}, {i}, {j})) \propto {p}(({u}, {i}, {j}) | {\theta}) {p}({\theta}) $$
(3)
To simplify the aforementioned formula, we assume that the item pairs (i,j) are independent of each other. Then, we have:
$$ \begin{aligned} \prod p\left(({u}, {i}, {j}) | {\theta}\right)=&\prod_{(u, i, j) \in T}{p}\left({i}>{j} | {\theta}\right)\\ &\cdot\prod_{(u, i, j) \notin T} \left(1-p({i}>{j} | {\theta})\right) \end{aligned} $$
(4)
According to the totality and antisymmetry properties of pairwise ranking, the above formula can be further simplified to:
$$ \prod p\left(({u}, {i}, {j}) | {\theta}\right)=\prod_{(u, i, j) \in T}{p}\left({i}>{j} | {\theta}\right) $$
(5)
To obtain the final ranking, a model needs to be constructed to calculate the recommendation probability of each item. The sigmoid function is used to model the probability that the user prefers item i over item j, as given in (6). Here, xuij(θ) is an arbitrary parametric model that describes the latent relationship between the user and the items; in other words, any model that describes the relationship between the user and the item can be used.
$$ {p}(({u}, {i}, {j}) | {\theta}):={\sigma}\left({x}_{u i j}({\theta})\right) $$
(6)
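For concreteness, a tiny sketch of (6) with the common matrix-factorization choice xuij = uTi − uTj is shown below; this particular choice of xuij is one admissible parametric model, not the only one the formulation allows.

```python
# Sketch of Eq. (6): sigmoid over a preference margin. The matrix-factorization
# form x_uij = u^T i - u^T j is just one possible parametric model.
import numpy as np

def bpr_prob(u_vec, i_vec, j_vec):
    x_uij = u_vec @ i_vec - u_vec @ j_vec      # preference of i over j for user u
    return 1.0 / (1.0 + np.exp(-x_uij))        # sigma(x_uij) = p((u, i, j) | theta)
```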
The improved BPR model is used to directly optimize the recommendation results based on the item recommendation ranking.
PV-DBOW model
The PV-DBOW model is used to learn from the review data and obtain the feature representations of the corresponding users and items. As Fig. 4 shows, the model samples a text window, then samples a random word from that window and forms a classification task given the paragraph vector [39]. PV-DBOW assumes that the words in a sentence are independent of each other and requires only a small amount of data to be stored.
In our model, paragraph vectors are used to predict words. Each review is mapped into a semantic space and then trained to predict its words. The probability that word w appears in review dum can be calculated with the softmax function:
$$ P\left(w | d_{u m}\right)=\frac{e^{w^{T} d_{u m}}}{\sum_{w^{\prime} \in V} e^{w^{\prime T} d_{u m}}} $$
(7)
where dum denotes the review given by user u to item m, w denotes a word, and V denotes the vocabulary. To reduce the cost of computation and improve efficiency, the negative sampling strategy is adopted in this model. Therefore, the following objective function can be constructed:
$$ \begin{aligned} &{L}_{1}=\sum_{w \in V} \sum_{(u, m) \in R} {f}_{w, d_{u m}} \log \sigma\left({w}^{T} {d}_{u m}\right)+\\ &\sum_{w \in V} \sum_{(u, m) \in R} f_{w, d_{u m}}\left(t \cdot E_{w_{N} \sim P_{V}} \log \sigma\left(-w_{N}^{T} d_{u m}\right)\right) \end{aligned} $$
(8)
where \(f_{w,d_{um} }\) represents the frequency of the word–review pair, \(E_{w_{N}\sim P_{V} }\) represents the expectation over the noise distribution PV, and t represents the number of negative samples. According to (8), the review representation dum can be obtained, and dum corresponds to user u and item m. The feature representations of u and m can thus be obtained from the review data.
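As a practical illustration (not the paper’s exact training code), the gensim library’s Doc2Vec with dm=0 implements PV-DBOW with negative sampling; the corpus layout, tags, and hyperparameters below are assumptions.

```python
# Sketch of learning review vectors d_um with PV-DBOW (gensim 4.x API assumed).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus: one review document per (user, item) pair.
reviews = {("u1", "m1"): "great battery life and bright screen",
           ("u1", "m2"): "poor build quality overall"}
docs = [TaggedDocument(words=text.split(), tags=[f"{u}_{m}"])
        for (u, m), text in reviews.items()]

model = Doc2Vec(docs, dm=0, vector_size=64, window=5,
                negative=5, min_count=1, epochs=40)   # dm=0 -> PV-DBOW; negative sampling as in (8)
d_u1_m1 = model.dv["u1_m1"]   # review representation for user u1 and item m1
```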
Fully connected neural network
Because neural networks can quickly find good solutions, a fully connected neural network is used to process the rating data [30]. The representations of the user and the item can be obtained from the score data. In this experiment, two fully connected layers are used to fit the nonlinear correlation:
$$ \hat{r}_{u m}=\phi\left(U_{2} \phi\left(U_{1}\left(r_{u} \odot r_{m}\right)+c_{1}\right)+c_{2}\right) $$
(9)
where ϕ(·) is the ELU activation function and U1, U2, c1, c2 are the parameters to be learned. Then, the following objective function can be obtained:
$$ L_{2}=\sum\nolimits_{(u, m) \in R}\left(\hat{r}_{u m}-r_{u m}\right)^{2} $$
(10)
The goal of the scoring model is to make the difference between the predicted score and the true score as small as possible.
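A minimal PyTorch sketch of (9) and (10) is given below; the embedding dimensions, layer sizes, and module names are assumptions chosen for illustration.

```python
# Sketch of the two-layer scoring network in Eq. (9) and the squared-error
# objective in Eq. (10). Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    def __init__(self, n_users, n_items, dim=64, hidden=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)   # r_u
        self.item_emb = nn.Embedding(n_items, dim)   # r_m
        self.fc1 = nn.Linear(dim, hidden)            # U_1, c_1
        self.fc2 = nn.Linear(hidden, 1)              # U_2, c_2
        self.act = nn.ELU()                          # phi(.)

    def forward(self, u, m):
        x = self.user_emb(u) * self.item_emb(m)      # r_u ⊙ r_m (element-wise product)
        return self.act(self.fc2(self.act(self.fc1(x)))).squeeze(-1)  # predicted rating

# L_2 in Eq. (10): sum of squared errors between predicted and true ratings
def score_loss(model, u, m, r_true):
    return ((model(u, m) - r_true) ** 2).sum()
```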
BRS cS model
To fuse multi-source heterogeneous data for recommendation, we propose a model named BRS cS (an acronym for BPR-Review-Score-Social). In this model, the improved BPR model is used to optimize the ranking, the user trust model is used to introduce social relationships into the rating and review data, the PV-DBOW model is used to process the review data, and the fully connected neural network is used to process the rating data. Finally, an integrated objective function is optimized.
The unified objective function for model optimization is given as (11). u represents the fused feature representation of the user, and i and j represent the fused feature representations of the items. According to the previous definition, user u has a greater preference for item i than for item j. g(·) is a function that combines the user and item features; this paper defines g(·) through a sigmoid, g(u,i,j)=σ(uTi−uTj), to measure the user’s different preferences for different items. L1 is the objective function of the review data, and L2 is the objective function of the score data. When adding a new data source to the recommendation system, we only need to add the corresponding objective function to (11) instead of redesigning the model, so the proposed model has good scalability.
$$ \begin{aligned} \max_{W, \theta} L &= \sum_{u, i, j} g(u, i, j)+\lambda_{1} L_{1}-\lambda_{2} L_{2}\\ &= \sum_{u, i, j} \sigma\left(u^{T} i-u^{T} j\right)\\ &\quad +\lambda_{1}\Big(\sum_{w \in V} \sum_{(u, m) \in R} f_{w, d_{u m}} \log \sigma\left(w^{T} d_{u m}\right)\\ &\qquad +\sum_{w \in V} \sum_{(u, m) \in R} f_{w, d_{u m}}\left(t \cdot E_{w_{N} \sim P_{V}} \log \sigma\left(-w_{N}^{T} d_{u m}\right)\right)\Big)\\ &\quad -\lambda_{2} \sum_{(u, m) \in R}\left(\phi\left(U_{2}\, \phi\left(U_{1}\left(r_{u} \odot r_{m}\right)+c_{1}\right)+c_{2}\right)-r_{u m}\right)^{2} \end{aligned} $$
(11)
W={W1,W2} denotes the weight parameters of each model. In the review representation learning model, the weight parameter W1 differs across different users’ reviews and needs to be learned. In the score representation learning model, the features of users and items are obtained directly, so the weight parameter W2 can be set to 1 and does not need to be updated by the optimization objective. θ represents the other parameters to be learned, θ={θ1,θ2}={{w,dum},{U1,U2,c1,c2,ru,rm}}. λ is the penalty parameter of each model, and its value lies in the interval [0,1]. The objective function L2 of the score model is preceded by a negative sign because the objective function of the score model should be minimized, while the objective function of the overall model should be maximized. The stochastic gradient descent (SGD) method [40] can be used to optimize (11).
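To make the sign convention concrete, a hedged PyTorch sketch of one optimization step is shown below: the maximized objective (11) is negated so that SGD can minimize it. The variable names, λ values, and the placeholder review/score terms are assumptions.

```python
# Sketch of optimizing Eq. (11) with SGD: maximize g + λ1·L1 − λ2·L2 by
# minimizing its negative. L1_review and L2_score stand in for the objectives
# of Eq. (8) and Eq. (10), assumed to be computed elsewhere.
import torch

def joint_objective(u_vec, i_vec, j_vec, L1_review, L2_score,
                    lambda1=0.5, lambda2=0.5):
    g = torch.sigmoid((u_vec * i_vec).sum(-1) - (u_vec * j_vec).sum(-1)).sum()
    return g + lambda1 * L1_review - lambda2 * L2_score   # quantity to maximize

# One SGD step (parameters and the three terms are assumed to be defined):
# optimizer = torch.optim.SGD(params, lr=0.01)
# loss = -joint_objective(u_vec, i_vec, j_vec, L1_review, L2_score)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```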
In the end, a recommendation score can be obtained by multiplying the user feature representation and the item feature representation:

$$ s={u}^{T} {m} $$

(12)

The larger s is for an item, the more likely the user is to select that item. A user’s top-N recommendation list is obtained by sorting the scores from (12) in descending order.
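For illustration, a small numpy sketch of producing the top-N list from (12) is given below; the array layout and function name are assumptions.

```python
# Sketch of top-N recommendation from Eq. (12): dot products of the user's
# fused vector with every item's fused vector, sorted in descending order.
import numpy as np

def top_n(user_vec, item_vecs, item_ids, n=10):
    scores = item_vecs @ user_vec            # s = u^T m for every item m
    order = np.argsort(-scores)[:n]          # indices of the n largest scores
    return [item_ids[k] for k in order]
```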