The MGSSL algorithm usually performs classification through centralized processing, so it cannot be applied directly to semi-supervised large-scale multi-graph classification. To address this shortcoming, the MR-MGSSL algorithm, which combines the MapReduce framework with MGSSL, is proposed for semi-supervised large-scale multi-graph classification.
MR-MGSSL semi-supervised large-scale multi-graph classification algorithm
In semi-supervised large-scale multi-graph classification, MR-MGSSL is generally divided into the three steps shown below (Fig. 2).
Training data vectorization
The existing MGSSL algorithms cannot be applied directly to raw multi-graph data in semi-supervised multi-graph classification. The feature subgraphs must be selected first and the multi-graph data transformed into feature vectors; the MGSSL algorithms then learn rules from the transformed feature vectors. A model built on the feature subgraphs is then used for prediction.
On the basis of the MR-MGSSL algorithm, an algorithm is proposed in the paper to select the optimal feature subset.
At present, feature subsets are selected by scoring each candidate individually with a scoring function. Therefore, in the semi-supervised multi-graph classification problem, we first determine the score of each frequent subgraph and then select the N feature subgraphs with the largest scores.
In general, when selecting the feature subset, a frequent subgraph is chosen first and its score is calculated. Calculating the score of a subgraph rei requires MNy, MEy, rNy, and rEy. Within one set of text information, MNy and MEy are the same for every frequent subgraph rei, so for each subgraph only its occurrence in Ny and Et needs to be computed. The value of each feature subgraph is then calculated according to the formula \( Y\left({r}_{ei}\right)={r}_{Ny}^s{L}_{Ny}{r}_{Ny}+{r}_{Et}^s{L}_{Et}{r}_{Et} \). Finally, from the locally optimal feature subgraphs, the values of the feature subgraphs over all the text information are calculated and expressed as a vector.
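The scoring and top-N selection described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the superscript s in the formula is read as a transpose, the matrices L_Ny and L_Et are taken as given, and the function names (`quadratic_form`, `subgraph_score`, `top_n_subgraphs`) are hypothetical and do not come from the paper.

```python
def quadratic_form(r, L):
    """Return r^T L r for a vector r and a square matrix L (nested lists)."""
    n = len(r)
    return sum(r[i] * L[i][j] * r[j] for i in range(n) for j in range(n))


def subgraph_score(r_ny, L_ny, r_et, L_et):
    """Score of one frequent subgraph: the N_y term plus the E_t term.

    Assumption: the superscript s in the formula denotes a transpose.
    """
    return quadratic_form(r_ny, L_ny) + quadratic_form(r_et, L_et)


def top_n_subgraphs(candidates, L_ny, L_et, n):
    """Return the ids of the n highest-scoring subgraphs.

    candidates maps a subgraph id to its pair of vectors (r_ny, r_et).
    Ties are broken by subgraph id so the result is deterministic.
    """
    scored = {
        sid: subgraph_score(r_ny, L_ny, r_et, L_et)
        for sid, (r_ny, r_et) in candidates.items()
    }
    return sorted(scored, key=lambda sid: (-scored[sid], sid))[:n]
```

Because L_Ny and L_Et are shared by all candidates, only the two vectors change from one subgraph to the next, which is why the matrices can be computed once in advance.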
The matrices MNy and MEy and the values of the frequent feature subgraphs are pre-calculated.
Pre-calculation method
Calculate the matrices MNy and MEy, the ids of the text information, and the lists Bag-list and Gra-list of Et. Each multi-graph is represented as a record of the map function in the graph selection stage. For a labeled multi-graph, when the class label is positive, the output is < 1, 1 > and < 4, |graph| > (lines 2 to 3); when the class label is negative, the output is < 2, 1 > and < 5, |graph| > (lines 4 to 5). For an unlabeled multi-graph, the output is < 3, 1 > and < 6, |graph| > (line 6). Keys 1 to 8 jointly support the calculation of |Ny+|, |Ny‐|, |Nyv|, |Et+|, |Et‐|, |Etv|, Bag-list, and Gra-list. Then |Ny+|, |Ny‐|, |Nyv|, |Et+|, |Et‐|, |Etv|, Bag-list, and Gra-list are calculated from these key values in lines 12 to 14. Finally, MNy and MEy are calculated from these values.
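The key-value scheme of the pre-calculation can be illustrated with a small in-memory simulation of the map and reduce phases. This is a sketch rather than the paper's implementation: keys 1 to 6 follow the description above, while the assignment of key 7 to Bag-list and key 8 to Gra-list, the record layout (bag id, label, graph ids), and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Keys 1 to 6 follow the text; keys 7 and 8 are an assumed assignment.
BAG_KEY = {"+": 1, "-": 2, None: 3}    # number of multi-graphs per class
GRAPH_KEY = {"+": 4, "-": 5, None: 6}  # number of graphs per class
BAG_LIST_KEY, GRAPH_LIST_KEY = 7, 8    # hypothetical: Bag-list and Gra-list


def map_bag(bag_id, label, graph_ids):
    """Emit (key, value) pairs for one multi-graph.

    label is "+", "-", or None for an unlabeled multi-graph.
    """
    yield BAG_KEY[label], 1
    yield GRAPH_KEY[label], len(graph_ids)
    yield BAG_LIST_KEY, bag_id
    for gid in graph_ids:
        yield GRAPH_LIST_KEY, gid


def reduce_pairs(pairs):
    """Sum the values of the count keys and collect the list keys."""
    counts = defaultdict(int)
    lists = defaultdict(list)
    for key, value in pairs:
        if key in (BAG_LIST_KEY, GRAPH_LIST_KEY):
            lists[key].append(value)
        else:
            counts[key] += value
    return dict(counts), dict(lists)


def precompute(bags):
    """Run the map phase over all records, then a single reduce phase."""
    pairs = [kv for bag in bags for kv in map_bag(*bag)]
    return reduce_pairs(pairs)
```

In an actual MapReduce job the pairs would be shuffled by key across reducers; the single `reduce_pairs` call here only mimics that aggregation.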
The MR-MGSSL algorithm is used for the pre-calculation.
MR-MGSSL algorithm:
In the prediction method, the multi-graph and the super multi-graph are obtained first, and it is then determined whether the frequency of the frequent subgraph has already been calculated. If it has, the result is output directly; otherwise, the check is repeated until the frequency has been calculated and can be output. Finally, the calculated frequency is compared with its threshold, and the multi-graphs and super multi-graphs of the frequent subgraph are output.
Selection of the optimal feature subgraphs and calculation of their values: a feature subgraph is a frequent subgraph that occurs with the highest frequency in the text information. To select the frequent feature subgraphs, the frequency with which each subgraph occurs in the text information is calculated first, and the multi-graphs and super multi-graphs of each frequent subgraph are then determined from that frequency. In general, the text information is divided into blocks, and for each block it is checked whether the frequency of the subgraph has been determined. If it has, it is output; if not, the frequency is re-calculated until it is determined and then output. Finally, the frequency over all the text information is obtained from the output frequencies of the individual blocks, and the optimal feature subset for the whole text information is determined by comparison with the maximum and minimum thresholds.
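The block-wise frequency computation can be sketched as below. The sketch assumes that each graph has already been reduced to a set of hashable subgraph identifiers, so no subgraph mining or isomorphism test is performed; `local_counts`, `frequent_subgraphs`, `min_support`, and `max_support` are hypothetical names chosen for this illustration.

```python
from collections import Counter


def local_counts(block):
    """Count, within one block, how many graphs contain each subgraph.

    Each graph in the block is an iterable of subgraph identifiers;
    a subgraph is counted at most once per graph.
    """
    counts = Counter()
    for graph_subgraphs in block:
        counts.update(set(graph_subgraphs))
    return counts


def frequent_subgraphs(blocks, min_support, max_support=None):
    """Merge the per-block counts and keep subgraphs within the thresholds."""
    total = Counter()
    for block in blocks:
        total.update(local_counts(block))
    upper = float("inf") if max_support is None else max_support
    return {s: c for s, c in total.items() if min_support <= c <= upper}
```

Summing the per-block counters plays the role of the reduce step: the global frequency of a subgraph is the sum of its frequencies in the individual blocks.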
In general, the selection of the optimal feature subgraph mainly uses the MR-MGSSL algorithm.
MR-MGSSL algorithm
The optimal feature subgraphs are usually found by working from the local to the global. The basic idea is to output the frequent subgraphs of each block first, then obtain the feature subgraphs from the frequent subgraphs of each block, and finally obtain the optimal feature subgraphs of the whole text information. The specific calculation method is as follows.
Input: information of optimal characteristic subgraph \( H= list\left(u,{N}_{(u)},{NRE}_u^1\right),{N}_y=\left\{{NE}_1,\dots, {NE}_{NY}\right\} \) and Ey = {E1, …, ENY}
Output: Optimal characteristic subgraph H and NE, Feature vector set U based on H.
1. U = φ
2. When NE1 ∈ NGy, continue
3. Zero dimensional vector of H is represented with θ
4. uh ∈ H1, continue
5. When \( {NE}_1\in {YNE}_{uh}^1 \), continue
6. Set 1 as the weight of θ
7. U = U ∪ {θ};
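One possible reading of the steps above is a binary indicator vectorization: each graph starts from a zero vector with one component per feature subgraph, and a component is set to 1 when the graph contains that subgraph. The sketch below follows this reading under the assumption that containment has already been resolved into a set of subgraph identifiers per graph; the names `vectorize` and `vectorize_all` are hypothetical.

```python
def vectorize(graph_subgraphs, features):
    """Return a 0/1 vector with one component per feature subgraph.

    graph_subgraphs: identifiers of the subgraphs contained in one graph.
    features: ordered list of the selected feature subgraph identifiers.
    """
    present = set(graph_subgraphs)
    return [1 if f in present else 0 for f in features]


def vectorize_all(graphs, features):
    """Vectorize every graph; a list is used so order and duplicates survive."""
    return [vectorize(g, features) for g in graphs]
```

The pseudocode collects the vectors in a set U; a list is used here instead so that each vector stays aligned with the graph it was computed from.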
Vectorization of the test multi-graphs generally follows the steps below.
Input: test multi-graph Ny = {NE1, …, NEj}.
Output: the corresponding matrix of the test multi-graph.
1. US = φ;
2. When NEi ∈ NEs, continue
3. Set the corresponding vector of NEi as ui
4. ui = EU(HE, NE)
5. US = US ∪ {ui}.
Multi-graph vectorization is realized block by block from the vectors of the frequent subgraphs of each block: at the map end, the input and output described above are produced for the frequent feature subgraphs of each block; at the reduce end, Bag-list and Gra-list are obtained, all the subgraphs of the text information are collected, and the training multi-graphs are vectorized.