GRU basics
The Gated Recurrent Unit (GRU) is a type of recurrent neural network. It is a variant of the long short-term memory (LSTM) network, which in turn is based on the recurrent neural network (RNN), a well-established machine learning method with clear advantages in processing time series [24]. The RNN contains a signal feedback structure that correlates the output at time t with the information before time t, giving it dynamic features and a memory function.
Figure 4 shows the structure of the RNN. As can be seen from the figure: ① the RNN comprises an input layer, a hidden layer and an output layer, where the hidden layer contains the feedback structure; ② the output at time t is the joint result of the input at that moment and of earlier moments; ③ the RNN can effectively analyze and process short time series, but it cannot handle time series that are too long, otherwise vanishing or exploding gradients occur [25]. To solve this issue, [26] proposed the LSTM neural network, an improved RNN structure whose hidden layer is shown in Fig. 5. The LSTM neural network achieves controllable memory over time through the memory units (forget gate, input gate and output gate) in its hidden layer, resolving the RNN's inadequate long-term memory; however, its hidden-layer structure is complex and sample training takes a long time [27]. Based on the LSTM neural network, [28] proposed the GRU neural network, which uses reset and update gates instead of the LSTM's forget, input and output gates. Although the hidden-layer data flows of the LSTM and GRU networks are comparable, the GRU network has no dedicated storage unit, which makes sample training more efficient.
\(h_{{\text{t}}}\) is the value of the hidden layer at moment t and \(h_{{{\text{t}} - 1}}\) is its value at the previous moment; \(o_{{\text{t}}}\) is the value of the output layer at moment t and \(o_{{{\text{t}} - 1}}\) is its value at the previous moment; W is the weight matrix applied to the previous hidden-layer value, U is the weight matrix from the input layer to the hidden layer, and V is the weight matrix from the hidden layer to the output layer; each circle represents a neuron. For a regular RNN hidden layer, given the input value \(x_{{\text{t}}} \left( {t = 1,2, \ldots ,n} \right)\), the values of the hidden layer and the output layer at moment t are calculated by the following equations:
$$f = \tanh \left( x \right) = \frac{{e^{x} - e^{ - x} }}{{e^{x} + e^{ - x} }}$$
(1)
$$h_{{\text{t}}} = f\left( {U \cdot x_{{\text{t}}} + W \cdot h_{{{\text{t}} - 1}} } \right)$$
(2)
$$g = {\text{sigmoid}}\left( x \right) = \frac{1}{{1 + e^{ - x} }}$$
(3)
$$o_{{\text{t}}} = g\left( {V \cdot h_{{\text{t}}} } \right)$$
(4)
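For concreteness, the RNN update of Eqs. (1) to (4) can be sketched in a few lines of NumPy, as below; the layer sizes, the random weights and the toy input sequence are illustrative assumptions, not values from this paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V):
    """One RNN time step following Eqs. (1)-(4).

    x_t    : input vector at time t
    h_prev : hidden-layer value at time t-1
    U, W, V: input-to-hidden, hidden-to-hidden and hidden-to-output weights
    """
    h_t = np.tanh(U @ x_t + W @ h_prev)        # Eq. (2), with f = tanh from Eq. (1)
    o_t = 1.0 / (1.0 + np.exp(-(V @ h_t)))     # Eq. (4), with g = sigmoid from Eq. (3)
    return h_t, o_t

# Illustrative dimensions (assumed): 3 input features, 5 hidden units, 1 output
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(1, 5))
h = np.zeros(5)
for x in rng.normal(size=(4, 3)):              # a toy sequence of 4 time steps
    h, o = rnn_step(x, h, U, W, V)
```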
For a regular LSTM hidden layer, given the input value \(x_{{\text{t}}} \left( {t = 1,2, \ldots ,n} \right)\), the gate values, the cell state and the hidden-layer value at moment t are calculated by the following equations.
$$f_{{\text{t}}} = g\left( {W_{{\text{f}}} \cdot h_{{{\text{t}} - 1}} + U_{{\text{f}}} \cdot x_{{\text{t}}} } \right)$$
(5)
$$i_{{\text{t}}} = g\left( {W_{i} \cdot h_{{{\text{t}} - 1}} + U_{i} \cdot x_{{\text{t}}} } \right)$$
(6)
$$a_{{\text{t}}} = f\left( {W_{{\text{a}}} \cdot h_{{{\text{t}} - 1}} + U_{{\text{a}}} \cdot x_{{\text{t}}} } \right)$$
(7)
$$c_{{\text{t}}} = c_{{{\text{t}} - 1}} \odot f_{{\text{t}}} + i_{{\text{t}}} \odot a_{{\text{t}}}$$
(8)
$$o_{{\text{t}}} = g\left( {W_{{\text{O}}} \cdot h_{{{\text{t}} - 1}} + U_{{\text{O}}} \cdot x_{{\text{t}}} } \right)$$
(9)
$$h_{{\text{t}}} = o_{{\text{t}}} \odot f\left( {c_{{\text{t}}} } \right)$$
(10)
where \(f_{{\text{t}}}\), \(i_{{\text{t}}}\), \(a_{{\text{t}}}\) and \(c_{{\text{t}}}\) represent the forget gate, the input gate, the cell update and the cell state, respectively, and \(o_{{\text{t}}}\) here denotes the output gate; \(W_{{\text{f}}}\), \(W_{i}\), \(W_{{\text{a}}}\) and \(W_{{\text{O}}}\) are the corresponding weight matrices of each gate.
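A minimal NumPy sketch of one LSTM time step following Eqs. (5) to (10) is given below; the dictionary-based weight layout and the omission of bias terms mirror the notation above but are otherwise an assumed organization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM time step following Eqs. (5)-(10); W and U are dicts of
    hidden-to-gate and input-to-gate weight matrices (biases omitted, as in the text)."""
    f_t = sigmoid(W["f"] @ h_prev + U["f"] @ x_t)   # forget gate,  Eq. (5)
    i_t = sigmoid(W["i"] @ h_prev + U["i"] @ x_t)   # input gate,   Eq. (6)
    a_t = np.tanh(W["a"] @ h_prev + U["a"] @ x_t)   # cell update,  Eq. (7)
    c_t = c_prev * f_t + i_t * a_t                  # cell state,   Eq. (8)
    o_t = sigmoid(W["o"] @ h_prev + U["o"] @ x_t)   # output gate,  Eq. (9)
    h_t = o_t * np.tanh(c_t)                        # hidden value, Eq. (10)
    return h_t, c_t
```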
Figure 6 shows the hidden-layer structure of the GRU neural network. As the diagram shows, the update gate controls how much the information from the previous moment influences the information at the present moment: the larger the update gate's value, the less influence the past information has. The reset gate controls how much information is retained from the previous moment: the higher its value, the more information is retained.
"1-" means that each element of the vector is subtracted from one. The larger the update gate's value \(z_{{\text{t}}}\) at time t, the more the hidden-layer value \(h_{{\text{t}}}\) at time t is influenced by the candidate value \(\tilde{h}_{{\text{t}}}\) at time t and the less by the hidden-layer value \(h_{{{\text{t}} - 1}}\) at time t-1. If \(z_{{\text{t}}}\) is approximately 1, the hidden-layer value \(h_{{{\text{t}} - 1}}\) at moment t-1 has no effect on the hidden-layer value \(h_{{\text{t}}}\) at moment t. The update gate thus better captures the influence of long-range information in the time series on the current moment. For the reset gate's value \(r_{{\text{t}}}\) at moment t, a larger value means that the candidate value \(\tilde{h}_{{\text{t}}}\) at moment t is more strongly influenced by the hidden-layer value \(h_{{{\text{t}} - 1}}\) at moment t-1. If \(r_{{\text{t}}}\) is approximately zero, the hidden-layer value \(h_{{{\text{t}} - 1}}\) at time t-1 makes no contribution to the candidate value \(\tilde{h}_{{\text{t}}}\) at time t. The reset gate thus better captures the influence of short-range information in the time series on the current moment.
Given the input value \(x_{{\text{t}}} \left( {t = 1,2, \ldots ,n} \right)\), the values of the hidden layer of the GRU neural network at instant t are [27]
$$z_{{\text{t}}} = g\left( {W_{z} \cdot \left[ {h_{{{\text{t}} - 1}} ,\;x_{{\text{t}}} } \right]} \right)$$
(11)
$$r_{{\text{t}}} = g\left( {W_{{\text{r}}} \cdot \left[ {h_{{{\text{t}} - 1}} } \right.,\;\left. {x_{{\text{t}}} } \right]} \right)$$
(12)
$$\tilde{h}_{{\text{t}}} = f\left( {W_{{\tilde{h}}} \cdot \left[ {r_{{\text{t}}} \bigcirc h_{{{\text{t}} - 1}} } \right.,\;\left. {x_{{\text{t}}} } \right]} \right)$$
(13)
$$h_{{\text{t}}} = \left( {1 - z_{{\text{t}}} } \right)\bigcirc h_{{{\text{t}} - 1}} + z_{{\text{t}}} \bigcirc \tilde{h}_{{\text{t}}}$$
(14)
where "[ ]" denotes the concatenation of two vectors, and "○" denotes element-wise multiplication; when "○" acts on two vectors, the operation is
$$a\bigcirc {\text{b}} = \left[ {\begin{array}{*{20}c} {a_{1} } \\ {a_{2} } \\ {a_{3} } \\ \vdots \\ {a_{n} } \\ \end{array} } \right]\bigcirc \left[ {\begin{array}{*{20}c} {b_{1} } \\ {b_{2} } \\ {b_{3} } \\ \vdots \\ {b_{n} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {a_{1} b_{1} } \\ {a_{2} b_{2} } \\ {a_{3} b_{3} } \\ \vdots \\ {a_{n} b_{n} } \\ \end{array} } \right]$$
(15)
From Eqs. (11) to (14), it can be seen that the weight matrices to be trained for the GRU neural network at time \(t\) are \(W_{z}\), \(W_{r}\) and \(W_{{\tilde{h}}}\), each of which is composed of two weight matrices, i.e.,
$$W_{z} = W_{zx} + W_{{z\tilde{h}}}$$
(16)
$$W_{r} = W_{rx} + W_{{r\tilde{h}}}$$
(17)
$$W_{{\tilde{h}}} = W_{{\tilde{h}x}} + W_{{\tilde{h}\tilde{h}}}$$
(18)
where \(W_{zx}\), \(W_{rx}\) and \(W_{{\tilde{h}x}}\) are the weight matrices from the input value to the update gate, the reset gate and the candidate value, respectively; \(W_{{z\tilde{h}}}\), \(W_{{r\tilde{h}}}\) and \(W_{{\tilde{h}\tilde{h}}}\) are the weight matrices from the previous hidden-layer value to the update gate, the reset gate and the candidate value, respectively.
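The GRU update of Eqs. (11) to (14) can likewise be sketched in a few lines; the concatenated weight layout below reflects the decomposition of Eqs. (16) to (18), while the function name, the shapes and the omission of biases are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step following Eqs. (11)-(14).

    Each weight matrix acts on the concatenation [h_{t-1}, x_t], matching
    the decomposition of Eqs. (16)-(18); biases are omitted as in the text."""
    concat = np.concatenate([h_prev, x_t])                        # "[ , ]" in Eqs. (11)-(13)
    z_t = sigmoid(W_z @ concat)                                   # update gate, Eq. (11)
    r_t = sigmoid(W_r @ concat)                                   # reset gate,  Eq. (12)
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate,   Eq. (13)
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand                     # new hidden value, Eq. (14)
    return h_t
```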
The training method for GRU neural networks is based on back-propagation theory and consists of four main steps.
(1) Forward computation of each neuron's output value.

(2) Back-propagation of the error term for each neuron. The back-propagation of the GRU error term has two aspects: one is back-propagation along time, i.e., calculating the error term at each moment starting from the current moment; the other is transferring the error term to the previous layer.

(3) Calculation of the corresponding weight gradients from the error terms using the optimization algorithm.

(4) Updating the weights with the obtained gradients. In this paper, stochastic gradient descent (SGD) is used to compute the weight gradients; a minimal sketch is given after this list. Ordinary batch gradient descent (BGD) computes all samples in each iteration before updating the gradient, whereas SGD computes a random subset of samples and then updates. Compared with BGD, SGD is better able to avoid falling into local extrema and does not need to process all samples in each iteration, which balances computational efficiency and accuracy.
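A minimal sketch of the SGD update used in step (4) is given below; the batch size, the learning rate and the gradient callback grad_fn (standing in for steps (2) and (3)) are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

def sgd_update(weights, grad_fn, samples, batch_size=32, lr=0.01, rng=None):
    """One SGD iteration: pick a random batch, compute the weight gradients,
    and update the weights. `weights` is a dict of arrays, `samples` a NumPy array."""
    rng = rng or np.random.default_rng()
    batch = samples[rng.choice(len(samples), size=batch_size, replace=False)]
    grads = grad_fn(weights, batch)   # gradients from the back-propagated error terms
    return {k: w - lr * grads[k] for k, w in weights.items()}
```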
Blockchained supply chain prediction
Figure 7 shows the flow of blockchained supply chain prediction based on the GRU neural network. As shown in the figure, the model consists of an input layer, a hidden layer and an output layer. The input layer performs outlier processing and normalization on the storage-layer parameters and feeds the processed data into the hidden layer. The purpose of normalization is to keep the maximum and minimum values of the input data within the effective range of the hidden-layer and output-layer activation functions. The normalization formula used in this paper is
$$\begin{array}{*{20}c} {\tilde{x}_{i} = \frac{{x_{i} - x_{{{\text{min}}}} }}{{x_{{{\text{max}}}} - x_{{{\text{min}}}} }}} & {i = 1,2, \ldots ,n} \\ \end{array}$$
(19)
where \(x_{{{\text{min}}}}\) and \(x_{{{\text{max}}}}\) are the minimum and maximum values of \(x_{i}\), respectively.
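A small helper pair illustrating the normalization of Eq. (19) and the denormalization later performed by the output layer; the function names are illustrative only.

```python
import numpy as np

def normalize(x):
    """Min-max normalization, Eq. (19); also returns the extrema needed later."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def denormalize(x_tilde, x_min, x_max):
    """Inverse of Eq. (19), as applied by the output layer."""
    return x_tilde * (x_max - x_min) + x_min
```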
When training the neural network, the hidden layer receives the data, performs the calculations with the constructed GRU neural network and passes the results to the output layer; the output layer denormalizes the results to produce the output, which is compared with the sample values, and the weight coefficients of the hidden layer are updated iteratively until training ends. When making predictions, the hidden layer receives the data, performs the calculations with the trained GRU neural network and passes the results to the output layer; the output layer denormalizes them to produce the prediction results.
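The training and prediction flow just described might be sketched end to end as follows; to keep the example self-contained, a simple linear autoregressive model stands in for the GRU hidden layer, and the window length, number of epochs and learning rate are assumed values.

```python
import numpy as np

def train_and_predict(series, window=8, epochs=100, lr=0.01):
    """Illustrative flow: normalize (Eq. 19), fit a model on sliding windows
    with SGD-style updates, then denormalize the one-step-ahead prediction.
    A linear autoregressive stand-in replaces the full GRU hidden layer."""
    x_min, x_max = series.min(), series.max()
    s = (series - x_min) / (x_max - x_min)            # input-layer normalization
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=window)            # stand-in hidden-layer weights
    for _ in range(epochs):                           # iterative weight updates
        for t in range(window, len(s)):
            err = w @ s[t - window:t] - s[t]          # compare output with sample value
            w -= lr * err * s[t - window:t]           # gradient step
    y = w @ s[-window:]                               # one-step-ahead prediction
    return y * (x_max - x_min) + x_min                # output-layer denormalization
```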
Multi-head attention mechanism
To further improve performance, we add a multi-head attention (MHA) mechanism to the GRU. An attention layer can be viewed as a function that maps a query and a set of key-value pairs to an output, and applying attention with multiple heads to the queries, keys and values has proved advantageous. The multi-head attention layer computes the hidden information by linearly projecting the context vectors into several subspaces, and it performs better than single-head attention. The output is generated from weighted values, with the weights determined by the queries and the corresponding keys.
The time-dimension calculation for attention weighting is given by
$$s_{{\text{t}}} = {\text{softmax}}\left( {o_{{\text{last }}} \times \left( {o_{{\text{all }}} \times W_{{\text{t}}} } \right)^{{\text{H}}} } \right),\quad o_{{\text{last }}} \in R^{{{\text{B}},1,{\text{Z}}}}$$
(20)
$$o_{{\text{t}}} = s_{{\text{t}}} \times o_{{\text{all }}} ,\;o_{{\text{all }}} \in R^{{{\text{B,}}\;{\text{T,}}\;{\text{Z}}}} ,\;s_{{\text{t}}} \in R^{{{\text{B}},\;1,\;{\text{T}}}}$$
(21)
where \(s_{t}\) denotes the time-dimension attention score, \(o_{{\text{last }}}\) stands for the output at the most recent time step, and \(o_{{\text{all }}}\) refers to the output over all time steps. \(B\) stands for the batch size, \(T\) for the number of time steps, and \(Z\) for the feature dimension; the dimension \(1\) corresponds to the most recent time step. \(o_{t}\) denotes the output of the time-dimension attention layer, \(W_{t}\) is a parameter matrix, and \(H\) stands for the transpose operator.
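A NumPy sketch of the single-head time-dimension attention of Eqs. (20) and (21), using the shapes stated above; the function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def time_attention(o_all, o_last, W_t):
    """Single-head time-dimension attention, Eqs. (20)-(21).

    o_all : (B, T, Z) outputs of all time steps; o_last : (B, 1, Z) last-step output;
    W_t   : (Z, Z) parameter matrix."""
    keys = o_all @ W_t                                # (B, T, Z)
    s_t = softmax(o_last @ keys.transpose(0, 2, 1))   # (B, 1, T), Eq. (20)
    o_t = s_t @ o_all                                 # (B, 1, Z), Eq. (21)
    return o_t, s_t
```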
The single-head attention calculation is shown in Eqs. (20) and (21). For attention we use only two kinds of GRU output. The output over all time steps is essential because it contains information from every GRU output; the last-time-step output is chosen because it aggregates the information of the preceding time steps. To compute the queries, keys and values for multi-head time-dimension attention, we likewise use these two forms of output:
$$K_{i} = W_{i,k} \times o_{{\text{all }}} + b_{i,k} ,\;K_{i} \in R^{{{\text{B,}}\;{\text{T}},\frac{{\text{Z}}}{n}}} ,\;W_{i,k} \in R^{{{\text{Z}},\frac{{\text{Z}}}{n}}} ,\;b_{i,k} \in R^{{\frac{{\text{Z}}}{n}}}$$
(22)
$$V_{i} = W_{i,v} \times o_{{\text{all }}} + b_{i,v} ,\;V_{i} \in R^{{{\text{B,}}\;{\text{T}},\frac{{\text{Z}}}{n}}} ,\;W_{i,v} \in R^{{{\text{Z}},\frac{{\text{Z}}}{n}}} ,\;b_{i,v} \in R^{{\frac{{\text{Z}}}{n}}}$$
(23)
$$Q_{i} = W_{i,q} \times o_{{\text{last }}} + b_{i,q} ,\;Q_{i} \in R^{{{\text{B}},1,\frac{{\text{Z}}}{n}}} ,\;W_{i,q} \in R^{{{\text{Z}},\frac{{\text{Z}}}{n}}} ,\;b_{i,q} \in R^{{\frac{{\text{Z}}}{n}}}$$
(24)
where \(K\), \(V\) and \(Q\) represent the key, value and query, respectively; \(n\) is the number of attention heads and \(b\) denotes a bias vector.
The following formulas are used to calculate the multi-head attention scores, the per-head context vectors and the concatenated context \(C\).
$$s_{i} = {\text{softmax}}\left( {Q_{i} \times K_{i}^{{\text{H}}} } \right),\;s_{i} \in R^{{\text{B,1,T}}}$$
(25)
$${\text{context}}_{i} = s_{i} \times V_{i} ,\;{\text{context}}_{i} \in R^{{{\text{B}},1,\frac{{\text{Z}}}{n}}}$$
(26)
$$C = {\text{ Concat }}\left( {\left[ {{\text{context}}_{1} , \ldots ,{\text{context}}_{n} } \right]} \right),\;C \in R^{{\text{B,1,Z}}}$$
(27)
where \({\text{context}}_{i}\) denotes the reduced-dimension context vector from the i-th subspace and \(s_{i}\) denotes the multi-head time-dimension attention score. Figure 8 illustrates the overall structure of multi-head time-dimension attention. The concatenated context vector \(C\) is then fed into the fully connected layer, and its output is passed to the softmax layer, which makes the final prediction.
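The multi-head computation of Eqs. (22) to (27) might be sketched as follows; the per-head parameters are passed as lists, which is an assumed organization, and the projections are written as o × W to match the output shapes stated above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_time_attention(o_all, o_last, W_k, b_k, W_v, b_v, W_q, b_q):
    """Multi-head time-dimension attention, Eqs. (22)-(27).

    o_all: (B, T, Z); o_last: (B, 1, Z). Each parameter argument is a list of
    n per-head projection matrices of shape (Z, Z/n) or biases of shape (Z/n,)."""
    contexts = []
    for i in range(len(W_q)):                        # one pass per attention head
        K_i = o_all @ W_k[i] + b_k[i]                # (B, T, Z/n),   Eq. (22)
        V_i = o_all @ W_v[i] + b_v[i]                # (B, T, Z/n),   Eq. (23)
        Q_i = o_last @ W_q[i] + b_q[i]               # (B, 1, Z/n),   Eq. (24)
        s_i = softmax(Q_i @ K_i.transpose(0, 2, 1))  # (B, 1, T),     Eq. (25)
        contexts.append(s_i @ V_i)                   # (B, 1, Z/n),   Eq. (26)
    return np.concatenate(contexts, axis=-1)         # C in (B, 1, Z), Eq. (27)
```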