Network security situation prediction based on feature separation and dual attention mechanism

Li, Zhijian; Zhao, Dongmei; Li, Xinghua; Zhang, Hongbin

doi:10.1186/s13638-021-02050-x

Research
Open access
Published: 21 September 2021

EURASIP Journal on Wireless Communications and Networking volume 2021, Article number: 177 (2021)

Abstract

With the development of smart cities, network security has become increasingly important. To improve the safety of smart cities, this paper presents a situation prediction method based on feature separation and a dual attention mechanism. First, because intrusion activity is a time series event, the model is built by stacking recurrent neural network (RNN) layers or layers of an RNN variant. Then, we propose a feature separation method, which alleviates the overfitting problem and reduces the cost of model training by keeping the feature dimension unchanged. Finally, limited attention is proposed on the basis of global attention. We sum the outputs of the two attention modules to form a dual attention mechanism, which improves feature representation. Experiments show that, compared with existing prediction algorithms, the method achieves higher accuracy in network security situation prediction. In other words, the technique can help smart cities predict network attacks more accurately.

1 Introduction

A smart city mainly uses new-generation information technologies to plan, manage and build the city. Many cities are currently building smart cities, and network security is a problem that must be faced in their construction. If a smart city does not have a good network security environment, the technologies it relies on may be exploited by hackers, which would affect its normal operation. To achieve and maintain the security and privacy of smart cities, we use network security situation prediction.

In 1988, Endsley defined situation awareness [1]. Endsley also discussed the construction of situation awareness and divided it into three levels: perception of the elements in the current situation, comprehension of the current situation, and projection of future status. Bass et al. [2] argued that data flows on the Internet resemble aircraft flying in the air, and that future network management will resemble air traffic control (ATC). Inspired by the situation awareness needed in ATC, they argued that the next generation of IDSes can combine short-term sensor data with long-term knowledge databases to create cyberspace situation awareness.

Later researchers proposed more than a dozen network security situation awareness models, but they all have three basic functions [3]. The first is the extraction of network security situation elements, the second is the evaluation of the network security situation, and the third is the prediction of the network security situation. The main work of this paper is network security situation prediction. The main contributions of this paper are as follows:

1. We propose a feature separation (FS) method. Compared with the one-hot method, it alleviates the overfitting problem and reduces the cost of model training by keeping the feature dimension unchanged.
2. On the basis of global attention, we propose a limited attention mechanism. We sum the outputs of the two attention modules to form a dual attention mechanism (DAM), which improves feature representation.
3. We use RNN, LSTM and GRU to validate our method, and experiments are carried out on two data sets. The feature selection and preprocessing of the AWID data set may also serve as a reference for other work.

The rest of this paper is arranged as follows. Section 2 briefly reviews existing work. Section 3 introduces RNN, RNN variants and the global attention mechanism. Section 4 describes our work in detail. Section 5 presents the experiments and results. Finally, Sect. 6 concludes the paper and outlines directions for future work.

2 Related work

There is a large body of research on network security situation prediction. Some researchers use traditional machine learning methods, while others combine intelligent optimization algorithms with machine learning. Currently, many researchers use deep learning algorithms to study network security situation prediction.

Wang et al. [4] compared several network security situation prediction methods. The results show that the radial basis neural network optimized by particle swarm optimization algorithm had better performance, but the number of data samples used in the experiment was too small. Zhang et al. [5] proposed a network security situation prediction method based on BP neural network. The authors used seeker optimization algorithm to find the best weight and introduced simulated annealing algorithm to improve the global search ability of the algorithm, but BP neural network is not suitable for time series data. Hu et al. [6] proposed a situation prediction algorithm of SVM based on MapReduce to solve the problem that the training time of SVM is long when there are a large number of samples. Compared with the traditional model, this prediction model improves the prediction accuracy and reduces the training time. Aiming at the slow convergence of the adaptive genetic algorithm in wavelet neural network, Zhang et al. [7] proposed an improved niche genetic algorithm for parameter optimization of wavelet neural network. The results show that the improved algorithm has faster convergence speed and higher accuracy.

As network security data are time series data, many scholars use RNN or its variants for situation prediction. Zhu et al. [8] proposed an improved LSTM algorithm for situation awareness. To speed up the convergence of the improved LSTM and reduce the learning time, they proposed the Nadam with Look-ahead (NAWL) algorithm, which is based on the Nadam optimization algorithm. Experiments show that the algorithm achieves a relatively small mean square error. If an attention mechanism were added to the algorithm, better results could be expected. Shang et al. [9] proposed a simplified bidirectional LSTM. Compared with the traditional LSTM, this algorithm predicts better, but the authors did not consider the extraction of the training set and test set. Hu et al. [10] proposed a network security situation prediction method based on RNN. The experiment shows that the method is feasible and has good prediction ability, but the number of samples used is too small. Li et al. [11] used LSTM to build a three-layer model with ReLU as the activation function. Experiments on the KDD CUP 99 data set reached an accuracy beyond 90.56%, which is still not satisfactory.

3 Basic theory

3.1 Variants of RNN

Recurrent neural network is an extension of traditional feedforward neural network and has the ability to process variable-length sequence input [12]. Long short-term memory (LSTM) is a variant of recurrent neural network, which is used to solve the problem of gradient vanishing, and gated recurrent unit (GRU) is a simplification of LSTM. The next two sections will briefly review LSTM and GRU.

3.1.1 Long short-term memory

Long short-term memory is a variant of RNN proposed by Hochreiter and Schmidhuber. In this section, the description of LSTM follows that of Chung et al. [12]. Given the input $X = (x_{1}, x_{2},\ldots ,x_{T^{'}})$, we can describe the LSTM as follows.

Firstly, LSTM has three gates, which are forget gate $f_{t}$, input gate $i_{t}$, and output gate $o_{t}$.

$$\begin{aligned} & f_{t} = \sigma (W_{f}x_{t}+U_{f}h_{t-1}+V_{f}c_{t-1}) \end{aligned}$$

(1)

$$\begin{aligned} & i_{t} = \sigma (W_{i}x_{t}+U_{i}h_{t-1}+V_{i}c_{t-1}) \end{aligned}$$

(2)

$$\begin{aligned} & o_{t} = \sigma (W_{o}x_{t}+U_{o}h_{t-1}+V_{o}c_{t}) \end{aligned}$$

(3)

where $\sigma$ is the sigmoid activation function and $t\in [1,T^{'}]$. $T^{'}$ is the number of time steps. $W_{f}$, $W_{i}$, $W_{o}$, $U_{f}$, $U_{i}$ and $U_{o}$ are weight matrices. $V_{f}$, $V_{i}$ and $V_{o}$ are diagonal matrices. Then, the memory cell $c_{t}$ is obtained from the forget gate $f_{t}$ and the input gate $i_{t}$.

$$\begin{aligned} c_{t} = f_{t}c_{t-1} + i_{t}\tilde{c}_{t} \end{aligned}$$

(4)

where forget gate $f_{t}$ means forgetting part of the information of the previous memory cell $c_{t-1}$, and input gate $i_{t}$ indicates that part of the information of a new memory cell $\tilde{c}_{t}$ is added to the memory cell $c_{t}$. The calculation formula of the new memory cell $\tilde{c}_{t}$ is shown as follows.

$$\begin{aligned} \tilde{c}_{t} = tanh(W_{c}x_{t}+U_{c}h_{t-1}) \end{aligned}$$

(5)

In this formulation, $W_{c}$ and $U_{c}$ are weight matrices, and tanh is an activation function:

$$\begin{aligned} tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \end{aligned}$$

(6)

Finally, the output $h_{t}$ can be obtained through the output gate $o_{t}$ and the memory cell $c_{t}$.

$$\begin{aligned} h_{t} =o_{t} tanh(c_{t}) \end{aligned}$$

(7)

The function of the output gate $o_{t}$ is to select appropriate information from memory cell for target prediction.
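Equations (1) to (7) can be read as a single update step. The following is a minimal NumPy sketch of that step, not the authors' implementation. The function names (`lstm_step`, `init_lstm_params`, `run_lstm`), the parameter dictionary layout, the zero initial states and the random initialization are all assumptions made for illustration. Because $V_{f}$, $V_{i}$ and $V_{o}$ are diagonal, they are stored as vectors and applied element-wise, and the products in Eqs. (4) and (7) are taken element-wise.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Eqs. (1)-(7); p is a dict of parameters."""
    # Diagonal matrices V_f, V_i, V_o are kept as vectors (element-wise product).
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["V_f"] * c_prev)  # Eq. (1)
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["V_i"] * c_prev)  # Eq. (2)
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)                  # Eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde                                     # Eq. (4)
    # The output gate looks at the updated memory cell c_t, as in Eq. (3).
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["V_o"] * c_t)     # Eq. (3)
    h_t = o_t * np.tanh(c_t)                                               # Eq. (7)
    return h_t, c_t


def init_lstm_params(input_dim, hidden_dim, seed=0):
    """Hypothetical initializer: small Gaussian weights, for illustration only."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "fioc":
        p[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p[f"U_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    for gate in "fio":
        p[f"V_{gate}"] = rng.normal(scale=0.1, size=hidden_dim)
    return p


def run_lstm(X, p):
    """Apply lstm_step over a sequence X of shape (T', input_dim)."""
    hidden_dim = p["U_f"].shape[0]
    h = np.zeros(hidden_dim)  # assumed zero initial hidden state
    c = np.zeros(hidden_dim)  # assumed zero initial memory cell
    outputs = []
    for x_t in X:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return np.stack(outputs)  # shape (T', hidden_dim)
```

Note the order of evaluation inside `lstm_step`: the memory cell is updated before the output gate is computed, because Eq. (3) depends on $c_{t}$ rather than $c_{t-1}$.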

3.1.2 Gated recurrent unit

The gated recurrent unit (GRU) was proposed by Cho et al. [13], inspired by LSTM. Compared with LSTM, GRU is simpler to compute and implement. GRU has two gates, the reset gate $r_{t}$ and the update gate $z_{t}$.

$$\begin{aligned} & r_{t} = \sigma (W_{r}x_{t}+U_{r}h_{t-1}) \end{aligned}$$

(8)

$$\begin{aligned}& z_{t} = \sigma (W_{z}x_{t}+U_{z}h_{t-1}) \end{aligned}$$

(9)

where $\sigma$ is the sigmoid activation function. $W_{r}$, $W_{z}$, $U_{r}$ and $U_{z}$ are weight matrices. We can get a new hidden state $\tilde{h}_{t}$ through reset gate $r_{t}$.

$$\begin{aligned} \tilde{h}_{t} = tanh(Wx_{t}+U(r_{t}\odot h_{t-1})) \end{aligned}$$

(10)

In this formulation, W and U are weight matrices. tanh is the activation function shown in (6). $\odot$ stands for the Hadamard Product. Here, reset gate $r_{t}$ indicates how much information to obtain from the previous hidden state $h_{t-1}$. If the reset gate $r_{t}$ is 0, no information will be obtained from the previous hidden state.

Then, we can get the current hidden state $h_{t}$ through the update gate $z_{t}$.

$$\begin{aligned} h_{t} = z_{t}h_{t-1}+(1-z_{t}) \tilde{h}_{t} \end{aligned}$$

(11)

Update gate $z_{t}$ represents how much information is carried from the previous hidden state to the current hidden state. The $(1-z_{t})$ represents how much information is carried from the new hidden state $\tilde{h}_{t}$ to the current hidden state.
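As with LSTM, Eqs. (8) to (11) describe one update step. A minimal NumPy sketch under stated assumptions follows: the names `gru_step` and `init_gru_params`, the parameter dictionary and the random initialization are hypothetical, and the products in Eq. (11) are taken element-wise.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, p):
    """One GRU step following Eqs. (8)-(11); p is a dict of weight matrices."""
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)           # Eq. (8), reset gate
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)           # Eq. (9), update gate
    # r_t scales how much of the previous hidden state enters the candidate.
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev))   # Eq. (10)
    # z_t keeps the old state, (1 - z_t) admits the new candidate.
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde                  # Eq. (11)
    return h_t


def init_gru_params(input_dim, hidden_dim, seed=0):
    """Hypothetical initializer: small Gaussian weights, for illustration only."""
    rng = np.random.default_rng(seed)
    p = {}
    for name in ("W_r", "W_z", "W"):
        p[name] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    for name in ("U_r", "U_z", "U"):
        p[name] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    return p
```

With all weights set to zero, both gates equal 0.5 and the candidate state is zero, so the step simply halves the previous hidden state. This is a convenient sanity check for Eq. (11).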

3.2 Global attention

Since Bahdanau et al. [14] introduced the attention mechanism to machine translation, it has become increasingly popular and plays an increasingly important role in neural networks. Global attention was proposed by Luong et al. [15]. In this paper, given the input $X = (x_{1}, x_{2},\ldots ,x_{T^{'}}, \beta )$, we use the global attention shown in Fig. 1.

In Fig. 1, $\bar{h}_{s}$ is the s-th source hidden state, h is the target hidden state, and $\alpha$ is the alignment weight vector. Here, $\beta$ marks the end of the input. It is a feature vector. Based on $\bar{h}_{s}$ and h, $\alpha$ can be obtained by an align function.

$$\begin{aligned} &\alpha _{s} = align(h,\bar{h}_{s}) \\ &\quad =\frac{exp(score(h,\bar{h}_{s}))}{\sum _{s^{'}=1}^{T^{'}}exp(score(h,\bar{h}_{s^{'}}))} \end{aligned}$$

(12)

In this formulation, $\alpha _{s}$ is the weight of $\bar{h}_{s}$, $T^{'}$ is the number of time steps, and score is a dot product function.

$$\begin{aligned} score(h,\bar{h}_{s}) = h^{T}\bar{h}_{s} \end{aligned}$$

(13)

A global context vector c is then computed as the weighted average, according to $\alpha$, over all the source hidden states.

$$\begin{aligned} c = \sum _{s^{'}=1}^{T^{'}}\alpha _{s^{'}}\bar{h}_{s^{'}} \end{aligned}$$

(14)

Finally, we employ a simple concatenation layer to combine the information from both vectors to get an attentional hidden state.

$$\begin{aligned} \tilde{h} = [c;h] \end{aligned}$$

(15)
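Equations (12) to (15) amount to a softmax over dot-product scores followed by a weighted sum and a concatenation. The sketch below is one possible NumPy rendering, not the authors' code. The function name `global_attention` and the array layout (source hidden states stacked row-wise) are assumptions. Subtracting the maximum score before exponentiation is a standard numerical-stability step that leaves the softmax in Eq. (12) unchanged.

```python
import numpy as np


def global_attention(h, H_src):
    """Global attention following Eqs. (12)-(15).

    h      -- target hidden state, shape (d,)
    H_src  -- source hidden states stacked row-wise, shape (T', d)
    Returns the alignment weights alpha (T',), the context vector c (d,)
    and the attentional hidden state [c; h] of shape (2d,).
    """
    scores = H_src @ h                     # Eq. (13): dot product per source state
    scores = scores - scores.max()         # stability shift, softmax is invariant to it
    weights = np.exp(scores)
    alpha = weights / weights.sum()        # Eq. (12): softmax over source positions
    c = alpha @ H_src                      # Eq. (14): weighted average of source states
    h_tilde = np.concatenate([c, h])       # Eq. (15): concatenation [c; h]
    return alpha, c, h_tilde
```

When the target hidden state is the zero vector, every score is zero, the weights are uniform, and the context vector reduces to the plain mean of the source hidden states.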

4 Methods

In this section, we first introduce the proposed feature separation (FS) and limited attention. We then combine global attention, feature separation and limited attention into a network security situation prediction model based on feature separation and dual attention mechanism (FSDAM).

4.1 Feature separation

Since categorical data exist in the data set and cannot be fed directly into the model like the numerical data, many researchers use one-hot encoding method. However, the one-hot method will make the dimension of features larger, which will cause the model to train more parameters and overfit. In this paper, in order to keep the dimension of features unchanged, we propose a feature separation method based on word embedding. Since categorical data and numerical data are processed separately, we call this technique feature separation, or FS for short.

To explain one-hot encoding, word embedding and feature separation (FS) clearly, we constructed several artificial samples, as shown in Table 1. ‘rsvp,’ ‘tcp’ and ‘udp’ are transaction protocols. ‘INT,’ ‘FIN’ and ‘CON’ indicate the state and its dependent protocol.
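The dimension argument can be illustrated with a small NumPy sketch. It is not the FS method itself, whose details are given later in the paper. It only contrasts the width of a one-hot encoding with that of an embedding lookup for a single categorical column. The column values reuse the protocol names from the text, while the sample order, the embedding size of 1 (chosen here so that one categorical feature stays one column) and the random, untrained embedding table are assumptions made for illustration.

```python
import numpy as np

# Hypothetical values of one categorical feature (protocol names from the text).
protocols = ["tcp", "udp", "rsvp", "tcp"]

# Map each category to an integer index.
vocab = {name: idx for idx, name in enumerate(sorted(set(protocols)))}
indices = np.array([vocab[name] for name in protocols])

# One-hot encoding: the single feature expands to len(vocab) columns.
one_hot = np.eye(len(vocab))[indices]

# Embedding lookup with an assumed embedding size of 1: the feature stays
# one column wide. In a real model this table would be trained.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 1))
embedded = embedding_table[indices]

print(one_hot.shape)   # one column per category
print(embedded.shape)  # one column regardless of the number of categories
```

The one-hot width grows with the number of distinct categories, whereas the embedded width is fixed by the chosen embedding size, which is the property the text relies on when it speaks of keeping the dimension unchanged.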

Table 1 The samples we created
