Network security situation prediction based on feature separation and dual attention mechanism

With the development of smart cities, network security has become increasingly important. To improve the security of smart cities, this paper presents a situation prediction method based on feature separation and a dual attention mechanism. First, because intrusion activity is a time series event, a recurrent neural network (RNN) or an RNN variant is used to build a stacked model. Then, we propose a feature separation method, which alleviates the overfitting problem and reduces the cost of model training by keeping the feature dimension unchanged. Finally, limited attention is proposed on the basis of global attention. We sum the outputs of the two attention modules to form a dual attention mechanism, which improves feature representation. Experiments show that the method achieves higher accuracy in network security situation prediction than other existing prediction algorithms. In other words, the technique can help smart cities predict network attacks more accurately.

by the situation awareness needed in ATC, they believed that the next generation of IDSes can combine short-term sensor data with long-term knowledge databases to create cyberspace situation awareness.
Later researchers proposed more than a dozen network security situation awareness models, all of which share three basic functions [3]: extraction of network security situation elements, evaluation of the network security situation, and prediction of the network security situation. This paper focuses on network security situation prediction. The main contributions of this paper are as follows:
1. We propose a feature separation (FS) method. Compared with the one-hot method, it alleviates the overfitting problem and reduces the cost of model training by keeping the feature dimension unchanged.
2. Building on global attention, we propose a limited attention mechanism. We sum the outputs of the two attention modules to form a dual attention mechanism (DAM), which improves feature representation.
3. We use RNN, LSTM and GRU to validate our method, and two data sets are used for the experiments. The feature selection and preprocessing applied to the AWID data set may also serve as a useful reference.
The rest of this paper is organized as follows. Section 2 briefly reviews existing work. Section 3 introduces RNN, its variants and the global attention mechanism. Section 4 describes our method in detail. Section 5 presents the experiments and results. Finally, Sect. 6 concludes the paper and outlines future work.

Related work
There is a large body of research on network security situation prediction. Some researchers use traditional machine learning methods, while others combine intelligent optimization algorithms with machine learning. Currently, many researchers use deep learning algorithms to study network security situation prediction. Wang et al. [4] compared several network security situation prediction methods. The results show that the radial basis function neural network optimized by the particle swarm optimization algorithm performed better, but the number of data samples used in the experiment was too small. Zhang et al. [5] proposed a network security situation prediction method based on a BP neural network. The authors used the seeker optimization algorithm to find the best weights and introduced the simulated annealing algorithm to improve the global search ability, but a BP neural network is not well suited to time series data. Hu et al. [6] proposed an SVM situation prediction algorithm based on MapReduce to address the long training time of SVM on large numbers of samples. Compared with the traditional model, this prediction model improves the prediction accuracy and reduces the training time. To address the slow convergence of the adaptive genetic algorithm in wavelet neural networks, Zhang et al. [7] proposed an improved niche genetic algorithm for parameter optimization of wavelet neural networks. The results show that the improved algorithm converges faster and achieves higher accuracy.
As network security data are time series data, many scholars use RNN or its variants for situation prediction. Zhu et al. [8] proposed an improved LSTM algorithm for situation awareness. To speed up the convergence of the improved LSTM and reduce the learning time, they proposed the Nadam with Look-ahead (NAWL) algorithm, based on the Nadam optimization algorithm. Experiments show that the algorithm achieves a relatively small mean square error. If an attention mechanism were added to the algorithm, better results could be expected. Shang et al. [9] proposed a simplified bidirectional LSTM. Compared with the traditional LSTM, this algorithm predicts better, but the authors did not consider how the training set and test set are extracted. Hu et al. [10] proposed a network security situation prediction method based on RNN. The experiment shows that the method is feasible and has good prediction ability, but the number of samples used is too small. Li et al. [11] used LSTM to build a three-layer model with ReLU as the activation function. In experiments on the KDD CUP 99 data set, the accuracy exceeded 90.56%, which is still not satisfactory.

Variants of RNN
A recurrent neural network is an extension of the traditional feedforward neural network that can process variable-length sequence input [12]. Long short-term memory (LSTM) is a variant of the recurrent neural network designed to address the vanishing gradient problem, and the gated recurrent unit (GRU) is a simplification of LSTM. The next two sections briefly review LSTM and GRU.

Long short-term memory
Long short-term memory is a variant of RNN proposed by Hochreiter and Schmidhuber. In this section, the description of LSTM follows that of Chung et al. [12]. Given the input X = (x_1, x_2, ..., x_{T′}), the LSTM can be described as follows.
First, LSTM has three gates: the forget gate f_t, the input gate i_t and the output gate o_t. Each gate is computed with the sigmoid activation function σ for t ∈ [1, T′].
The forget gate f_t discards part of the information in the previous memory cell c_{t−1}, and the input gate i_t controls how much of the new memory cell c̃_t is added to the memory cell c_t. The new memory cell c̃_t is computed with the weight matrices W_c and U_c and the tanh activation function.
Finally, the output h_t is obtained from the output gate o_t and the memory cell c_t. The function of the output gate o_t is to select appropriate information from the memory cell for target prediction.
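For reference, the LSTM update described in this section can be written out as below. This is a sketch that assumes the common formulation without bias or peephole terms. The gate weight matrices W_f, U_f, W_i, U_i, W_o and U_o are names introduced here for illustration, and the exact equations used by the authors may differ.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1}) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1}) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1}) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```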

Gated recurrent unit
The gated recurrent unit was proposed by Cho et al. [13], inspired by LSTM. Compared with LSTM, the gated recurrent unit is simpler to compute and implement. First, GRU has two gates: the reset gate r_t and the update gate z_t. Both are computed with the sigmoid activation function σ, using the weight matrices W_r and U_r for the reset gate and W_z and U_z for the update gate.
A new hidden state h̃_t is then obtained through the reset gate r_t, using the weight matrices W and U, the tanh activation function shown in (6), and the Hadamard product ⊙. The reset gate r_t indicates how much information is taken from the previous hidden state h_{t−1}. If the reset gate r_t is 0, no information is taken from the previous hidden state.
Then, the current hidden state h_t is obtained through the update gate z_t. The update gate z_t represents how much information is carried from the previous hidden state to the current hidden state, and (1 − z_t) represents how much information is carried from the new hidden state h̃_t to the current hidden state.
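The GRU update described above can be illustrated with a minimal pure-Python sketch of one time step. It assumes no bias terms and applies the reset gate to the previous hidden state before the recurrent matrix U. The helper names and the dictionary of weights are hypothetical and chosen only for this example.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def hadamard(u, v):
    return [a * b for a, b in zip(u, v)]

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps weight names to matrices (lists of rows)."""
    r_t = sigmoid(add(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev)))
    z_t = sigmoid(add(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev)))
    # New hidden state: the reset gate scales the previous hidden state.
    pre = add(matvec(p["W"], x_t), matvec(p["U"], hadamard(r_t, h_prev)))
    h_new = [math.tanh(a) for a in pre]
    # z_t keeps the previous state, (1 - z_t) admits the new hidden state.
    return [z * hp + (1.0 - z) * hn for z, hp, hn in zip(z_t, h_prev, h_new)]

# Toy example: with all-zero weights both gates equal 0.5 and the
# new hidden state is 0, so the previous hidden state is halved.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("W_r", "U_r", "W_z", "U_z", "W", "U")}
h_t = gru_step([1.0, 2.0], [0.4, -0.2], params)
```

In a trained model the matrices are learned. The all-zero weights are used here only to make the role of z_t and (1 − z_t) easy to check by hand.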

Global attention
Since Bahdanau et al. [14] introduced the attention mechanism to machine translation, it has become increasingly popular and plays an increasingly important role in neural networks. Global attention was proposed by Luong et al. [15]. In this paper, given the input X = (x_1, x_2, ..., x_{T′}, β), we use the global attention shown in Fig. 1.
In Fig. 1, h_s is the s-th source hidden state, h is the target hidden state, and α is the alignment weight vector. Here, β is a feature vector that marks the end of the input. Based on h_s and h, α is obtained by an align function, in which α_s is the weight of h_s, T′ is the number of time steps, and score is a dot product function.
A global context vector c is then computed as the weighted average, according to α, over all the source hidden states.
Finally, we employ a simple concatenation layer to combine the information from both vectors into an attentional hidden state.
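As an illustration of the steps above, the following sketch computes dot-product alignment weights and the context vector for a toy set of hidden states. It assumes that the align function is a softmax over the scores, as in Luong et al. [15]. The function names are hypothetical, and the learned layer that follows the concatenation is omitted.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def global_attention(source_states, h):
    """Return the alignment weights and context vector for target state h."""
    alpha = softmax([dot(h_s, h) for h_s in source_states])
    context = [sum(a * h_s[k] for a, h_s in zip(alpha, source_states))
               for k in range(len(h))]
    return alpha, context

source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [1.0, 1.0]
alpha, context = global_attention(source_states, h)
concat = context + h  # vector passed on to the concatenation layer
```

The third source state has the largest dot product with h, so it receives the largest weight.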

Methods
In this section, we first introduce our proposed feature separation (FS) and limited attention, and then present a network security situation prediction model based on feature separation and dual attention mechanism (FSDAM), which combines global attention, feature separation and limited attention.

Feature separation
Since categorical data exist in the data set and cannot be fed directly into the model like numerical data, many researchers use the one-hot encoding method. However, one-hot encoding increases the feature dimension, so the model has more parameters to train and is more prone to overfitting. In this paper, to keep the feature dimension unchanged, we propose a feature separation method based on word embedding. Because categorical data and numerical data are processed separately, we call this technique feature separation, or FS for short.
To explain one-hot encoding, word embedding and feature separation (FS) clearly, we artificially created several samples, shown in Table 1. 'rsvp', 'tcp' and 'udp' are transaction protocols. 'INT', 'FIN' and 'CON' indicate the state and its dependent protocol.
The one-hot encoding method uses a vector to represent a word. The length of the vector is the number of words in the vocabulary, and each word has a unique vector. If the index of the word in the vocabulary is i, the vector is 1 at position i and 0 elsewhere. For example, in Table 1, 'rsvp' can be represented by '1 0 0'. Applying the one-hot encoding method yields Table 2, in which the dimension has changed from 3 to 7.
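The growth in dimension from 3 to 7 can be reproduced with a short sketch. The vocabulary order below and the helper names are assumptions made for illustration; the sample combines one numerical value with the two categorical fields described above.

```python
def one_hot(value, vocabulary):
    """Return a list with 1 at the index of value and 0 elsewhere."""
    return [1 if item == value else 0 for item in vocabulary]

# Assumed vocabulary order; 'rsvp' is placed first so that it maps to 1 0 0.
protocols = ["rsvp", "tcp", "udp"]
states = ["INT", "FIN", "CON"]

def encode_sample(numeric, protocol, state):
    # One numerical feature plus two one-hot vectors of length 3 each.
    return [numeric] + one_hot(protocol, protocols) + one_hot(state, states)

encoded = encode_sample(0.11, "udp", "CON")
```

Each categorical feature with three possible values contributes three positions, so a sample with one numerical and two categorical features grows from 3 to 1 + 3 + 3 = 7 dimensions.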
Word embedding generates a vector representation of each word, the length of which is a hyperparameter. If the embedding size is 3, 'rsvp' in Table 1 can be represented by '1.8 0.4 0.1'. Applying the word embedding method yields Table 3.
Table 2 The samples according to the one-hot encoding method
Table 3 The samples according to the word embedding method
Word embedding is only one phase of feature separation. Figure 2 shows the working process of feature separation (FS). First, x_1 is divided into a numerical part and a categorical part. Then, the categorical part undergoes a linear transformation and is summed by rows to obtain h_1. Finally, we employ a simple concatenation layer to combine the numerical part and h_1 into x′_1. The dimension of x′_1 is the same as that of x_1.
The architecture of feature separation is shown in Fig. 3. Here, x_{ij} is the j-th feature vector in the categorical data, and n is the number of features in the categorical data. To obtain n features, the categorical part of x_i undergoes a linear transformation with tanh as the activation function, where W is a weight matrix, b is a bias term, and the result H_i ∈ R^{n×n} is a matrix. Then, a sum or mean is performed. In this paper, the sum is used by default.
The resulting h_i is the numerical representation of the categorical data. Finally, we employ a simple concatenation layer to combine the information from both vectors into x′_i.
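The feature separation steps can be sketched as follows. This is only an illustration under several assumptions: the embedding size, the random initial values, the shape of W (embedding size by n) and all function names are hypothetical, and in practice the embeddings, W and b would be learned during training.

```python
import math
import random

random.seed(0)
EMBED_SIZE = 3  # hypothetical embedding size

def make_embedding(vocabulary):
    """Assign each category a random vector (a stand-in for learned values)."""
    return {w: [random.uniform(-1, 1) for _ in range(EMBED_SIZE)]
            for w in vocabulary}

def feature_separation(numeric, categorical, embeddings, W, b):
    n = len(categorical)
    # Embed each categorical feature: an n x EMBED_SIZE matrix.
    X = [embeddings[j][value] for j, value in enumerate(categorical)]
    # Linear transformation with tanh: H has shape n x n.
    H = [[math.tanh(sum(x[k] * W[k][c] for k in range(EMBED_SIZE)) + b[c])
          for c in range(n)] for x in X]
    # Sum each row so that every categorical feature becomes one number.
    h = [sum(row) for row in H]
    # Concatenate the numerical part with h.
    return numeric + h

protocols = ["rsvp", "tcp", "udp"]
states = ["INT", "FIN", "CON"]
embeddings = [make_embedding(protocols), make_embedding(states)]
n = len(embeddings)
W = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(EMBED_SIZE)]
b = [0.0] * n
x_prime = feature_separation([0.11], ["udp", "CON"], embeddings, W, b)
```

The output has one value per original feature (one numerical and two categorical), so the dimension stays at 3 instead of growing to 7 as with one-hot encoding.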

Limited attention
In Section 3.2, we introduced the global attention mechanism. In this section, based on the global attention mechanism, we propose a limited attention mechanism, shown in Fig. 4. When the model is a deep network, limited attention can alleviate the vanishing gradient problem by using the final output of each layer. Given a network with L = 5 hidden layers, we define the attention mechanism as follows.
First, we employ a slice function to process the layer hidden state h_j, where h_j is the hidden state of the j-th hidden layer and k is the dimension of the hidden state h′ in the last hidden layer. The sliced state is denoted h′_j. The slice function has two cases, and Fig. 5 shows how it works.
Based on the hidden state h′, we obtain the weight α′_j of h′_j through an align function, where score is a dot product function.
A context vector c′ is then computed as the weighted average, according to α′, over all the layer hidden states.
Finally, we employ a simple concatenation layer to combine the information from both vectors into an attentional hidden state.
Compared with global attention, limited attention has a slice function. Only a part of each hidden state is used when the dimensions differ, which is why we call it limited attention. Limited attention uses the final output of each layer as its hidden state, focusing on the impact of the network layers. The work in this paper corresponds to the first case in Fig. 5.
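A minimal sketch of limited attention over the final outputs of five layers is given below. The slicing rule (keep the first k components), the softmax align function and the inclusion of the last layer among the attended states are assumptions made for illustration, since the exact slice cases are defined in Fig. 5. All names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def slice_state(h_j, k):
    """Assumed slicing rule: keep the first k components of h_j."""
    return h_j[:k]

def limited_attention(layer_states):
    """layer_states[j] is the final output of hidden layer j (len >= k)."""
    h_last = layer_states[-1]
    k = len(h_last)
    sliced = [slice_state(h_j, k) for h_j in layer_states]
    # Dot-product score between each sliced state and the last-layer state.
    scores = [sum(a * b for a, b in zip(h_j, h_last)) for h_j in sliced]
    alpha = softmax(scores)
    context = [sum(a * h_j[i] for a, h_j in zip(alpha, sliced))
               for i in range(k)]
    return alpha, context

# Toy outputs of L = 5 layers; the first two layers are wider than the last.
layer_states = [[1.0, 0.0, 5.0], [0.0, 1.0, 9.0],
                [1.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
alpha, context = limited_attention(layer_states)
```

When every layer has the same output size, the slice leaves each state unchanged and the computation reduces to dot-product attention across layers rather than across time steps.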

Design of the FSDAM+GRU model
In the preceding sections, we introduced the gated recurrent unit, global attention, feature separation and limited attention. Based on these methods, we design the FSDAM+GRU model shown in Fig. 6. The model is mainly composed of feature separation (FS), gated recurrent units (GRU) and the dual attention mechanism (DAM), which consists of global attention and limited attention. Inside the model, we also use shortcut connections [16]. First, given the input X = (x_1, x_2, ..., x_{T′}, β), the feature separation (FS) layer produces X′. Then, X′ is fed into the five-layer GRU network for training.
The outputs of the GRU network are fed into the global attention and limited attention modules, which yield two attentional hidden states. Finally, we sum the outputs of the two attention modules. The fused output is connected to a fully connected layer to perform network security situation prediction.

Description of data sets
In this paper, we use two data sets: the UNSW-NB15 data set [17] and the AWID data set created by Kolias et al. [18] in 2015. The following two subsections describe them in detail.

UNSW-NB15 data set
To overcome the shortcomings of the KDD CUP 99 data set, Moustafa et al. [17] created the UNSW-NB15 data set in 2015. The data set contains ten traffic types: Normal, DoS, Fuzzers, Analysis, Exploits, Reconnaissance, Worms, Backdoors, Generic and Shellcode. Each sample contains 49 attributes, two of which are labels: one is used for multi-class classification and the other for binary classification. The data set consists of four files: UNSW-NB15_1.csv, UNSW-NB15_2.csv, UNSW-NB15_3.csv and UNSW-NB15_4.csv. Each of the first three CSV files nominally contains 700,000 records, and the fourth contains 440,044 records. All records are sorted by the last time attribute. Note that, in actual use, we found that each of the first three CSV files in fact contains 700,001 records.

Fig. 6 FSDAM+GRU model
In our experiments, we use the files UNSW-NB15_3.csv and UNSW-NB15_4.csv as our training set and test set. Detailed information about the training set and test set can be seen in Table 4.

AWID data set
The Aegean WiFi Intrusion Dataset (AWID) [18] was extracted from a dedicated WEP-protected 802.11 network with real, rather than artificially generated, traffic. It contains a large amount of normal traffic and attack traffic for 802.11 networks. Compared with KDD CUP 99, AWID is a relatively new data set. Each sample in the data set has 154 features. According to the number of samples, the data set is divided into two versions: the full subset (AWID-ATK-F and AWID-CLS-F) and the reduced subset (AWID-ATK-R and AWID-CLS-R). Both versions were extracted from the real network.
In the experiments, we use AWID-CLS-R-Trn and AWID-CLS-R-Tst as our training set and test set, respectively. The AWID-CLS-R-Trn data set is obtained by monitoring the network for 1 h. The first 45 min is normal traffic, and the next 15 min contains attack traffic. In the training set, there are a total of 1,795,575 records, including 1,633,190 normal records and 162,385 attack records. In the test set, there are a total of 575,643 records, including 530,785 normal records and 44,858 attack records. Table 5 shows detailed information about the training set and test set.

Data set preprocessing
Data set preprocessing is necessary to meet the requirements of the input format. The AWID data set has many missing values and a large number of features, so it cannot be used directly. To improve the prediction accuracy, how to deal with missing values and how to select useful features for training also need to be studied in detail.
In this section, preprocessing includes data filtration and feature selection, feature separation and standardization.

Data filtration and feature selection
Because the AWID data set has missing values and many attributes, we apply the following five-phase solution:
1. In phase one, we discard the features whose proportion of missing values in AWID-CLS-R-Trn exceeds 90%. In this phase, 46 features are discarded.
2. In phase two, we discard eight useless features: frame.time_epoch (#4), wlan.ra (#76), wlan.da (#77), wlan.ta (#78), wlan.sa (#79), wlan.bssid (#80), wlan.wep.iv (#140) and wlan.wep.icv (#142). For example, wlan.ra (#76), wlan.da (#77), wlan.ta (#78) and wlan.sa (#79) only represent hardware MAC addresses.
3. In phase three, we fill in the missing values of the data set. Missing numerical values are filled with the mean, and missing categorical values are filled with miss_i, where i denotes the i-th categorical feature to be filled.
4. In phase four, we convert categorical data into numerical data.
5. In phase five, since constant values have a limited effect on classification, we discard the features that are constant. In this phase, 50 features are discarded.
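Phases one, three and five can be sketched on a small in-memory table as follows. This is a minimal illustration, not the authors' pipeline: the record layout (a list of dictionaries with None for missing values), the toy feature names and the literal string form "miss_1" are assumptions. Phase two (a fixed list of features) and phase four (conversion to numbers) are omitted.

```python
def filter_and_fill(records, categorical, max_missing=0.9):
    """Apply phases one, three and five to a list of dict records."""
    features = list(records[0].keys())
    n = len(records)
    # Phase one: discard features whose missing ratio exceeds max_missing.
    kept = [f for f in features
            if sum(r[f] is None for r in records) / n <= max_missing]
    cleaned = [{f: r[f] for f in kept} for r in records]
    # Phase three: fill numerical gaps with the mean and categorical
    # gaps with a per-feature placeholder "miss_i".
    cat_kept = [f for f in kept if f in categorical]
    for f in kept:
        if f in categorical:
            fill = "miss_%d" % (cat_kept.index(f) + 1)
        else:
            present = [r[f] for r in cleaned if r[f] is not None]
            fill = sum(present) / len(present)
        for r in cleaned:
            if r[f] is None:
                r[f] = fill
    # Phase five: discard features that are constant across all records.
    varying = [f for f in kept if len({r[f] for r in cleaned}) > 1]
    return [{f: r[f] for f in varying} for r in cleaned]

records = [
    {"len": 10.0, "proto": "tcp", "rare": None, "const": 1},
    {"len": None, "proto": None, "rare": None, "const": 1},
    {"len": 30.0, "proto": "udp", "rare": None, "const": 1},
]
result = filter_and_fill(records, categorical={"proto"})
```

In this toy input, "rare" is dropped in phase one, the gaps in "len" and "proto" are filled in phase three, and "const" is dropped in phase five.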
After the above five phases, only 50 of the 154 features of the AWID data set remain. Building on this, we next perform feature selection, whose main purpose is to reduce the dimension, eliminate irrelevant or redundant features, and select important features. In [19], Chen et al. used extra trees for feature selection and selected 20 important features from the 154 features for intrusion detection. The results show that the method is effective for improving accuracy. We ran this algorithm and extracted the 38 most important features from the AWID data set for our experiments. A detailed description of these features is provided in Table 6.
For the UNSW-NB15 data set, we first discard six useless features: srcip (#1), sport (#2), dstip (#3), dsport (#4), stime (#29) and ltime (#30), where srcip (#1) and dstip (#3) only represent IP addresses. Then, because ct_flw_http_mthd (#38), is_ftp_login (#39) and ct_ftp_cmd (#40) have a large number of missing values, these three features are also discarded. In the end, 38 features remain in addition to the two label attributes. Table 7 shows the 38 features used in our experiments.

Standardization and feature separation
After obtaining the 38 features, we repeat part of the preprocessing. In the training set and test set of the AWID data set, missing numerical values are filled with the mean, and missing categorical values are filled with miss_i.
Numerical data and categorical data are then processed separately. For numerical data, each feature is standardized: each value x_{ij} is replaced by (x_{ij} − µ_j) / σ_j, as shown in (25), where x_{ij} is the j-th feature value of the i-th sample, µ_j is the mean value of the j-th feature, and σ_j is the standard deviation of the j-th feature.
For categorical data, we use the word embedding technique, which is widely used in the field of natural language processing.
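The standardization of a numerical feature described in this section can be sketched as follows. The use of the population standard deviation is an assumption, and the function name is hypothetical. Constant features, for which the standard deviation would be zero, are assumed to have been discarded earlier.

```python
import statistics

def standardize(column):
    """Subtract the feature mean and divide by the standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population standard deviation
    return [(x - mu) / sigma for x in column]

scaled = standardize([2.0, 4.0, 6.0])
```

After standardization the feature has zero mean and unit standard deviation, which puts numerical features with different ranges on a comparable scale.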

Model configuration and training
This section describes the configuration used by the model. In this paper, keras-gpu-2.3.1 and tensorflow-gpu-1.15.0 are used to build the model. In our model, we use softplus as the activation function.
In the training phase, the batch size is set to 3072 and the time step is set to 4. We use the mean, maximum and minimum accuracy obtained from five runs to evaluate our model. In addition, Adam [20] is used as the optimizer, with its parameters set to lr=0.001, beta_1=0.9 and beta_2=0.999.
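To make the role of these settings concrete, the sketch below implements the softplus activation and a single Adam update for one scalar parameter in plain Python. The epsilon value is an assumption (the text does not state it), and the function names are hypothetical. The actual experiments rely on the Keras implementations.

```python
import math

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def adam_step(param, grad, m, v, t,
              lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update; eps is an assumed value, not taken from the paper."""
    m = beta_1 * m + (1 - beta_1) * grad          # first-moment estimate
    v = beta_2 * v + (1 - beta_2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta_1 ** t)                 # bias correction
    v_hat = v / (1 - beta_2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by approximately lr regardless of the gradient scale.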

Selection of hyper parameters
To investigate the impact of the output size of each layer on model performance, we designed six hyperparameter configurations based on the FSDAM+GRU model. UNSW-NB15 was used as the data set in this experiment. The results are shown in Table 8 and Fig. 7, where Fig. 7 is a boxplot of the five runs for each configuration.
The boxplot shows an upward trend in median classification accuracy (black dashed line on the box) from HyPa1 to HyPa5, followed by an apparent decrease at HyPa6, which suggests that HyPa5 is a good configuration. The mean and maximum values in the table also show that HyPa5 is superior to the others. Therefore, we use the HyPa5 configuration in the following experiments.

Comparison of one-hot and FS
In this section, we make a detailed comparison between FS and one-hot. The results are shown in Tables 9 and 10 and Fig. 8. 'Params' in the tables is the total number of parameters trained by the model.
The results show that the encoding method has a significant effect on the performance of DAM+GRU. According to the mean, maximum and minimum values in each table, DAM+GRU using FS is superior to DAM+GRU using one-hot, which shows that FS can improve accuracy. The 'Params' values in each table also show that the number of parameters with FS is lower than with one-hot. Because of this decrease in the number of parameters, FS can improve the training efficiency of DAM+GRU. Figure 8 further illustrates that FS can improve accuracy, which is particularly evident on the AWID data set.
The main reason is that the one-hot method increases the feature dimension, and the larger dimension leads to redundant internal parameters, which may cause the model to overfit. On the UNSW-NB15 data set, the input dimension is 38 with FS and 196 with one-hot. On the AWID data set, the input dimension is 38 with FS and 95 with one-hot.
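The effect of the input dimension on the parameter count can be illustrated with a rough calculation for the first recurrent layer only. The hidden size of 64 is hypothetical, the formula assumes a GRU with one bias vector per gate, and the parameters of the FS layer itself are not counted, so the numbers are indicative rather than those reported in the tables.

```python
def gru_layer_params(input_dim, hidden_dim):
    """Weights of one GRU layer: three blocks (reset, update, new state),
    each with an input matrix, a recurrent matrix and a bias vector."""
    return 3 * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)

HIDDEN = 64  # hypothetical layer size
comparison = {
    "UNSW-NB15": (gru_layer_params(38, HIDDEN), gru_layer_params(196, HIDDEN)),
    "AWID": (gru_layer_params(38, HIDDEN), gru_layer_params(95, HIDDEN)),
}
for name, (with_fs, with_one_hot) in comparison.items():
    print(name, with_fs, with_one_hot)
```

Under these assumptions the first layer has 19776 parameters with a 38-dimensional input, compared with 50112 for the 196-dimensional and 30720 for the 95-dimensional one-hot input.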

Comparison of RNN, LSTM and GRU
In this section, we measure the performance of three different recurrent units. The experimental results on the UNSW-NB15 data set are presented in Table 11, and those on the AWID data set are presented in Table 12. Figure 9 compares the results on the two data sets. Figure 9 shows that FSDAM using LSTM is worse than the others in terms of median classification accuracy (black dashed line on the box). In the two tables, the LSTM structure is also lower than the other structures in terms of mean and maximum values. The LSTM structure uses more training parameters and is therefore more likely to overfit, resulting in decreased accuracy. Compared with the LSTM structure, GRU and RNN are relatively simple and have higher time efficiency. Based on the experimental results, we recommend RNN and GRU.
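The difference in training parameters between the three recurrent units can be seen from a simple count for a single layer. The sketch assumes the textbook formulations with one bias vector per block, so the RNN has one weight block, the GRU three and the LSTM four. The input size of 38 matches the FS input dimension, while the hidden size of 64 is hypothetical.

```python
# Number of weight blocks per unit type in the textbook formulations.
GATE_BLOCKS = {"RNN": 1, "GRU": 3, "LSTM": 4}

def recurrent_layer_params(unit, input_dim, hidden_dim):
    """Each block has an input matrix, a recurrent matrix and a bias."""
    per_block = (input_dim * hidden_dim
                 + hidden_dim * hidden_dim
                 + hidden_dim)
    return GATE_BLOCKS[unit] * per_block

counts = {unit: recurrent_layer_params(unit, 38, 64) for unit in GATE_BLOCKS}
```

For the same layer size, an LSTM layer therefore carries four times the parameters of a simple RNN layer and a third more than a GRU layer, which is consistent with the observation that the LSTM structure uses more training parameters.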

Comparison of different methods
In this section, we compare our method with existing methods in detail. Tables 13, 14, 15 and 16 provide the comparisons. Figure 10 shows the comparison results on the UNSW-NB15 data set, and Fig. 11 shows those on the AWID data set.
Figure 10 shows that the method using FSDAM achieves the highest median classification accuracy (black dashed line on the box). In Fig. 11, the accuracy distribution of the method using FSDAM lies higher than that of the other methods. The two figures show that the method using FSDAM can judge the network security situation more accurately. The two figures also show that global attention does not significantly improve the accuracy of one-hot+GRU and even reduces the accuracy of one-hot+RNN. This may indicate that global attention alone does not improve feature representation very well.
In each table, the method using FSDAM also has higher mean and maximum values than the other methods, which indicates that FSDAM is a relatively stable method.
Overall, FSDAM is more suitable for network security situation prediction. Therefore, we recommend FSDAM+RNN and FSDAM+GRU for predicting the security situation.

Conclusion
In this paper, we proposed a network security situation prediction method based on feature separation and a dual attention mechanism. The experimental results show that the method helps improve the accuracy of situation prediction, and that the feature separation (FS) method helps alleviate the overfitting problem and reduce the cost of model training. In future work, the ability of the Transformer [21] in time series prediction deserves attention, and its application to situation prediction needs further research.