An improved content-based outlier detection method for ICS intrusion detection

Due to the complexity of industrial control systems and the diversity of protocols in networks, it is difficult to build intrusion detection models based on network characteristics and physical modeling. In order to build a better flow model without additional knowledge, we propose an intrusion detection method based on the content of network packets. The construction of the model is based on the idea of ZOE method. The similarity between flows is calculated through the sequential coverage algorithm, the normal flow model is established by multi-layered clustering algorithm, and the Count-Mean-Min Sketch is used to store and count the flow model. By comparing the unknown flow with the constructed normal flow model, we achieve the intrusion detection of industrial control system (ICS). The overall experimental results on 4 ICS datasets show that the improved method can effectively improve the detection rate and reduce the false-positive rate. The detection rate reached 96.7% on average, and the false-positive rate reached 0.7% on average.


Introduction
In recent years, the wave of informatization technology has swept the world, and information communication and manufacturing technology are integrated. At present, industrial control systems (ICSs) are used not only in the field of industrial production, but also in the field of infrastructure [1]. With the integration and complementarity of informatization and industrialization, ICS gradually adopts standardized network protocols and applications connected with the Internet. Emerging technologies, such as Ethernet, embedded system, and wireless network, replace the traditional bus technology and therefore expand the space for the development of ICS [2]; however, it also brings new challenges for its intrusion detection and fault diagnosis [3,4]. In these years, some countries have suffered from damage to their ICS, including the far-reaching outbreak of the Stuxnet virus in Iran [5]; the attack on Ukraine's power grid in 2015 [6], which threw millions of people in total darkness during the Christmas season; and invasions of the power system in Israel since 2016. Therefore, intrusion detection technology of ICS has become the focus of researchers in the field [7].
Due to resource constraints and system isolation, network security was not fully considered in the initial design of ICSs [8]. With the development of modern information system and information technology, potential security problems are gradually exposed. To ensure the stability and reliability of industrial processes, integrated circuits need to be protected. Traditional information system solutions have been rapidly applied in the field of national defense. However, these methods cannot fully detect network physical attacks under realtime and resource-constrained conditions [9]. In recent years, ICS anomaly detection and security protection have been widely studied in the world. However, the widespread use of private binary protocol in industrial networks makes content-based intrusion detection methods very complex and often makes existing methods invalid. The focus of this study is to bridge this gap and improve the performance of industrial anomaly detection methods.
The contributions of this paper are as follows: a) A content-based anomaly detection method for ICS that the performance of ICS is improved; b) A method to construct prototype models for network messages of ICS which constructs different types of traffic models through the original content of network data, without sufficient protocol knowledge or deeper understanding of the physical system.

Related works
Intrusion detection [10] is an active protection technology that detects illegal operations by analyzing the network layer data of the system. Industrial control intrusion detection systems can be divided into three types: content-based attack detection, network featurebased detection, and physical process-based detection [11].
Through analyzing protocol and traffic content, content-based attack detection constructs normal protocol or traffic model to tell the existence of attack traffic in the system from comparison between the normal traffic model and the unknown traffic. Generally, intrusion detection [12,13] is realized through reverse engineering, identification [14], analysis [15,16], and shellcode [17,18]. However, the above methods are usually based on some specific types of protocol traffic, and the detection accuracy of multi-type attacks is low. Because most network intrusions require specific code, such as buffer overflow attacks, it exhibits typical sequential patterns in network traffic. To this end, Rieck and Laskov [19] studied and discussed sequence characteristics and their efficient implementation, and used bag of token technology and q-gram method to deal with the application layer load in network traffic, thus achieving higher detection accuracy without the need to learn or mark the data in advance. Liao and Vemuri [20] first regarded symbol strings in network traffic as words in documents and applied the text classification method based on vector space model to network intrusion detection. Mahoney and Chan [21] and Ingham and Inoue [22] also used text classification method to study the intrusion detection problem. Kruegel and Vigna [23] introduced the q-gram model into intrusion detection to effectively express the closely related characters in the sequence. Lin et al. [24] proposed a novel character-level intrusion detection system based on the convolutional neural network, which treated network traffic records as character sequences and inputted each character into the convolutional neural network as a vector based on alphabetic encoding. This model has a good effect in binary classification and multiclassification. Recently, Wressnegger et al. [25] proposed a method called ZOE to detect intrusion, which effectively utilized the content-based anomaly detection framework and improved the accuracy of anomaly detection. The attack detection method based on network features is mainly to extract network traffic features through machine learning [26,27], so as to build the characteristic model of normal traffic [28]. Vavra and Hromada [29] integrated a variety of classifiers to improve the feature learning and selection methods of classifiers, thus improving ICS intrusion detection capability. In order to filter redundant or useless data in network traffic packets, Su et al. [30] proposed a feature selection method based on automatic learning machine, which can select more important features for network traffic intrusion detection. Imtiaz Ullah et al. [31] proposed a hybrid model for intrusion detection applied to SCADA (Supervisory Control And Data Acquisition) networks based on feature selection and machine learning, which effectively improved the detection accuracy.
Based on modeling of the basic physical process, the attack detection method that is based on physical process is capable to predict the data and tell the intrusion after comparing the predicted data with the actual one [32]. Wang et al. [33] proposed a method based on extended distribution state estimation to detect new types of erroneous data injection attacks in smart grids. Kurt et al. [34] modeled the smart grid system as a discrete time linear dynamic system, estimated the system state by using Kalman's filter, and realized the rapid detection of network attacks through accumulation and algorithm.
However, if there is no effective protocol specification to assist network data preprocessing, it is difficult to deeply analyze in-depth network packet contents. On the other hand, it is difficult to build a physical process-based intrusion detection system, which requires a full understanding of the system flow. Moreover, it is also not ideal for detecting multiple types of attacks. Therefore, this paper proposes a content-based improved ZOE intrusion detection method, which constructs different types of traffic models through the original content of network data, without sufficient protocol knowledge or deeper understanding of the physical system. In addition, by integrating the sequence coverage similarity algorithm [35] with the multi-layered clustering algorithm [36], the accuracy of model construction and intrusion detection in ICS is improved. The experimental results show that the model can effectively improve the intrusion detection effect in ICS.

Sequence coverage similarity algorithm
Initializing a dictionary A, A * is defined as a set of sequences of all elements on A. For any subset B ⊆ A * , B sub is extracted through B ∪ A , and the multisets M (B sub ) of set B sub can be obtained.
For an element e in M (B sub ) and a sequence s in A * , e can be called partial coverage of s if and only if the following two conditions are met: All subsequences in e are subsequences in s. Indistinguishable copies of a particular element in e correspond to distinct occurrences of the same subsequence in s.
If e can completely cover s, it means that an optimal set e * B (s) can be found from the permutation and combination of elements in e, which means that s can be covered by the least number of elements. The coverage similarity can be defined through the above steps: For any pair of sequences s 1 and s 2 , the similarity between them is: The greater the value of ϕ seq , the more similar s 1 and s 2 are, and vice versa.

The ZOE method
The ZOE method is a content-based anomaly detection method, which uses the n-gram method in natural language processing [37] to map flows to the corresponding feature vector, which has a good interpretability for the content of flows. K-means clustering [38] is used to construct protocol models and flow models of different categories without supervision, and the comparison between the established model and the current input flow could be used to determine whether there is an anomaly. A flow set D containing only normal flows is used to construct the normal flow model. For any flow d i in D, use n-gram to extract all substrings subs i l of length n in d i . Do this for each flow in D to obtain the substring set S composed of all substrings in D. Then, d i can be mapped to the corresponding eigenvector by S: where the function of occ is to calculate the frequency of the lth substring subs i l in d i . The value of each dimension in the feature vector is the frequency of each substring in S. For D = {d 1 , d 2 , · · · , d n } can use this way into the corresponding feature vector set The flow set D is transformed into the corresponding eigenvector set V , and then, different types of normal flow models are constructed by clustering the eigenvector in V . Input the number of cluster classes k and initialize k cluster classes C 1 , C 2 , · · · , C k . The eigenvector v l is classified as the corresponding cluster class j by the construction function j = arg max prox i∈ [1,k] where w is the eigenvector of cluster class C. Each cluster class can be regarded as the flow model of the corresponding type.
Since the dimension of eigenvectors obtained from n-gram processing is relatively large, the substrings in each cluster class are pruned after the completion of clustering. Set the threshold value t. If the frequency P m,subs i l of a substring subs i l is less than t for all flows in cluster C m , the substring can be eliminated to achieve the dimension reduction and denoising, so as to optimize the constructed flow model.
After completing the construction of the flow models, we can judge whether any flow is abnormal or not. Set the abnormal threshold T, and for any flow m with unknown properties, the outlier score of each flow model is calculated: where p is the number of substrings extracted by n-gram method for flow m. The above formula expressed the inconsistent degree of m with flow model C, also is the probability of m / ∈ C. If score (m, C) ≥ T, flow m is abnormal.

ICS intrusion detection based on the improved ZOE method
The framework of the ZOE method makes use of content-based anomaly detection for proprietary binary protocols, because the content models are very well usable for environments that rely on undocumented protocols with high-entropy data. On the other hand, the framework of the ZOE method introduces the concept of the prototype models which characterize not only the structure of message types but also the data they typically contain. To this end, we introduce the framework of the ZOE method and the concept of prototype models in it. The original ZOE method used n-gram method to extract substrings and took the frequency of each substring as the corresponding value of each dimension in the feature vector. However, n-gram essentially divides the flow into isolated units to be processed, and this processing mode corresponds to discrete one-hot vectors in mathematical form, which cannot consider the internal connection. Moreover, it is also very critical for the value of n. When n increases, more constraint information will appear on the next symbol, with greater discrimination. When n decreases, more substrings will appear, with more reliable statistical information. On the other hand, the original ZOE method adopted kmeans to cluster the feature vectors. Since k-means is a supervised learning algorithm, some prior knowledge is required to set a value for k that can achieve better results. But in practical applications, it is usually impossible to directly determine the number of flow models to be built, and a proper k value can only be determined through continuous trials and experiments. Moreover, using the original k-means cannot make the classifier of the training diversified, thus leading to the data difficult to identify and classify.
The multi-layered cluster can build a machine learning model that learns from nonlabeled or partially labeled data. So, it has the capability to learn from partially labeled data while achieving a detection performance comparable to that of supervised machine learning-based intrusion detection and prevention system. Therefore, this paper applies the sequential overlay similarity algorithm to the similarity calculation of flow. The similarity between flows is calculated based on the original flow, the contents in the original flow are fully considered, and the multi-layered clustering algorithm combined with the sequence coverage similarity algorithm is used to cluster the flow, so as to construct the flow model based on the original flow. Intrusion detection is carried out by the flow models.
A flow set D containing only normal flow is used to construct the normal flow model, and the flow in D is multi-layer clustered. The flows in D are divided into marked data and unmarked data. Each piece of marked data has a class label indicating what type of (2020) 2020:103 Page 6 of 15 flow it is (e.g., TCP, UDP, binary, text), while unmarked data does not have any class labels. Labeled data is denoted as D labeled , unlabeled data is denoted as D unlabeled , and flow set D = D labeled, D unlabeled . Specifically, D labeled = (d 1 , y 1 ) , (d 2 , y 2 ) , · · · , (d n , y n ) where n is the number of labeled data D labeled and y is the corresponding label where N is the amount of data in the dataset D. Clusters are generated at different k values on different layers using different sets of initialization parameters. If there is an L layer, the cluster generated on this L layer can be represented as: C 1,1 , · · · , C 1,k 1 , · · · , C L,1 , · · · , C L,k L which contains three types of clusters, namely the fully labeled cluster, the partially labeled cluster, and the unlabeled cluster. The multi-layer cluster then identifies the three types of clusters and builds a learning model on each cluster. The learning model built on each layer can be regarded as a different basic classifier, which can be utilized to build an integrated model covering the whole decision space. The final label of each flow is determined by the corresponding classifier with the most votes on different layers.
Take the lth layer for example: First, a dictionary A is initialized with the flow set D, which contains all the numbers or letters that appeared in D. A * is the set of substrings of length n composed of all elements on A. Any flow d i in flow set D is divided into substrings subd i = subd i 1 , · · · , subd i l of length n. The similarity degree dist (d i , d n ) between subd i and other flow substring set subd n is calculated by the sequence coverage similarity algorithm.
θ subd i , subd n = 1 l l k=1 ϕ seq subd i k , subd n k (10) By constructing the function j = * arg max prox i∈ [1,k] (d i , C i ), the flow d i is classified as the corresponding cluster class j in the lth layer.
where d n is the flow in cluster class C. Thus, the clustering of the lth layer is completed. After the above operations, the clustering result: C 1,1 , · · · , C 1,k 1 , · · · , C L,1 , · · · , C L,k L can be obtained from each layer (Fig. 1). By selecting category C i,1 , · · · , C i,k L which has the highest number of votes among all the decisions at all layers, as the final classification of the flow. In this way, the construction of the flow models is completed, and the nature of any unknown type of flow can be determined accordingly.
In order to facilitate the subsequent judgment of the property of any flows, a data structure is needed to store the flows in the models constructed above and count the flow in each model. Since the estimated result of Count-Min Sketch [39] is always not less than the actual value, and noise may be generated in the process of querying the flow, Count-Mean-Min Sketch [40] can be used for calculation. The use of Count-Mean-Min Sketch is more extensive. It reduces the collision probability during the flow storage and filters some noises. The diagram of Count-Mean-Min Sketch is shown in Fig. 2.
where prox * (m, C i ) represents the similarity between flow m and flow model C i . If the maximum value of the similarity between the flow and all models is less than T, it can be  for the flows in D do 4: j = arg max prox * i∈ [1,k] (d i , C i ); 5: Add d i to C j ; 6: end for 7: end for 8: Choose the C i,1 , · · · , C i,k L with the highest number of votes among all decisions at all layers, C i,1 , · · · , C i,k L are the final flows' models C; 9: Use Count-Mean-Min Sketch to store C; 10: For the unknown flow m, calculate the score * (m, C); 11: if score * ≥ T then 12: return abnormal, C; 13: else 14: return normal, C; 15: end if judged as abnormal flow; otherwise, it is normal flow. When the opposite value is taken for prox * (m, C i ), it becomes score * ≥ T, and the flow is abnormal.

The dataset
In order to verify the improved ZOE method, this paper uses the industrial control intrusion detection standard dataset [41] established by Mississippi State University (MSU) in 2014 and three public datasets [42,43] to test and verify the method. These datasets contain the original record of the process parameter values and associated labels that indicate whether the flow is normal or abnormal.
The dataset established by MSU is derived from the network layer data of the natural gas pipeline control system. The researchers used 28 types of attacks to break into the industrial control system while using a network data recorder to monitor and store data collected the Modbus flow from RS-232. Each piece of data in the dataset is a sequence record of 27 dimensions, the first 26 dimensions represent 26 different eigenvalues, and the last one dimension represents 1 classification label, of which 26 characteristics are shown in Table 1. Classification labels represent different forms of attack, as shown in Table 2. All data has been numerically processed and can be divided into four categories of attack data: reconnaissance attack, command injection attack (MSCI, MPCI, and MFCI), denial of service attack, and response injection attack (NMRI and CMRI). Two of the three public datasets use single-hop and multi-hop topologies. Both were collected from an outdoor real wireless sensor network, for a duration of 6 h, and contain two process parameters (temperature and humidity). Each dataset has a tiny partition that marks the danger state. These datasets are called SORD (Single-hop Outdoor Real-time Data) and MORD (Multi-hop Outdoor Real-time Data). Each dataset is further divided

The evaluation criteria
In this paper, the application effect of the method is evaluated by the detection rate and false-positive rate. The detection rate is the ratio of the number of abnormal flows correctly identified in the dataset to the total number of abnormal flows.
The false-positive rate represents the ratio of the number of normal flows marked as abnormal flows to all normal flows.
where TP (true positive) refers to the number of abnormal flows that have been correctly detected, FN (false negative) refers to the number of abnormal flows that have occurred but has not yet been detected, FP (false positive) refers to the number of normal flows that have been incorrectly marked as abnormal, and TN (true negative) refers to the number of normal flows that have been correctly identified. In addition, in order to reflect the sensitivity of the method to abnormal flow, this paper also uses the hazard score [44] to compare the ZOE method with the improved ZOE method, which is defined as follows: the hazard score of a data point can be regarded as the distance from the hazard data point. The smaller the distance, the higher the hazard score. Conversely, the greater the distance, the less dangerous it is, the less sensitive it is to the abnormal flow, and the slower it responds to the abnormal. It is calculated by the following formula: where n is the number of dangerous data points in the dataset, and p is the number of dangerous data points in the first n data points of the hazard score.

Experiment 1: Compare the counting effect with different data structures
The Count-Mean-Min Sketch data structure is adopted to store and count the flow, and d = 7 hash functions are used to map the flow. The same setting applies to the Count-Min Sketch data structure. The approximate count and the difference between the actual values obtained from the two data structures were statistically analyzed. The experimental results are shown in Figs. 3 and 4, respectively. The horizontal axis represents the difference between the approximate count and the actual value, and the vertical axis represents the relative frequency of each difference. It can be seen that when the Count-Mean-Min Sketch is used, the relative frequency of the approximate value equal to the actual value increases, which shows that when Count-Mean-Min Sketch is used to count the flow, the probability of its value is equal to the actual value increases, thus improving the accuracy of similarity calculation. In addition, compared with the difference distribution range of Count-Min Sketch, the difference value of Count-Mean-Min Sketch is relatively concentrated in a smaller range, which indicates that in the case of Count-Mean-Min Sketch, even if errors still occur, the error range is narrowed and the overall accuracy is improved compared with Count-Min Sketch.

Experiment 2: Compare the detection effect of abnormal flow
The improved ZOE method and the original ZOE method in this paper are applied to the same four datasets. In addition, we compare our method with two related methods proposed in recent years. These two methods are data-driven clustering [44] and improved K-means clustering [45], respectively. The results shown in Table 3 are obtained in terms of detection rate and false-positive rate.
In MSU datasets with more diverse data types, compared with the original method, the improved ZOE method improved the detection rate of abnormal data without increasing the false-positive rate, in which the detection rate was close to 0.96. The detection rate on the MORD dataset increased by 0.02, with the most significant improvement. When applied to the SORD dataset and DUWWTP dataset, although the improvement effect of detection rate is less obvious, the false-positive rate of abnormal flow detection in the dataset is reduced.
Therefore, when the improved ZOE method is applied to the above four datasets, it not only improves the detection rate, but also reduces the false-positive rate. While in some datasets the detection rate does not increase significantly, the false-positive rate is reduced.

Experiment 3: Compare the hazard score
Calculate the hazard score of abnormal flow when the improved ZOE method and the original ZOE method are applied to the same four datasets above. The number of the first n hazard data points is calculated respectively, and the formula is applied to evaluate the results to obtain the comparison result of the hazard score, as shown in Table 4.
During the experiment, the ZOE method and improved ZOE method are used to calculate the hazard score for each data point in the dataset. The improved method increased the hazard score on all four datasets. It reflects that the improved method is more sensitive to the abnormal flow and can respond to the abnormal more quickly. Among them, the sensitivity of DUWWTP dataset improved most obviously, while the sensitivity of other datasets also improved by 0.01 to 0.02. Therefore, compared with the original ZOE method, the improved ZOE method improves the sensitivity to abnormal flow and enhances the response speed to abnormal flow when the exception occurs.

Experiment 4: The ROC curve of improved ZOE method
In the course of this evaluation, we describe detection performances with the aid of the receiver operator characteristics (ROC) and corresponding ROC curves. These curves plot the true-positive rate over the false-positive rate of a detector for different thresholds.
The Fig. 5 tells that the closer the ROC curve is to the upper left corner corresponds to fewer classification errors. In other words, the larger the area under ROC curve (AUC) equals a better classification effect. We can see that the method proposed in this paper quickly approaches the upper left corner. Additionally, we use the AUC as a single continuous measure for the detection performance that yields a minimal and maximal value of 0.0 and 1.0, respectively. And the AUC value of the improved ZOE method can reach 0.98.

Conclusion
This paper introduces the ZOE method and analyzes the defects and shortcomings of the method. For example, it is difficult to select parameters by using n-gram method and the accuracy of eigenvector clustering by using original k-means algorithm is insufficient. Then, sequential coverage similarity algorithm is proposed to calculate the similarity between any two flows in the industrial control system. The SMLC model is used to construct the normal flow model, and the sequence coverage similarity is used as the measure of the distance between the clusters. Based on the original ZOE method, an improved intrusion flow detection model of industrial control system is proposed by combining SMLC hierarchical clustering with sequence coverage similarity. The improved model is used to construct a higher quality normal flow model, which improves the accuracy of abnormal flow detection and reduces the false-positive rate. And the improved model also improves the sensitivity to the abnormal flow in the ICS and increases the response speed to the abnormal flow when it occurs. Moreover, the Count-Mean-Min Sketch is used to store the normal flow models. Compared with the difference distribution range of Count-Min Sketch, the difference of Count-Mean-Min Sketch is relatively concentrated in a smaller range. The probability of the value equal to the actual value increases, which improves the accuracy of the model for intrusion detection in industrial control systems.