The ZOE framework applies content-based anomaly detection to proprietary binary protocols, because content models are well suited to environments that rely on undocumented protocols carrying high-entropy data. The framework also introduces prototype models, which characterize not only the structure of message types but also the data they typically contain. We therefore first introduce the ZOE framework and its concept of prototype models.
The original ZOE method uses n-grams to extract substrings and takes the frequency of each substring as the value of the corresponding dimension of the feature vector. However, the n-gram approach splits a flow into isolated units. Mathematically, this corresponds to discrete one-hot vectors, which cannot capture the relationships between units. The method is also sensitive to the choice of n. As n increases, each substring places more constraints on the next symbol and is therefore more discriminative. As n decreases, more substring occurrences are observed, which yields more reliable statistics. In addition, the original ZOE method clusters the feature vectors with k-means. Although k-means is an unsupervised algorithm, the number of clusters k must be fixed in advance, so some prior knowledge is required to choose a k that gives good results. In practical applications, it is usually impossible to determine directly how many flow models should be built, and a suitable k can only be found through repeated trials and experiments. Moreover, plain k-means cannot provide diversified classifiers during training, which makes some data difficult to identify and classify.
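The trade-off in the choice of n can be illustrated with a minimal sketch of n-gram frequency counting. The function name `ngram_counts` and the sample payload are hypothetical and chosen only for illustration; this is not the ZOE implementation.

```python
from collections import Counter


def ngram_counts(payload: bytes, n: int) -> Counter:
    """Count every contiguous substring of length n in the payload."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sliding window of width n yields len(payload) - n + 1 substrings.
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))


if __name__ == "__main__":
    payload = b"ABABAB"  # illustrative payload, not real traffic
    for n in (2, 4):
        counts = ngram_counts(payload, n)
        # Smaller n: more occurrences per substring; larger n: fewer, more specific ones.
        print(n, sum(counts.values()), dict(counts))
```

Each distinct substring becomes one dimension of the feature vector, with its count as the value.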
Multi-layered clustering can build a machine learning model that learns from unlabeled or partially labeled data while achieving a detection performance comparable to that of supervised machine-learning-based intrusion detection and prevention systems. This paper therefore applies the sequence coverage similarity algorithm to the similarity calculation between flows. The similarity is calculated on the original flows, so that their contents are fully taken into account. The multi-layered clustering algorithm is then combined with the sequence coverage similarity algorithm to cluster the flows and to construct the flow models from the original flows. Intrusion detection is carried out with these flow models.
A flow set D containing only normal flows is used to construct the normal flow models, and the flows in D are clustered in multiple layers. The flows in D are divided into labeled and unlabeled data. Each labeled flow has a class label indicating what type of flow it is (e.g., TCP, UDP, binary, text), while unlabeled flows have no class label. The labeled data is denoted as \(D_{labeled}\), the unlabeled data as \(D_{unlabeled}\), and the flow set is \(D = \left\{ D_{labeled}, D_{unlabeled} \right\}\). Specifically,
\(D_{labeled} = \left\{ (d_{1},y_{1}),(d_{2},y_{2}), \cdots,(d_{n},y_{n}) \right\}\), where n is the number of labeled flows and \(y_{i}\) is the label of \(d_{i}\), and
\(D_{unlabeled} = \left\{ d_{n+1},d_{n+2}, \cdots,d_{N} \right\}\), where N is the total number of flows in D. Clusters are generated with different k values on different layers, using a different set of initialization parameters for each layer. If there are L layers, the clusters generated on these layers can be represented as:
\(\left \{ {{C_{1,1}}, \cdots,{C_{1,{k_{1}}}}} \right \}, \cdots,\left \{ {{C_{L,1}}, \cdots,{C_{L,{k_{L}}}}} \right \}\)
which contain three types of clusters, namely fully labeled, partially labeled, and unlabeled clusters. The multi-layer clustering then identifies the three types of clusters and builds a learning model on each cluster. The learning model built on each layer can be regarded as a different base classifier, and these classifiers are combined into an integrated model that covers the whole decision space. The final label of each flow is the one that receives the most votes from the classifiers on the different layers.
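The voting step can be sketched as follows. This is a minimal illustration under the assumption that each layer's classifier outputs exactly one label per flow. The function name `majority_vote`, the label strings, and the tie-breaking rule are all hypothetical; the text does not specify how ties are resolved.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(layer_labels: Sequence[Hashable]) -> Hashable:
    """Return the label predicted by the largest number of layers.

    Assumption: ties go to the label that was seen first, which is how
    Counter.most_common orders elements with equal counts.
    """
    if not layer_labels:
        raise ValueError("at least one layer prediction is required")
    return Counter(layer_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # One prediction per layer for the same flow (illustrative labels).
    print(majority_vote(["binary", "text", "binary"]))
```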
Take the \(l\)th layer as an example. First, a dictionary A is initialized from the flow set D; it contains all the digits and letters that appear in D. \(A^{*}\) is the set of all substrings of length n composed of elements of A. Each flow \(d_{i}\) in the flow set D is divided into substrings \(sub{d^{i}} = \left \{ {subd_{1}^{i}, \cdots,subd_{l}^{i}} \right \}\) of length n. The similarity \(dist(d_{i},d_{n})\) between \(d_{i}\) and another flow \(d_{n}\) is calculated from their substring sets \(sub{d^{i}}\) and \(sub{d^{n}}\) by the sequence coverage similarity algorithm:
$$ dist\left({{d_{i}},{d_{n}}} \right) = \theta \left({sub{d^{i}},sub{d^{n}}} \right) $$
(9)
$$ \theta \left({sub{d^{i}},sub{d^{n}}} \right) = \frac{1}{l}{\sum\nolimits}_{k = 1}^{l} {{\varphi_{seq}}} \left({subd_{k}^{i},subd_{k}^{n}} \right) $$
(10)
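Equations (9) and (10) can be sketched as below. The text does not define \(\varphi_{seq}\), so `phi_seq` here is only a placeholder (the fraction of positions at which two substrings agree), not the sequence coverage similarity itself. Splitting flows into non-overlapping chunks and truncating to the shorter flow are likewise assumptions. Note that `dist` is a similarity: larger values mean more similar flows.

```python
from typing import List


def split_flow(flow: bytes, n: int) -> List[bytes]:
    """Split a flow into consecutive substrings of length n.

    Assumption: chunks do not overlap and a shorter trailing chunk is kept.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [flow[i:i + n] for i in range(0, len(flow), n)]


def phi_seq(a: bytes, b: bytes) -> float:
    """Placeholder for phi_seq: fraction of positions where a and b agree."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def theta(sub_i: List[bytes], sub_n: List[bytes]) -> float:
    """Eq. (10): mean of phi_seq over position-wise paired substrings."""
    l = min(len(sub_i), len(sub_n))  # assumption: truncate to the shorter flow
    if l == 0:
        return 0.0
    return sum(phi_seq(sub_i[k], sub_n[k]) for k in range(l)) / l


def dist(d_i: bytes, d_n: bytes, n: int = 2) -> float:
    """Eq. (9): similarity of two flows via their substring sets."""
    return theta(split_flow(d_i, n), split_flow(d_n, n))


if __name__ == "__main__":
    print(dist(b"ABCD", b"ABCX"))
```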
Using the function \(j = \mathop {\arg \max }\limits _{i \in \left [ {1,k} \right ]} pro{x^{*}}\left ({{d_{i}},{C_{i}}} \right)\), the flow \(d_{i}\) is assigned to the corresponding cluster \(j\) in the \(l\)th layer.
$$ pro{x^{*}}:{d_{i}},C \to \frac{1}{{\left| C \right|}}{\sum\nolimits}_{{d_{n}} \in C} {dist\left({{d_{i}},{d_{n}}} \right)} $$
(11)
where \(d_{n}\) is a flow in cluster \(C\). This completes the clustering of the \(l\)th layer. After the above operations, the clustering result:
\(\left \{ {{C_{1,1}}, \cdots,{C_{1,{k_{1}}}}} \right \}, \cdots,\left \{ {{C_{L,1}}, \cdots,{C_{L,{k_{L}}}}} \right \} \)
can be obtained from each layer (Fig. 1). The category \(\left \{ {{C_{i,1}}, \cdots,{C_{i,{k_{i}}}}} \right \}\) that receives the highest number of votes among the decisions of all layers is selected as the final classification of the flow. In this way, the construction of the flow models is completed, and the nature of any flow of unknown type can be determined accordingly.
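The assignment step of a single layer, Eq. (11) followed by the arg max, can be sketched as follows. The similarity is passed in as a parameter; `toy_similarity` is a hypothetical stand-in used only to make the example runnable, and the cluster contents are invented for illustration.

```python
from typing import Callable, Sequence

Similarity = Callable[[bytes, bytes], float]


def prox(d: bytes, cluster: Sequence[bytes], dist: Similarity) -> float:
    """Eq. (11): mean similarity between flow d and the flows of a cluster."""
    if not cluster:
        return 0.0
    return sum(dist(d, d_n) for d_n in cluster) / len(cluster)


def assign(d: bytes, clusters: Sequence[Sequence[bytes]], dist: Similarity) -> int:
    """Return the index j of the cluster with the largest prox* for flow d."""
    if not clusters:
        raise ValueError("at least one cluster is required")
    return max(range(len(clusters)), key=lambda j: prox(d, clusters[j], dist))


def toy_similarity(a: bytes, b: bytes) -> float:
    """Hypothetical stand-in for dist: share of positions where a and b agree."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


if __name__ == "__main__":
    clusters = [[b"GET1", b"GET2"], [b"PUT1", b"PUT2"]]  # illustrative clusters
    print(assign(b"GET9", clusters, toy_similarity))  # index of the closest cluster
```

Repeating this assignment on every layer gives one decision per layer, which is the input to the vote described above.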
To make it easier to judge the nature of any flow later, a data structure is needed to store the flows of the models constructed above and to count the flows in each model. Because the estimate returned by Count-Min Sketch [39] is never less than the actual value, and noise may be introduced when a flow is queried, Count-Mean-Min Sketch [40] is used for the calculation instead. Count-Mean-Min Sketch is more widely applicable: it reduces the impact of hash collisions during flow storage and filters out part of the noise. The diagram of Count-Mean-Min Sketch is shown in Fig. 2.
In Fig. 2, value is the flow, d is the depth, and w is the width of the sketch. Each flow is processed by d hash functions \(h_{i}\) to determine the positions \(p_{i} = h_{i}(value)\), \(i \in [1,d]\), \(p_{i} \in [0,w-1]\). To store a flow m, each hash function maps m to a position in the corresponding row, and the counter at that position is updated. To query m, the same hash functions are used to read the d counters associated with m. The minimum of these values approximates the true value, that is, the approximate value of |C|.
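A minimal sketch of such a structure is given below, following one commonly described formulation of Count-Mean-Min: each row's counter is corrected by an estimate of its collision noise, the median of the corrected values is taken, and the result is capped by the plain count-min estimate. This may differ in detail from [40]. The class name, the salted BLAKE2b row hashes, and the sample flows are assumptions made for illustration.

```python
import hashlib
from statistics import median


class CountMeanMinSketch:
    """d rows of w counters; row i uses its own salted hash (an assumption)."""

    def __init__(self, depth: int, width: int) -> None:
        if depth < 1 or width < 2:
            raise ValueError("depth must be >= 1 and width >= 2")
        self.depth = depth
        self.width = width
        self.total = 0  # number of items added so far
        self.rows = [[0] * width for _ in range(depth)]

    def _position(self, i: int, value: bytes) -> int:
        """p_i = h_i(value), a position in [0, w - 1]."""
        digest = hashlib.blake2b(
            value, digest_size=8, salt=i.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, value: bytes, count: int = 1) -> None:
        self.total += count
        for i in range(self.depth):
            self.rows[i][self._position(i, value)] += count

    def count_min(self, value: bytes) -> int:
        """Plain count-min estimate: never below the true count."""
        return min(self.rows[i][self._position(i, value)] for i in range(self.depth))

    def count_mean_min(self, value: bytes) -> float:
        """Noise-corrected estimate, capped by the count-min estimate."""
        residues = []
        for i in range(self.depth):
            c = self.rows[i][self._position(i, value)]
            noise = (self.total - c) / (self.width - 1)  # mean load of other cells
            residues.append(c - noise)
        return max(0.0, min(median(residues), float(self.count_min(value))))


if __name__ == "__main__":
    sketch = CountMeanMinSketch(depth=4, width=64)
    for _ in range(5):
        sketch.add(b"flow-a")
    sketch.add(b"flow-b", 2)
    print(sketch.count_min(b"flow-a"), sketch.count_mean_min(b"flow-a"))
```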
An anomaly threshold T is set, the anomaly score of any flow m is calculated against each flow model, and the flow is judged to be abnormal or normal from this score.
$$ \text{score}^{*}:m,C \to \mathop {\min }\limits_{i} {d^{*}}\left({m,{C_{i}}} \right) $$
(12)
$$ {d^{*}}:m,{C_{i}} \to 1 - pro{x^{*}}\left({m,{C_{i}}} \right) $$
(13)
where \(pro{x^{*}}(m,{C_{i}})\) represents the similarity between flow m and flow model \(C_{i}\). Since \(\text{score}^{*} = 1 - \max_{i} pro{x^{*}}(m,{C_{i}})\), a flow whose maximum similarity to all models is too low, that is, at most \(1 - T\), satisfies \(\text{score}^{*} \ge T\) and is judged to be abnormal; otherwise, it is normal.
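The decision rule of Eqs. (12) and (13) can be sketched as follows. The sketch assumes the similarities \(pro{x^{*}}(m,{C_{i}})\) have already been computed for every flow model, and that the threshold is applied to the score, i.e. on the dissimilarity scale. The function names and the numeric values are hypothetical.

```python
from typing import Sequence


def anomaly_score(proximities: Sequence[float]) -> float:
    """Eqs. (12)-(13): min over models of d* = 1 - prox*(m, C_i)."""
    if not proximities:
        raise ValueError("at least one flow model is required")
    return min(1.0 - p for p in proximities)


def is_abnormal(proximities: Sequence[float], threshold: float) -> bool:
    """Flag the flow when its score reaches the threshold T."""
    return anomaly_score(proximities) >= threshold


if __name__ == "__main__":
    T = 0.5  # illustrative threshold
    print(is_abnormal([0.75, 0.25], T))   # close to one model
    print(is_abnormal([0.125, 0.25], T))  # far from every model
```

Taking the minimum of \(1 - pro{x^{*}}\) is the same as one minus the maximum similarity, so only the closest model determines the score.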