Results and discussion of MWM algorithm
To compare the performance of the MWM algorithm with that of the existing STRFINDUNI algorithm, both algorithms were extracted and run separately in the same Linux environment.
Hardware environment: a server with an Intel(R) Xeon(R) 5120 CPU @ 1.86 GHz, 4 cores, and 4 GB of memory.
Test samples: the samples are taken from the SOGOU Chinese Thesaurus.
Test method: The network data matching process is simulated in a real environment. The text is left unchanged while the number of network data pattern strings increases linearly, so the number of hits changes proportionally. The initialization time and the network data matching time of the two algorithms are shown in Figs. 2 and 3, respectively.
Figs. 2 and 3 show that the MWM algorithm takes far more time than the STRFINDUNI algorithm in both the initialization process and the pattern matching process. However, because the MWM algorithm is widely used in Snort, this paper investigated the main factors behind these differences, which are summarized as follows:
Factors influencing network data service client matching:
(a) Different HASH methods: STRFINDUNI builds a one-stage HASH followed by two stages of indexing. During a search, the first 6 bytes are used to screen down to the head address of a linked list of network data service client pattern strings, and the text is then compared with the pattern strings in that linked list one by one. MWM, in contrast, uses only a one-stage HASH to accelerate the lookup and then compares the pattern strings in the network data service client pattern string array one by one. Testing with dictionary files magnifies this difference, because the number of phrases that begin with the same first 2 bytes is very large, often more than 100, whereas the number of words that begin with the same first 6 bytes is far smaller, usually fewer than three.
(b) Different basic units of matching of network data: Because STRFINDUNI matches text in Unicode encoding, its basic unit of matching is the word, and the shifting, HASH, and other operations in the algorithm are carried out in word units. MWM, however, operates on bytes. Since the MWM algorithm relies on jumping to reduce the number of comparisons, this difference in the basic unit of pattern string matching reduces its jump distance by at least half.
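The effect described in factor (a) can be illustrated with a minimal sketch that groups pattern strings by their first 2 bytes and by their first 6 bytes and compares the largest group in each case. The word list, the UTF-16-LE encoding, and the function name `bucket_sizes` are assumptions made for illustration only; they are not taken from STRFINDUNI or MWM.

```python
from collections import defaultdict

def bucket_sizes(patterns, prefix_len):
    """Group patterns by their first prefix_len bytes and return the group sizes."""
    buckets = defaultdict(list)
    for p in patterns:
        buckets[p[:prefix_len]].append(p)
    return {prefix: len(group) for prefix, group in buckets.items()}

# Hypothetical toy dictionary; UTF-16-LE (2 bytes per character here) is an assumption.
words = ["网络", "网络数据", "网络服务", "网站", "网页设计", "数据", "数据库", "数据匹配"]
patterns = [w.encode("utf-16-le") for w in words]

two_byte = bucket_sizes(patterns, 2)   # candidates left after a 2-byte screen
six_byte = bucket_sizes(patterns, 6)   # candidates left after a 6-byte screen

print("largest 2-byte bucket:", max(two_byte.values()))
print("largest 6-byte bucket:", max(six_byte.values()))
```

In this toy list, every word that starts with the same character falls into one 2-byte bucket, while the 6-byte prefixes are all distinct, so a 6-byte screen leaves far fewer strings to compare one by one.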
Factors influencing initialization:
(a) Influence of the sorting operation. After MWM reads in the network data pattern strings, it sorts all the network data service client pattern strings with quicksort. The cost of this operation is not noticeable when the number of pattern strings is small, but when the number of pattern strings is very large, it seriously slows initialization. More importantly, the result of this expensive operation is not put to good use later.
(b) Overhead of copying the pattern strings of network data. STRFINDUNI reads the pattern strings during initialization and stores them by memory copy. MWM, in contrast, only passes pointers to the pattern strings, which avoids the overhead of memory allocation and copying.
Results and discussion of TMWM algorithm
The defects above are not serious flaws in the idea of the MWM algorithm itself; rather, the idea is not fully exploited in this specific environment and under these specific conditions. Therefore, while retaining the core idea of the algorithm, this paper proposes an improved algorithm, TMWM. The main modifications are as follows:
(a) Modification of the basic unit of matching of network data. The jump table of the MWM algorithm is changed to use the word as its unit, and all jump operations are performed as word jumps.
(b) Modification of the HASH method. The sorted pattern string array of network data is fully exploited: a two-stage index, built from two index arrays, is added behind the HASH table at minimal cost, and each index stage is searched by binary search.
(c) Modification of the sorting operation. Because the index operations and the pattern string comparisons use only the first 6 bytes of the entries in the pattern string array, quicksorting on the whole pattern strings is redundant; for the later stages, the array only needs to be sorted on the first 6 bytes.
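As an illustration of modification (a), the following is a minimal sketch of a Wu-Manber-style shift (jump) table built over 16-bit words rather than bytes. It is a generic textbook-style construction under stated assumptions, not the TMWM implementation: the names `to_words` and `build_shift_table`, the block size of 2 units, the UTF-16-LE encoding, and the sample patterns are all hypothetical.

```python
import struct

def to_words(data: bytes):
    """Split a byte string into a tuple of 16-bit little-endian units."""
    return struct.unpack("<%dH" % (len(data) // 2), data)

def build_shift_table(patterns, block=2):
    """Build a Wu-Manber-style shift table over sequences of units.

    Only the first m units of each pattern are used, where m is the
    length of the shortest pattern. All shifts are counted in units.
    """
    m = min(len(p) for p in patterns)
    default = m - block + 1  # shift for blocks that occur in no pattern
    shift = {}
    for p in patterns:
        for end in range(block, m + 1):
            key = tuple(p[end - block:end])
            # Keep the smallest shift so that no occurrence is skipped.
            shift[key] = min(shift.get(key, default), m - end)
    return default, shift

# Hypothetical patterns; UTF-16-LE encoding is an assumption.
patterns = [to_words(w.encode("utf-16-le")) for w in ["网络数据", "数据库"]]
default, shift = build_shift_table(patterns)

block = to_words("络数".encode("utf-16-le"))
print("default shift in words:", default)
print("shift for the block in words:", shift[block])
```

Because every entry is counted in words, a shift of s moves the search window by 2s bytes in this encoding, so the window always stays aligned to character boundaries.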
To show the main retrieval process more clearly, each table entry in Fig. 4 displays only two data items: the number of entries to be matched at the next stage and the starting position of matching at the next stage.
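The retrieval process can be sketched as follows, with each entry holding only a (count, start) pair that points into the next stage, and each index stage searched by binary search. This is a minimal sketch under several assumptions that do not come from the paper: the HASH table is modeled as a dictionary keyed on bytes 0-1, the two index stages are keyed on bytes 2-3 and bytes 4-5, every pattern string is at least 6 bytes long, and all function names are hypothetical.

```python
from bisect import bisect_left
from itertools import groupby

def build_index(patterns):
    """Sort on the first 6 bytes only, then build HASH -> stage 1 -> stage 2."""
    pats = sorted(patterns, key=lambda p: p[:6])
    head = {}               # bytes 0-1 -> (count, start) into stage 1
    keys1, slots1 = [], []  # bytes 2-3 -> (count, start) into stage 2
    keys2, slots2 = [], []  # bytes 4-5 -> (count, start) into pats
    pos = 0
    for k0, g0 in groupby(pats, key=lambda p: p[0:2]):
        start1 = len(keys1)
        for k1, g1 in groupby(g0, key=lambda p: p[2:4]):
            start2 = len(keys2)
            for k2, g2 in groupby(g1, key=lambda p: p[4:6]):
                n = sum(1 for _ in g2)
                keys2.append(k2)
                slots2.append((n, pos))
                pos += n
            keys1.append(k1)
            slots1.append((len(keys2) - start2, start2))
        head[k0] = (len(keys1) - start1, start1)
    return pats, head, keys1, slots1, keys2, slots2

def lookup(keys, slots, slot, key):
    """Binary-search the slice described by slot=(count, start) for key."""
    count, start = slot
    i = bisect_left(keys, key, start, start + count)
    if i < start + count and keys[i] == key:
        return slots[i]
    return None

def match_at(index, text, offset):
    """Return the patterns that occur in text at the given byte offset."""
    pats, head, keys1, slots1, keys2, slots2 = index
    slot = head.get(text[offset:offset + 2])
    if slot is not None:
        slot = lookup(keys1, slots1, slot, text[offset + 2:offset + 4])
    if slot is not None:
        slot = lookup(keys2, slots2, slot, text[offset + 4:offset + 6])
    if slot is None:
        return []
    count, start = slot
    # Only the few patterns sharing the 6-byte prefix are compared in full.
    return [p for p in pats[start:start + count] if text.startswith(p, offset)]

# Hypothetical data; UTF-16-LE encoding is an assumption.
words = ["网络数据", "网络数", "网络服务", "数据库", "数据匹配"]
index = build_index([w.encode("utf-16-le") for w in words])
text = "高性能网络数据匹配".encode("utf-16-le")
hits = [p.decode("utf-16-le") for p in match_at(index, text, 6)]
print(hits)
```

Sorting on the 6-byte prefix alone is sufficient here because the index never looks beyond those bytes; the full comparison is left to the final step over the short candidate range.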
To verify the effectiveness of the high-performance multi-pattern matching method for network data based on TMWM proposed in this paper, simulation experiments are carried out to compare it with the traditional STRFINDUNI algorithm. The main factors that influence the two algorithms are the number of pattern strings of network data, the text size of network data, the mean length of the pattern strings, and the number of hits.
In the following, these four parameters are tested and analyzed in turn.
(a) Number of pattern strings
Test method: The effect of the number of pattern strings on the matching time and the initialization time is tested with the mean pattern string length fixed at 6 bytes, the text size unchanged, and the number of hits roughly constant.
The results are shown in Figs. 5 and 6. With other conditions unchanged, the initialization time of the TMWM algorithm is proportional to the number of pattern strings, whereas the matching time increases only slightly with the number of pattern strings.
(b) Text size
Test method: The effect of the text size on the matching time and the initialization time is tested with the pattern strings unchanged and the number of hits roughly constant.
The results are shown in Figs. 7 and 8. With other conditions unchanged, the matching time of both algorithms is proportional to the text size, and the initialization time remains constant because the pattern strings are unchanged.
(c) Average length of pattern strings
Test method: The effect of the average pattern string length on the matching time and the initialization time is tested with the number of pattern strings, the text size, and the number of hits unchanged.
The results are shown in Figs. 9 and 10. The TMWM algorithm is less affected by the pattern string length, whereas the STRFINDUNI algorithm is strongly affected by it, with especially large changes in the matching time.
(d) Number of hits
Test method: The effect of the number of hits on the matching time and the initialization time is tested with the number of pattern strings, the average pattern string length, and the text size unchanged.
The results are shown in Figs. 11 and 12. The TMWM algorithm is relatively little affected by the number of hits, and because the pattern strings are unchanged, the initialization time does not change. Overall, compared with the STRFINDUNI algorithm, the TMWM algorithm improves the initialization time by about 50% and the matching time by about 100%.
The above results were obtained from extensive experiments and analysis. Meanwhile, as hardware resources become cheaper and more complex regular expressions come into use, the AC algorithm will find more applications.