A Novel Approach to Detect Network Attacks Using G-HMM-Based Temporal Relations between Internet Protocol Packets

This paper introduces novel attack detection approaches for mobile and wireless network security that consider temporal relations between internet packets. We first present a field selection technique using a Genetic Algorithm (GA) and derive a Packet-based Mining Association Rule (PMAR) from the conventional Mining Association Rule for Support Vector Machine (SVM) learning in mobile and wireless network environments. Through preprocessing with PMAR, SVM inputs can account for temporal variation between packets in mobile and wireless networks. We then present a Gaussian observation Hidden Markov Model (G-HMM) to exploit the hidden relationships between packets based on probabilistic estimation; in our G-HMM approach, we also apply feature reduction for better initialization. We demonstrate the usefulness of our SVM and G-HMM approaches with GA on the MIT Lincoln Lab datasets and on a live dataset that we captured on a real mobile and wireless network. Experimental results are verified by an m-fold cross-validation test.


Introduction
The world-wide connectivity and growing importance of the internet have greatly increased the potential damage inflicted by attacks over the internet. One of the conventional methods for detecting such attacks uses attack signatures that reside in the attacking program. This method requires human effort to find and analyze attacks, write rules, and deploy them. The most serious disadvantage of these signature schemes is that it is difficult to detect unknown and new attacks. Anomaly detection algorithms instead use a model of normal behavior as the measure for detecting unexpected behaviors. Many anomaly detection methods based on machine learning algorithms have been researched to solve this problem of signature schemes. There are two categories of machine learning for detecting anomalies: supervised methods make use of preexisting knowledge, and unsupervised methods do not. Several efforts to design anomaly detection algorithms using supervised methods are described in [1-5]. The research of Anderson at SRI [1, 2] and of Cabrera et al. [3] deals with statistical methods for intrusion detection. Lee and Xiang's research [4] concerns theoretical measures for anomaly detection, and Ryan [5] uses artificial neural networks with supervised learning. In contrast, unsupervised schemes produce appropriate labels for a given dataset automatically. Anomaly detection methods with unsupervised features are explained in [6-10]. MINDS [6] is based on data mining and data clustering methods. The work of Eskin et al. [7] and Portnoy et al. [8] detects anomalous attacks without preexisting knowledge.
Staniford et al. [9] authored SPADE, an anomaly port-scan detector for Snort; SPADE uses a statistical anomaly detection method with Bayesian probability. Ramaswamy et al. [10] use outlier calculation with data mining.
However, even with good anomaly detection methods, it remains difficult to select proper features and to consider the relations among inputs in a given problem domain. Feature selection is fundamentally an optimization problem, and many successful feature selection algorithms have been devised. Among them, the genetic algorithm (GA) is known as one of the best randomized heuristic search algorithms for feature selection; it uses Darwin's concept of evolution to progressively search for better solutions [11, 12]. Moreover, in order to consider the relationships between packets, we first have to understand the characteristics of the given problem domain; then we can apply an appropriate method that can associate those characteristics, such as a mining association rule (MAR).
In this paper, we propose a feature selection method based on a genetic algorithm (GA) and two kinds of temporal machine learning algorithms to derive the relations between packets: a support vector machine (SVM) with a packet-based mining association rule (PMAR), and a Gaussian observation hidden Markov model (G-HMM). The PMAR method is a data preprocessing step that captures temporal relations between packets based on the mining association rule (MAR). An SVM is among the best training algorithms for learning classification from data [13]; the main idea of SVM is to derive a hyperplane that maximizes the separating margin between two classes. However, one serious disadvantage of SVM learning is that it is difficult to deal with consecutive variation of learning inputs without additional preprocessing, which is why we propose the PMAR method to improve SVM classification. The other approach uses G-HMM [14]. If we assume that internet traffic follows a continuous distribution such as a Gaussian distribution, the G-HMM, among the various HMMs, can be applied to estimate hidden packet sequences and to evaluate abnormal behaviors using Maximum Likelihood (ML). In addition, we concentrate on novel attack detection in TCP/IP traffic because TCP/IP accounts for about 95% of all internet traffic [15, 16]. Thus, the main contribution of this paper is a temporal sequence-based approach using G-HMM, evaluated in comparison with SVM methods. Using machine learning approaches such as GA, we verify the proposed approach on the MIT Lincoln Lab dataset.
The rest of this paper is organized as follows. In Section 2, our overall framework describes optimized feature selection using GA, data preprocessing using PMAR for SVMs and an HMM reduction method for G-HMM, training and testing with the SVM and G-HMM approaches, and verification with the m-fold validation method. In Section 3, the GA technique is described. In our genetic approach, we build our own evolutionary model in three evolutionary steps, and we detail the derivation of our evaluation equation. In Section 4, we present SVM learning approaches with PMAR. The SVM approaches cover both supervised learning with a soft margin, to classify nonseparable classes, and an unsupervised method with a one-class classifier; the PMAR-based SVM approaches can be applied to time series data. In Section 5, we present the G-HMM learning approach among the HMM models. In our G-HMM approach, the observation sequences of internet traffic are modeled with a Gaussian distribution among the many continuous distributions, and we use HMM feature reduction for data normalization during the preprocessing for G-HMM. In Sections 6 and 7, the experimental methods are explained along with the description of the datasets and parameter settings. In the experimental results section, we analyze the feature selection results, the comparison between SVMs and G-HMM, and the cross-validation results. In the last section, we conclude and give recommendations for future work.

Overall Framework
Figure 1 illustrates the overall framework of our machine learning approach, which considers temporal data relations of internet packets. This framework has four major components. The first component is offline field selection using GA: GA selects optimized packet fields through a natural evolutionary process, and the selected fields are then applied in real time to packets captured through a packet capture tool. The second component is data preprocessing, which refines the packets for high classification performance with PMAR and an HMM reduction method. PMAR is based on the mining association rule and extracts the relations between packets, while the HMM reduction method decreases the number of input features to prevent G-HMM from having a poor initialization. The third component plays the key role: it establishes temporal relations between packets based on SVM and G-HMM. In the SVM model, we use soft margin SVM as a supervised SVM and one-class SVM as an unsupervised SVM. Although soft margin SVM has relatively better performance, it needs labeled knowledge; one-class SVM, in contrast, can distinguish outliers without preexisting knowledge. In the HMM model, we use the G-HMM to estimate hidden temporal sequences between packets; it models the packet distribution of internet traffic as a Gaussian distribution and calculates ML to evaluate anomalous behaviors. Finally, our framework is verified by an m-fold cross-validation test, the standard technique used to estimate a method's performance over unseen data.
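To make the final verification step concrete, the following minimal Python sketch partitions sample indices into m folds, each serving once as the held-out test set. This is a generic illustration of m-fold cross-validation, not code from the paper; the round-robin assignment of indices to folds is our own choice.

```python
def m_fold_splits(n_samples, m):
    # Partition sample indices into m folds (round-robin assignment);
    # each fold serves exactly once as the test set.
    folds = [list(range(i, n_samples, m)) for i in range(m)]
    splits = []
    for k in range(m):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits
```

Each (train, test) pair can then be fed to any of the learners described below, and the m test-set scores averaged to estimate performance on unseen data.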

Field Selection Approach Using GA
GA is a model that mimics the evolutionary process in nature [11, 17]. It is an ideal technique for finding solutions to optimization problems. The GA uses three operators to produce the next generation from the current one: reproduction, crossover, and mutation. Reproduction determines which individuals are chosen for crossover and how many offspring each selected individual produces; the selection uses a probabilistic survival-of-the-fittest mechanism based on a problem-specific evaluation of the individuals. Crossover then generates new chromosomes within the population by exchanging parts of randomly selected pairs of existing chromosomes. Finally, mutation rarely and randomly alters existing chromosomes, so that new chromosomes may contain parts not found in any existing chromosome. This whole process is repeated probabilistically, moving from generation to generation, with the expectation that, at the end, we can choose an individual that closely matches our desired conditions. When the process terminates, the best chromosome in the final generation is the solution.
To apply the evolutionary process to our problem domain, we have to decide the following three steps: individual gene representation and initialization, evaluation function modeling, and the specific genetic operators and their parameters. In the first step, we transform TCP/IP packets into binary gene strings for the genetic algorithm. We convert each field of the TCP and IP headers into a one-bit binary gene value, "0" or "1", where "1" means that the corresponding field is selected and "0" means it is not. The initial population consists of a set of randomly generated 24-bit strings comprising 13 bits for IP fields and 11 bits for TCP fields. Additionally, the total number of individuals in the population should be chosen carefully, for the following reasons. If the population size is too small, all chromosomes will soon share the same gene string value, and the genetic model cannot generate new individuals. In contrast, if the population size is too large, the model needs more time to evaluate the gene strings, which delays the generation of new ones.
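The encoding above can be sketched in a few lines of Python. This is an illustration of the 24-bit representation (13 IP-field bits + 11 TCP-field bits) rather than the paper's implementation; the helper names and the seeded random generator are our own.

```python
import random

IP_FIELDS = 13   # one bit per IP header field
TCP_FIELDS = 11  # one bit per TCP header field
CHROM_LEN = IP_FIELDS + TCP_FIELDS  # 24-bit chromosome

def random_chromosome(rng):
    # "1" selects the corresponding header field, "0" omits it
    return [rng.randint(0, 1) for _ in range(CHROM_LEN)]

def init_population(size, seed=0):
    # size is the population-size parameter discussed in the text
    rng = random.Random(seed)
    return [random_chromosome(rng) for _ in range(size)]
```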
The second step is to construct our fitness function for evaluating individuals. The fitness function F(X) consists of an objective function f(X) and a transformation function g(f(X)):

F(X) = g(f(X)). (1)

In (1), the objective function's values are converted into a measure of relative fitness by the fitness function F(X) through the transformation function g(.). To define our objective function, we use the anomaly scores and communication scores shown in Table 1. The anomaly scores refer to the MIT Lincoln Lab datasets, covert channels, and other anomaly attacks [18-22]; a field's score increases in proportion to how frequently the field is used in anomaly attacks. Communication scores are divided into three kinds according to a field's importance during communication: "S" fields have static values, "De" fields have values dependent on connection status, and "Dy" fields have values that can change dynamically. From these considerations we derive a polynomial equation whose coefficients act as weights on the summed features. Our objective function f(X) consists of two polynomial functions A(X) and N(X), as shown in (2). In (2), A(X) is our anomaly scoring function and N(X) is our communication scoring function; X is a population, X_k(x_i) is the set of all individuals, k is the total number of populations, and x_i is an individual with 24 attributes. To prevent (2) from selecting too many features, a bias term mu is introduced as in (3), where mu is the bias term of the new objective function f(X_k(x_i)), with boundary 0 < mu < Max(f(X_k)). For A(X_k(x_i)), we can derive the corresponding equation as in (4), where A = {a_i, ..., a_2, a_1} is the set of coefficients of the polynomial equation and each coefficient represents an anomaly score. From (4), we use the bias term to satisfy condition (5); thus we can choose a reasonable number of features without overfitting, and we can derive the new anomaly scoring function (6) with the bias term mu_A. For N(X_k(x_i)), we develop an analogous function with the same derivation as in (4), in (7), where N is the set of communication scores and the coefficients alpha, beta, gamma are the weights of static (S), dependent (De), and dynamic (Dy) fields, respectively, as shown in Table 1. From (7), we apply the bias term in the same way as in (5) and (6) to obtain (8), where x_alpha, x_beta, x_gamma are the sets of elements with coefficients alpha, beta, gamma, respectively. From (6) and (9), we can derive our entire objective equation (10).

While the relative fitness is calculated using the proposed objective function (10), the fitness function F(X_k) of (1) is based on a rank operation. Rank-based operation overcomes the scaling problems of proportional fitness assignment: the reproductive range is limited, so that no individual generates an excessive number of offspring, and the ranking method introduces uniform scaling across the population.

The last step of genetic modeling is to decide on the specific genetic operators and their related parameters. For the reproduction operator, a roulette wheel method is used. Each individual has its own selection probability on the roulette wheel, which contains one sector per member of the population, proportional to the selection probability P_sel(i). A high selection probability means that more of that individual's gene strings are inherited by the next generation. For crossover, the single-point crossover method is used: there is just one crossover point, and the binary string from the beginning of the chromosome to the crossover point is copied from the first parent while the rest is copied from the second parent. If the crossover probability is very low, convergence to an optimized solution is hindered; conversely, if the probability is too high, gene exchange occurs so frequently that the best solution may be destroyed. For mutation, we use a general discrete mutation operator. If the mutation probability is too small, new characteristics are accepted too late; if it is too high, new mutated generations lose their close relationship with the former generation. In Section 7, we present preliminary tests to determine the best parameters for our problem domain.
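The three operators described above can be sketched as follows. This is a generic illustration, assuming list-of-bits chromosomes as in the earlier encoding; the function names and parameter defaults are our own, not the paper's.

```python
import random

def roulette_select(population, fitnesses, rng):
    # Roulette wheel: selection probability proportional to fitness
    total = sum(fitnesses)
    pick, acc = rng.uniform(0, total), 0.0
    for chrom, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return chrom
    return population[-1]

def single_point_crossover(p1, p2, rng):
    # One crossover point: head copied from p1, tail from p2
    point = rng.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:]

def mutate(chrom, p_mut, rng):
    # Discrete mutation: flip each bit with (usually small) probability p_mut
    return [1 - g if rng.random() < p_mut else g for g in chrom]
```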

SVM Learning Approach Using PMAR
SVM is a type of pattern classifier based on statistical learning theory, applicable to classification and regression with a variety of kernel functions [13, 23-26]. SVM has been successfully applied to a number of pattern recognition applications [27], and recently also to information security for intrusion detection [28-30]. SVM is known to be useful for finding a global minimum of the actual risk using structural risk minimization, since it generalizes well even in high-dimensional spaces under small training sample conditions thanks to kernel tricks. SVM can select appropriate setup parameters because it does not depend on the empirical risk the way traditional neural networks do. In our SVM learning models, we use two kinds of SVM approaches: soft margin SVM, which is supervised, and one-class SVM, which is unsupervised. Moreover, we propose the PMAR technique as preprocessing for the SVM inputs. We supplement SVM learning with PMAR because it can reflect temporal associations between packets.

Packet-Based Mining Association Rule (PMAR) for SVM Learning

To determine the anomalous characteristics of internet traffic, it is very important not only to consider the attributes of a packet's contents but also to grasp the relations between consecutive packets. If we can extract relations from packets, this knowledge can strongly influence the performance of SVM learning, since SVM by itself does not consider the ordering of its input sequences. In this section we use PMAR to preprocess filtered packets before they are learned. We propose our data preprocessing method, called PMAR, based on MAR, which has proved a highly successful technique for extracting useful information from very large databases. A formal statement of the association rule problem is as follows [31, 32].

Definition 1. Let I = {I_1, I_2, ..., I_m} be a set of m distinct attributes, also called literals. Let D be a database, where each record (tuple) T has a unique identifier and contains a set of items such that T ⊆ I. An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I are sets of items called itemsets and X ∩ Y = ∅. Here, X is called the antecedent and Y the consequent.
Definition 2. The support (s) of an association rule is the ratio (in percent) of the records that contain X ∪ Y to the total number of records in the database.

Definition 3. For a given number of records, the confidence (α) is the ratio (in percent) of the number of records that contain X ∪ Y to the number of records that contain X.
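Definitions 2 and 3 translate directly into code. The sketch below computes support and confidence as fractions rather than percentages; the record representation (sets of items) follows Definition 1, while the function names are our own.

```python
def support(db, itemset):
    # Fraction of records that contain every item of the itemset (X ∪ Y)
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(db, x, y):
    # Records containing X ∪ Y, relative to records containing X
    return support(db, x | y) / support(db, x)
```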
PMAR is a rule to find the relations between packets in internet traffic using MAR. Assume that PMAR has an association unit of a fixed size. If the fixed size is too long, the rule can aggregate packets that have no specific relation; if it is too short, the rule can fragment packets that share the same relation. A variable-size association unit, however, makes it difficult to decide on a proper size. Therefore, we focus on a fixed-length association unit based on the network flow. We build our network model to derive PMAR and calculate a minimum support rate as in (11), where P_i is a packet, {a_1, ..., a_n} is the attribute set of P_i, R_j is a set of packets P_i, and C_k is a connection flow. From (11), we can derive formulations (12)-(14). In condition (12), N is the number of common attributes and Pattr(P_i | P_k) is the number of common attributes between two packets. In definition (13), Rattr(P_i) is the set of R_j elements that satisfy (12) when P_i is compared with all P_k in R_j. If an R_j in C_k satisfies (14), we say that R_j is associated with C_k. Finally, from the mining association rule definitions [31, 32] and our proposed functions (12)-(14), we can derive our minimum support rate as in (15). If a connection flow does not satisfy this minimum support rate, the connection flow is dropped, because that means the flow consists of unrelated packets or heavily fragmented packets without a specific relation.
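Since the bodies of (11)-(15) are given only in outline here, the following Python sketch illustrates the idea rather than the exact formulas: packets are tuples of field values, the common-attribute ratio plays the role of Pattr/N, and the thresholds are placeholder assumptions of ours, not the paper's minimum support rate.

```python
def attr_match_ratio(p1, p2):
    # Fraction of attributes with equal values in two packets
    # (stand-in for Pattr(P_i | P_k) normalized by N)
    shared = sum(1 for a, b in zip(p1, p2) if a == b)
    return shared / len(p1)

def flow_is_associated(flow, match_threshold=0.5, min_support=0.6):
    # A flow (list of packets) is kept when the fraction of packets related
    # to at least one other packet reaches min_support; otherwise the flow
    # is dropped as unrelated or heavily fragmented.
    related = 0
    for i, p in enumerate(flow):
        if any(j != i and attr_match_ratio(p, q) >= match_threshold
               for j, q in enumerate(flow)):
            related += 1
    return related / len(flow) >= min_support
```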
Soft Margin SVM: Supervised SVM

Given a linearly separable training set, an SVM seeks a separating hyperplane f(x) = w^T x + b = 0, where w is an adjustable weight vector, x_i is an input vector, and b is the bias term. Equivalently, the labels satisfy y_i(w^T x_i + b) ≥ 1 for all training points; in this case, we say the set is linearly separable. In Figure 2, the distance between the hyperplane and f(x) = ±1 is 1/‖w‖, and the margin of the separating hyperplane is therefore 2/‖w‖. The learning problem is hence reformulated as minimizing ‖w‖² = w^T w subject to the constraints of linear separation as in (18). This is equivalent to maximizing the distance of the hyperplane between the two classes; the training vectors that lie on the margin are the support vectors. The optimization is now a convex quadratic programming problem:

minimize Φ(w) = (1/2)‖w‖² subject to y_i(w^T x_i + b) ≥ 1, i = 1, ..., l. (19)

This problem has a global optimum because Φ(w) = (1/2)‖w‖² is convex in w and the constraints are linear in w and b. A further advantage is that the parameters of the quadratic program (QP) affect only the training time, not the quality of the solution. This problem is tractable, but anomalies in internet traffic are nonlinear in character and thus more difficult to classify. In order to handle such nonseparable and nonlinear cases, it is useful to consider the dual problem, outlined as follows. The Lagrangian for this problem is

L(w, b, Λ) = (1/2)‖w‖² − Σ_{i=1..l} λ_i [y_i(w^T x_i + b) − 1], (20)

where Λ = (λ_1, ..., λ_l)^T are the Lagrange multipliers, one for each data point. The solution to this quadratic programming problem is given by maximizing L with respect to Λ ≥ 0 and minimizing it with respect to w and b. Note that the Lagrange multipliers are nonzero only when y_i(w^T x_i + b) = 1; the vectors for which this holds are called support vectors, since they lie closest to the separating hyperplane. In the nonseparable case, however, forcing zero training error leads to poor generalization.
To account for the fact that some data points may be misclassified, we introduce the soft margin SVM, using a vector of slack variables Ξ = (ξ_1, ..., ξ_l)^T that measure the amount of violation of the constraints:

minimize Φ(w, Ξ) = (1/2)‖w‖² + C Σ_{i=1..l} ξ_i subject to y_i(w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, (21)

where C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error. If C is too small, insufficient stress is placed on fitting the training data; if C is too large, the algorithm will overfit the dataset.
In practice, a typical SVM approach such as the soft margin SVM has shown excellent performance more often than other machine learning methods [26, 33]. In intrusion detection applications, supervised machine learning approaches based on SVM have been superior to approaches using artificial neural networks [30, 33, 34]. The high classification capability and processing performance of the soft margin SVM approach is therefore useful for anomaly detection. However, because soft margin SVM is a supervised learning approach, the given dataset must be labeled.
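The role of C can be illustrated with a minimal linear soft-margin trainer. The paper uses SVMlight/LibSVM QP solvers; the sketch below instead minimizes (1/2)‖w‖² + C·(hinge loss) by stochastic subgradient descent, which is a different (simpler) optimizer for the same objective, and the learning-rate and epoch settings are our own assumptions.

```python
def train_soft_margin_svm(data, labels, C=1.0, lr=0.01, epochs=300):
    # data: list of feature tuples; labels in {-1, +1}
    dim = len(data[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # subgradient of (1/2)||w||^2 + C * max(0, 1 - margin)
                w = [wi - lr * (wi - C * y * xi) for wi, xi in zip(w, x)]
                b += lr * C * y
            else:
                # only the regularizer is active: shrink w
                w = [wi - lr * wi for wi in w]
    return w, b
```

Increasing C weights the hinge-loss term more heavily, matching the text: small C underfits the training data, large C overfits it.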

One-Class SVM: Unsupervised SVM

SVM algorithms can also be adapted into an unsupervised learning algorithm called one-class SVM, which identifies outliers among positive examples and uses them as negative examples [24]. In anomaly detection, if we regard anomalies as outliers, the one-class SVM approach can be applied to classify anomalous packets as outliers.
Figure 3 shows the relation between the hyperplane of a one-class SVM and the outliers. Suppose that a dataset has a probability distribution P in the feature space and that we want to estimate a subset S of the feature space such that the probability that a test point drawn from P lies outside of S is bounded by some a priori specified value ν ∈ (0, 1). The solution of this problem is obtained by estimating a function f that is positive (taking the value +1) in a small region where most of the data lies and −1 elsewhere.
The main idea is that the algorithm maps the data into a feature space H using an appropriate kernel function and then attempts to find the hyperplane that separates the mapped vectors from the origin with maximum margin.
Given a training dataset (x_1, y_1), ..., (x_l, y_l) ∈ ℝ^N × {±1}, let Φ : ℝ^N → H be a kernel map which transforms the training examples into the feature space H. Then, to separate the dataset from the origin, we need to solve the following quadratic programming problem:

minimize (1/2)‖w‖² + (1/(νl)) Σ_{i=1..l} ξ_i − ρ subject to (w · Φ(x_i)) ≥ ρ − ξ_i, ξ_i ≥ 0, (22)

where ν is a parameter that controls the trade-off between maximizing the distance from the origin and containing most of the data in the region related to the hyperplane, and corresponds to the ratio of outliers in the training set. The decision function f(x) = sgn((w · Φ(x)) − ρ) will then be positive for most examples x_i contained in the training set. In practice, even though one-class SVM is capable of outlier detection, this approach is more sensitive to the given dataset than other machine learning schemes [24, 34]; deciding on an appropriate hyperplane for classifying outliers is more difficult than in a supervised SVM approach.
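As a rough illustration of the ν-formulation (not the paper's kernelized solver), the sketch below trains a linear one-class separator by full-batch subgradient descent on the objective in (22) with Φ the identity map; the learning rate, epoch count, and initialization are our own assumptions.

```python
def train_one_class_svm(data, nu=0.1, lr=0.01, epochs=500):
    # Linear one-class SVM: minimize (1/2)||w||^2
    # + (1/(nu*l)) * sum(max(0, rho - w.x_i)) - rho
    dim, l = len(data[0]), len(data)
    w, rho = [0.01] * dim, 0.0
    for _ in range(epochs):
        gw, grho = list(w), -1.0         # gradient of regularizer and of -rho
        for x in data:
            if sum(wi * xi for wi, xi in zip(w, x)) < rho:
                for i in range(dim):     # hinge active: point inside margin
                    gw[i] -= x[i] / (nu * l)
                grho += 1.0 / (nu * l)
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        rho -= lr * grho
    return w, rho
```

Points with w·x − ρ < 0 are flagged as outliers; at convergence, roughly a fraction ν of the training data ends up on the outlier side, matching the interpretation of ν in the text.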

G-HMM Learning Approach
Although the above-mentioned PMAR capability is given to SVM learning, this does not always mean that the inferred relations are reasonable. Therefore, we need to estimate more realistic associations from internet traffic. Among the various HMM learning approaches, we use G-HMM because it models observation outputs with a continuous Gaussian probability distribution. Our G-HMM approach builds a normal behavior model to estimate the hidden temporal relations of packets and evaluates anomalous behavior by calculating ML. Moreover, the G-HMM model can become singular when its covariance matrix is calculated; thus we also need a better initialization, achieved by decreasing the number of features during the G-HMM data preprocessing.

G-HMM Feature Reduction.
In G-HMM learning, a mixture of Gaussians can be written as a weighted sum of Gaussian densities. The observations of each state are described by the mean value μ_i and the covariance Σ_i of a Gaussian density. The covariance matrix Σ_i is calculated from the given input sequences. When we estimate the covariance matrix, it can often become singular, depending on the characteristics of the given sequences; this happens when the data values are too small or when too few points are assigned to a cluster center due to a bad initialization of the means. In internet traffic, this problem can also occur because each field varies widely. There are several solutions: constraining the covariance to be spherical or diagonal, adjusting the prior, or obtaining a better initialization through feature reduction. Among these, we apply feature reduction for a better initialization in our G-HMM learning. By reducing the number of features, G-HMM obtains a more stable initialization that prevents a singular covariance matrix.
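One simple way to realize such a reduction, shown below as an illustration (the paper does not specify its exact criterion), is to drop near-constant columns, since a feature with (almost) zero variance contributes a zero row and column to the covariance matrix and makes it singular. The variance threshold is an assumption of ours.

```python
def reduce_features(rows, min_var=1e-6):
    # Drop near-constant feature columns that would make the
    # covariance matrix (near-)singular; return reduced rows and
    # the indices of the kept columns.
    dim, n = len(rows[0]), len(rows)
    keep = []
    for j in range(dim):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var > min_var:
            keep.append(j)
    return [[r[j] for j in keep] for r in rows], keep
```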

Gaussian Observation Hidden Markov Model (G-HMM).
HMM is one of the most popular means of classification for temporal sequence data [31, 32]. It is a statistical model with a finite set of states, each of which is associated with a probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state, an observation can be generated according to the associated probability distribution. Only the outcome, not the state, is visible to an external observer; the states are therefore hidden from the outside. Formally, an HMM consists of the following parts: (i) T = length of the observation sequence, (ii) N = number of states of the HMM, (iii) M = number of observation symbols, (iv) Q = {q_1, ..., q_N}: the states, (v) V = {v_1, ..., v_M}: the discrete set of possible observation symbols.
If we denote the HMM model by λ, it is described as λ = (A, B, π) using the above characteristic parameters, as shown in (23), where A is the probability distribution of state transitions, B is the probability distribution of observation symbols, and π is the initial state distribution. An HMM is called discrete or continuous according to how the observable sequences are modeled; formulation (23) is suited to an HMM with discrete observation events. However, we assume that the observable sequences of internet traffic approximate continuous distributions. A continuous HMM has the advantages of using small input data as well as describing a Gaussian-distributed model. If our observable sequences have a Gaussian distribution, then the output probability of an emitting state i is b_i(o_t) = N(o_t; μ_i, Σ_i), where N(·) is a Gaussian pdf with mean vector μ_i and covariance Σ_i, evaluated at o_t, and M is the dimensionality of the observed data o. In order to build an appropriate G-HMM model for learning and evaluation, we use the well-known HMM problems from [14, 35] as follows.

Problem 1. Given the observation sequence O = (o_1, ..., o_T) and the model λ, how do we compute the probability Pr(O | λ) of the observation sequence?

Problem 2. Given the observation sequence O = (o_1, ..., o_T) and the model, how do we choose a corresponding state sequence q = (q_1, ..., q_T) that is optimal in some sense?

Problem 3. Given the observation sequences, how can the HMM be trained, that is, how can the model parameters be adjusted to increase the probability of the observation sequences?
To determine the initial HMM model parameters, we address the third problem using the Forward-Backward algorithm [35]. The first problem corresponds to finding the probability of the given observation sequences. In our scheme, Maximum Likelihood (ML) is applied to the HMM learning model via the Baum-Welch method: the HMM learning process repeatedly applies the Baum-Welch algorithm to the given sequences, and ML is then used to evaluate whether a given sequence represents normal behavior or not.
As mentioned for the third problem, to decide on the parameters of an initial HMM model we consider the forward variable α_t(i) = Pr(o_1 o_2 ... o_t, q_t = S_i | λ). This value denotes the probability that the partial sequence o_1, ..., o_t is observed and the state at time t is S_i, given the model λ. It can be computed inductively by the forward procedure: initially α_1(i) = π_i b_i(o_1), with induction α_{t+1}(j) = [Σ_{i=1..N} α_t(i) a_ij] b_j(o_{t+1}) (25). Similarly, we can consider the backward variable β_t(i) = Pr(o_{t+1}, ..., o_T | q_t = S_i, λ), computed with β_T(i) = 1 and β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j) (26). Thus, we can build the initial HMM model using (25) and (26).
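The forward procedure can be sketched directly for scalar Gaussian observations. This is a generic illustration of the recursion, assuming one-dimensional observations with per-state mean and variance; a practical implementation would work in log space to avoid underflow on long sequences.

```python
import math

def gaussian_pdf(x, mean, var):
    # N(x; mean, var), the per-state Gaussian output density b_i(o_t)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def forward_likelihood(obs, pi, A, means, variances):
    # Forward procedure: returns Pr(O | lambda) for a Gaussian-observation HMM.
    N = len(pi)
    # initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * gaussian_pdf(obs[0], means[i], variances[i]) for i in range(N)]
    # induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for t in range(1, len(obs)):
        alpha = [gaussian_pdf(obs[t], means[j], variances[j]) *
                 sum(alpha[i] * A[i][j] for i in range(N))
                 for j in range(N)]
    return sum(alpha)
```

A sequence drawn near a state's mean receives a much higher likelihood than one far from all means, which is exactly the quantity the ML-based anomaly test thresholds.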
After deciding on the initial HMM model with the Forward-Backward algorithm, we can evaluate abnormal behavior by calculating the ML value. Given two different probability functions, the value of λ can be used as our estimate of the model causing a given observation o to occur; this value is obtained by the ML procedure λ_ML(o). In this procedure, we maximize the probability of a given sequence of observations O = {o_1, ..., o_T}, given the HMM λ and its parameters. This probability is the total likelihood (L_tot) of the observations. Consider the joint probability of the observations and a state sequence for a given model λ, as in (27). To get the total probability of the observations, we sum across all possible state sequences, as in (28). When we maximize the probability Pr(O | λ), we need to adjust the initial HMM model parameters. However, there is no known way to solve for λ = (A, B, π) analytically; thus, we determine the parameters using the Baum-Welch method, an iterative procedure providing local maximization. Let ξ_t(i, j) denote the probability of being in state i at time t and in state j at time t + 1, given the model and the observations. Also, let γ_t(i) be the probability of being in state i at time t, given the entire observation sequence and the model; this is related to ξ_t(i, j) by γ_t(i) = Σ_{j=1..N} ξ_t(i, j). Summing γ_t(i) over the time index t gives the expected number of times that state i is visited, or equivalently the expected number of transitions made from state i; likewise, summing ξ_t(i, j) over t gives the expected number of transitions from state i to state j. Using these expected event counts, we can reestimate the parameters of the new HMM λ̄ = (Ā, B̄, π̄):

ā_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i),

b̄_j(v_k) = (expected number of times in state j observing symbol v_k) / (expected number of times in state j).
Hence, once internet traffic sequences are given after the initial parameter setup by the Forward-Backward algorithm, updating the HMM parameters according to the given sequences is HMM learning of the new model λ̄ = (Ā, B̄, π̄); calculating an ML value for a specific internet traffic sequence, deriving L_tot, is the process of G-HMM testing.

Experiment Datasets and Parameters
The 1999 DARPA IDS dataset was collected at MIT Lincoln Lab to evaluate intrusion detection systems; it contains a wide variety of intrusions simulated in a military network environment [20]. Entire internet packets, including full payloads, were recorded in tcpdump [36] format and provided for evaluation. The data consists of three weeks of training data and two weeks of test data. Among these datasets, we used the attack-free training data for normal behavior modeling, and the attack data for constructing the anomaly scores in Table 1. Moreover, for the additional learning procedure and anomaly modeling, we generated a variety of anomaly attack data such as covert channels, malformed packets, and some DoS attacks. The simulated attacks, comprising both DARPA attacks and generated attacks, fall into the following five categories: (i) Denial of Service: Apache2, arppoison, Back, Crashiis, DoSNuke, Land, Mailbomb, SYN Flood, Smurf, sshprocesstable, Syslogd, tcpreset, Teardrop, Udpstorm, ICMP flood, Teardrop attacks, Peer-to-peer attacks, Permanent denial-of-service attacks, Application-level floods, Nuke, Distributed attack, Reflected attack, Degradation-of-service attacks, Unintentional denial of service, Denial-of-Service Level II, Blind denial of service; (ii) Scanning: insidesniffer, Ipsweep, Mscan, Nmap, queso, resetscan, satan, saint; (iii) Covert Channel: ICMP covert channel, HTTP covert channel, IP ID covert channel, TCP SEQ and ACK covert channel, DNS tunnel; (iv) Remote Attacks: Dictionary, Ftpwrite, Guest, Imap, Named, ncftp, netbus, netcat, Phf, ppmacro, Sendmail, sshtrojan, Xlock, Xsnoop; (v) Forged Packets: Targa3.
In this experiment, we used soft margin SVM as a general supervised learning algorithm, one-class SVM as an unsupervised learning algorithm, and G-HMM. In order to make the dataset more realistic, we arranged the attacks so that the resulting dataset consisted of 1 to 1.5% attacks and 98.5 to 99% normal objects. For soft margin SVM, we constructed the learning dataset from the data described above. This dataset had 100,000 normal packets and 1,000 to 1,500 abnormal packets for training and evaluation, respectively.
For the unsupervised learning algorithms, one-class SVM and G-HMM, the dataset consisted of 100,000 normal packets for training and 1,000 to 1,500 packets of various kinds for evaluation. In other words, in the case of one-class SVM, the training dataset contained only normal traffic, because one-class SVM can learn from unlabeled data.
In the case of G-HMM, a normal behavior model was built from the normal data, and the ML values of the test dataset were then calculated against that model.
The combined dataset of normal and abnormal traffic was then tested. SVM offers a variety of kernel functions with their own parameters, and we also had to choose a regularization parameter C. The kernel function transforms a given set of vectors into a possibly higher-dimensional space for linear separation. For SVM learning, the value of C ranged from 0.9 to 10, d in the polynomial kernel was 1, σ in the radial basis kernel was 0.0001, and κ and θ in the sigmoid kernel were 0.00001 each. The SVM kernel functions that we considered were the linear kernel K(x, y) = x · y, the polynomial kernel K(x, y) = (x · y + 1)^d, the radial basis kernel K(x, y) = exp(−σ‖x − y‖²), and the sigmoid kernel K(x, y) = tanh(κ(x · y) + θ). For the G-HMM learning algorithm, the input data was presented as an N × p data matrix, where N was the number of inputs and p the length of each input. The number of states could be adjusted; in this experiment, the default number of states was 2, and we also used 4 and 6 states. The maximum number of Baum-Welch cycles was 100. In our experiments we used the SVMlight, Libsvm, and HMM tools [37][38][39].
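For reference, the four kernels with the parameter values quoted above translate directly into code. This is a minimal sketch using standard kernel definitions; the exp(−σ‖x − y‖²) convention for the radial basis kernel is an assumption (conventions differ across tools), and all function names are illustrative.

```python
import math

def linear(x, y):
    # Linear kernel: K(x, y) = x . y
    return sum(a * b for a, b in zip(x, y))

def polynomial(x, y, d=1):
    # Polynomial kernel with degree d (d = 1 in the experiment).
    return (linear(x, y) + 1) ** d

def rbf(x, y, sigma=0.0001):
    # Radial basis kernel; sigma = 0.0001 as in the experiment.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * sq_dist)

def sigmoid(x, y, kappa=0.00001, theta=0.00001):
    # Sigmoid kernel with parameters kappa and theta.
    return math.tanh(kappa * linear(x, y) + theta)
```

Any of these can be plugged into an SVM solver that accepts a precomputed kernel matrix.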

Experimental Results and Analysis
In this section we present the full results of our proposed approaches. To evaluate them, we used three performance indicators common in intrusion detection research.
The correction rate is defined as the number of correctly classified normal and abnormal packets divided by the total size of the test data.The false positive rate is defined as the total number of normal data that were incorrectly classified as attacks divided by the total number of normal data.The false negative rate is defined as the total number of attack data that were incorrectly classified as normal traffic divided by the total number of attack data.
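These three definitions translate directly into code; a minimal sketch computing them from confusion-matrix counts (all names illustrative):

```python
def detection_rates(tp, fp, tn, fn):
    """tp: attacks classified as attacks; fp: normal classified as attack;
    tn: normal classified as normal; fn: attacks classified as normal."""
    # Correctly classified packets (normal and abnormal) over the whole test set.
    correction = (tp + tn) / (tp + fp + tn + fn)
    # Normal data incorrectly flagged as attacks, over all normal data.
    false_positive = fp / (fp + tn)
    # Attack data incorrectly passed as normal, over all attack data.
    false_negative = fn / (tp + fn)
    return correction, false_positive, false_negative
```

For example, with 1,000 normal and 100 attack packets in the test set, 50 false alarms and 10 missed attacks yield a correction rate of about 94.5%, a 5% false positive rate, and a 10% false negative rate.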

Field Selection Results
We now discuss field selection using GA. In order to find reasonable genetic parameters, we ran preliminary tests using the typical values mentioned in the literature [11]. Figure 4 shows four graphs of GA feature selection with the fitness function (10) under the parameters in Table 2. In Case no. 1 and Case no. 2, the resulting graphs converge too rapidly because of a too-low reproduction rate and too-high crossover and mutation rates, respectively. In Case no. 3, the graph stays nearly constant because of a too-high reproduction rate. Finally, the graph of Case no. 4 converges with appropriate values. The detailed results of Case no. 4 are described in Table 3.
Although the preliminary tests gave us an appropriate GA configuration for our problem domain, we also sought the best generation among all generations. Using c-SVM learning, we found that the final generations were well optimized: generations 91-100 showed the best correction rate and relatively fast processing time. Moreover, comparing generations 16-30 with generations 46-60 shows that fewer fields do not always guarantee faster processing, because the processing time also depends on the values of the fields.
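The GA field-selection loop described above can be sketched as follows. The fitness function here is a placeholder (the paper's fitness function (10) is not reproduced), and the rates and field count are illustrative values, not the Table 2 parameters; a real run would score each field mask by the detection performance of an SVM trained on those fields.

```python
import random

FIELDS = 24          # number of candidate TCP/IP header fields (illustrative)
POP, GENS = 20, 100  # population size and generation count (illustrative)
P_REPRO, P_CROSS, P_MUT = 0.6, 0.7, 0.01  # placeholder rates, not Table 2 values

def fitness(mask):
    # Placeholder fitness: reward masks that keep few fields.
    return FIELDS - sum(mask)

def evolve(rng=random.Random(0)):
    # Each chromosome is a bitmask selecting header fields.
    pop = [[rng.randint(0, 1) for _ in range(FIELDS)] for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: int(POP * P_REPRO)]       # reproduction: keep the fittest
        children = []
        while len(parents) + len(children) < POP:
            a, b = rng.sample(parents, 2)
            if rng.random() < P_CROSS:            # one-point crossover
                cut = rng.randrange(1, FIELDS)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            for i in range(FIELDS):               # bit-flip mutation
                if rng.random() < P_MUT:
                    child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

With too high a reproduction rate the population stagnates, and with too high crossover/mutation rates it degenerates toward a random walk, matching the behavior of the preliminary test cases.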

SVM Results.
In this analysis, two SVMs were tested: soft margin SVM as a supervised method and one-class SVM as an unsupervised method. The results are summarized in Table 4. Each SVM approach was tested with four different SVM kernel functions. The high performance of soft margin SVM is not surprising, since it uses labeled knowledge. The four SVM kernels showed similar performance in the soft margin SVM experiments.
In the case of one-class SVM, the RBF kernel provided the best performance (94.65%). However, the false positive rate was high, as anticipated. We could not obtain a result for the sigmoid kernel because it overfitted. Moreover, in the one-class SVM experiments, the results were very sensitive to the choice of kernel. In these experiments, a PMAR value of 7 and a support rate of 0.33 were used.

One-Class SVM versus G-HMM Results.
Even though the false rates of one-class SVM are high, one-class SVM showed a correction rate similar to soft margin SVM without needing preexisting knowledge. Thus, in this experiment, one-class SVM with PMAR was compared with G-HMM. The inputs of the one-class SVM were preprocessed using PMAR with two unit sizes (5 and 7) and a minimum support rate of 0.33. G-HMM was trained with three numbers of states (2, 4, and 6). In the data preprocessing for G-HMM, we used feature reduction to prevent the covariance matrix from becoming singular. Consider the number of features: the total size of the TCP and IP headers is 48 bytes (384 bits), with each option field assumed to be 4 bytes long, and the smallest field of the TCP and IP headers is 3 bits, so the number of features can be at most 128 (384/3). Our feature reduction converts two bytes into one G-HMM feature; if a field is longer than two bytes, it is divided into two-byte chunks, each converted into one feature. Thus, the total number of features ranges between 20 and 128.
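The two-bytes-per-feature reduction can be sketched as below. The exact packing scheme (big-endian pairing, zero padding of odd-length fields) is an assumption, since the paper only states that two bytes are converted into one feature.

```python
def reduce_features(header_bytes):
    """Map raw TCP/IP header bytes to G-HMM features, two bytes per feature.
    Fields longer than two bytes are handled implicitly, since they are
    consumed as consecutive two-byte chunks. A 40-byte header yields
    20 features, keeping the emission covariance matrix well conditioned."""
    if len(header_bytes) % 2:
        header_bytes = header_bytes + b"\x00"   # pad odd-length input
    return [(header_bytes[i] << 8) | header_bytes[i + 1]
            for i in range(0, len(header_bytes), 2)]
```

Fewer, wider features give better-distributed observation values than one feature per 3-bit field, which is what prevents the singular covariance matrices during G-HMM learning.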
The results in Table 5 show that the larger the PMAR unit size, the better the one-class SVM performed. In contrast, the smaller the number of G-HMM states, the better the correction rate. Although G-HMM showed better performance in estimating hidden temporal sequences, its false alarm rate was too high. In this comparison, the probabilistic sequence estimation of G-HMM was superior to one-class SVM with PMAR in correction rate; however, one-class SVM provided more stable correction and false positive rates.

Cross-Validation Tests
The cross-validation test was performed using the 3-fold cross-validation method on 3,000 normal packets divided into 3 subsets, and the holdout method [40] was repeated 3 times. Specifically, we used one-class SVM with PMAR size 7, because this scheme showed the most reasonable performance among our proposed approaches. Each time we ran a test, one of the 3 subsets was used as the training set, and all subsets were put together to form the test set. The results, shown in Table 6, indicate that our method depends on which training set is used. In our experiments, training with validation set no. 1 showed the best correction rate across all three cross-validation tests and a low false positive rate; in other words, validation set no. 1 had well-organized normal features. In particular, training on validation set no. 1 and testing on validation set no. 3 showed the best correction rate. Even though all validation sets were attack-free datasets from MIT Lincoln Lab, there were many differences between them. In fact, this validation test depends closely on how well the collected learning sets cover a wide variety of normal and abnormal features.
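The split scheme described above, training on one subset while testing on the union of all subsets, can be sketched as follows (names illustrative):

```python
def holdout_splits(packets, k=3):
    """Divide the dataset into k subsets. Each round uses one subset for
    training and, as in the validation scheme described above, pools all
    subsets to form the test set."""
    folds = [packets[i::k] for i in range(k)]   # round-robin partition
    for i in range(k):
        train = folds[i]
        test = [p for fold in folds for p in fold]
        yield train, test
```

With 3,000 packets and k = 3, each round trains on 1,000 packets and tests on all 3,000, so differences between rounds reflect how representative each training subset is.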

Conclusion
The overall goal of our temporal-relation-based machine learning approaches is to provide a general framework for detecting and classifying novel attacks in internet traffic. We designed four major components: field selection using GA, data preprocessing using PMAR and the HMM reduction method, machine learning using SVMs and G-HMM, and verification using m-fold cross-validation. In the first part, an optimized generation of field selection with GA had relatively fast processing time and a better correction rate than the other generations. In the second part, we proposed PMAR and the HMM reduction method for data preprocessing: PMAR supports temporal variation between learning inputs in the SVM approaches, and the HMM reduction method produces better-distributed HMM sequences, preventing singular matrices during HMM learning. In the third part, our key machine learning approaches were proposed. One was to use two different SVM approaches to provide supervised and unsupervised learning separately; in the comparison between SVMs, one-class SVM, which works with unlabeled data, showed a correction rate similar to soft margin SVM. The other approach was to estimate hidden relations in internet traffic using G-HMM, which proved to be one of the best solutions for estimating hidden sequences between packets. However, its false alarm rate was too high for it to be applied to the real world.
In conclusion, when we considered temporal sequences of SVM inputs with PMAR, the one-class SVM approach had better results than the G-HMM approach. Moreover, our one-class SVM experiment was verified by m-fold cross-validation. Future work will involve finding ways to decrease the false positive rates of one-class SVM and G-HMM, considering more realistic packet associations such as more elaborate flow generation beyond PMAR and G-HMM, and applying this framework to real-world TCP/IP traffic.

Figure 1 :
Figure 1: The overall structure of our proposed approach.

Figure 3 :
Figure 3: One-class SVM; the origin is treated as the only member of the second class.

Figure 4 :
Figure 4: Evolutionary process according to preliminary test parameters.

Table 1 :
TCP/IP anomaly and communication score.
Table 2 lists the parameters of the four preliminary tests.

Table 2 :
Preliminary test parameters of GA.

Table 3 :
GA field selection results of preliminary test no. 4. CR: correction rate, FP: false positive, FN: false negative, PT: processing time. *

Table 4 :
The overall experiment results of SVMs.

Table 5 :
The overall experiment results of one-class SVM versus G-HMM.