Sequence-order-independent network profiling for detecting application layer DDoS attacks

Distributed denial of service (DDoS) attacks, which are a major threat on the Internet, have recently become more sophisticated as a result of their ability to exploit application-layer vulnerabilities. Most defense methods are designed for detecting DDoS attacks on IP and TCP layers and consequently have difficulty in detecting this new type of DDoS attack. With the profiling of web browsing behavior, the sequence order of web page requests can be used for detecting the application-layer DDoS (App-DDoS) attacks. However, the sequence order may be more harmful than helpful in the profiling of web browsing behaviors because it varies significantly for different individuals and different browsing behaviors. This article introduces a sequence-order-independent method for the profiling of network traffic and the detection of a new type of App-DDoS attacks. Four attributes are extracted from web page request sequences without consideration of the sequence order of requested pages. A model based on the multiple principal component analysis is proposed for the profiling of normal web browsing behaviors, and its reconstruction error is used as a criterion for detecting DDoS attacks. The proposed method is experimentally confirmed with various types of new App-DDoS attacks.


Introduction
Distributed denial of service (DDoS) attacks have become a major threat and one of the hardest problems to overcome on the Internet. Various activities, such as telecommunication, online banking, and online shopping, have recently been integrated through the Internet, yet the Internet is now plagued by more than 10 million infected hosts (or zombies) [1]. DDoS attacks have consequently become a serious threat.
DDoS attacks traditionally exploit vulnerabilities at the network layer, typically through SYN flooding, UDP flooding, and ICMP flooding. These attacks exhaust the network bandwidth and resources of the victim so that legitimate access is denied. Although many studies have developed defense methods, sophisticated DDoS attacks can now overcome these methods.
One recent sophisticated DDoS attack is called an application-layer DDoS (App-DDoS) attack [2]. Unlike conventional DDoS attacks, this type of attack exploits vulnerabilities at the application layer rather than at the network layer. App-DDoS attacks send small packets of legitimate content via normal, successfully established TCP connections; they use standard services such as HTTP and HTTPS and do not spoof IP addresses. Thus, conventional DDoS defense methods mistakenly regard App-DDoS attacks as normal connections. Furthermore, App-DDoS attacks are similar to a flash crowd event, which happens when massive numbers of normal users simultaneously send requests to one web server [3]. Hence, it is difficult to distinguish App-DDoS attacks from legitimate normal traffic.
There are several DDoS defense methods that utilize application-layer information. Ranjan et al. [4] analyzed time-related characteristics of HTTP sessions, such as session inter-arrival time, request inter-arrival time, and session arrival time. Yatagai et al. [5] presented a method that analyzes the correlation between browsing time and page information size. However, time-related features are insufficient to detect App-DDoS attacks because attackers can easily control packet-sending rates by utilizing a large-scale botnet [6]. On the other hand, Kandula et al. [7] developed a system that protects a web server from DDoS attacks by implementing a probabilistic authentication method using CAPTCHAs, but requiring users to solve graphic puzzles causes additional service delays. As a result, the graphic puzzles annoy legitimate users and can themselves become another DDoS attack point.
For the detection of App-DDoS attacks, Xie et al. [8] used a hidden semi-Markov model (HsMM) to describe the normal browsing behavior of web users. The HsMM uses the sequence order of web page requests to profile normal web browsing behavior. To detect App-DDoS attacks, they defined a normality threshold and compared it with the model's output values of incoming users. However, the sequence-order-based method can be complex and may cause many false alarms. The sequence order might vary significantly for different individuals and for different browsing behaviors. For example, web users can directly type URLs to request resources or utilize external web links. Furthermore, they can browse the resources of the web server with multiple browsers, possibly causing changes in the relative sequential positions.
In this article, we propose a sequence-order-independent method that distinguishes App-DDoS attacks from normal traffic. We regard the sequence order as a kind of noise rather than good information. We first extract the sequence-order-independent informative attributes from web page request sequences; these attributes represent a web user's activeness, pages of interest, and the breadth and intensity of their interest. We describe them in a matrix and use multiple principal component analysis (PCA) to model the browsing patterns. We then use the reconstruction error of the multiple PCA as a criterion for distinguishing App-DDoS attacks from normal usage.
This article is organized as follows. In Section 2, we describe App-DDoS attacks. In Section 3, we propose a new App-DDoS detection method that includes new sequence-order-independent attributes and a multiple PCA model. In Section 4, we validate our detection model with real traffic and discuss an early warning system. Finally, in Section 5, we present our conclusions.

App-DDoS Attacks
App-DDoS attacks exploit victim servers by exhausting resources such as sockets, CPU, memory, and disk bandwidth. According to [9], server resources may become the bottleneck of Internet applications. App-DDoS attacks can also be launched from mobile devices, such as smartphones and ubiquitous sensors, because they require few resources on the client side. Thus, App-DDoS attacks cause more serious problems than in the past. Figure 1 shows an example of an App-DDoS attack. The attacker first exploits a popular web server, the worm distribution server, to insert malicious code. This causes web users to download the malicious code, and their hosts become infected. When the attacker begins an attack via the command server, an excessive number of infected hosts request web pages from the victim; as a result, the victim's resources are eventually exhausted.
One of the differences between conventional DDoS and App-DDoS attacks is that App-DDoS attacks utilize only legitimate methods instead of vulnerabilities of protocols. App-DDoS attacks usually send small packets via successful TCP connections, and real IP addresses thus are used to launch attacks. In particular, App-DDoS attacks send packets through standard services such as HTTP and HTTPS. Moreover, these packets are generated with various sending rates to mimic legitimate users. Thus, these application-layer requests are indistinguishable from those generated by legitimate users.

Proposed method
In this section, we propose a defensive method that can be used for detecting an App-DDoS attack. We show how to represent a set of web browsing behaviors using sequence-order-independent attributes instead of web page request sequences. We then present a multiple PCA model to profile normal web browsing patterns and distinguish App-DDoS attacks. Since DDoS attack detection systems are required to handle an extremely large volume of traffic, we base our description of the web browsing patterns on PCA instead of nonlinear methods such as kernel methods and manifold learning [10,11].

Sequence-order-independent attributes
We represent each request sequence as a vector of extracted attributes. Let us assume that N users browsed a web server where the total number of web pages is D. For user i who browsed this server, let $s_i$ be the web page request sequence and $\eta_{d,i}$ be the number of requests of user i for page d of the server. Then, the total number of page requests of user i is
$$L_i = \sum_{d=1}^{D} \eta_{d,i}.$$
Now, let us define several sequence-order-independent attributes for detecting App-DDoS attacks. To give a clearer representation of the activeness of user i, we introduce the attribute
$$a_i = \frac{L_i}{\frac{1}{N}\sum_{j=1}^{N} L_j},$$
which is the ratio of the number of requests of a user to the average number of requests over all users. The next attribute,
$$h_{d,i} = \frac{\eta_{d,i}}{L_i},$$
is the proportion of page d among pages requested by user i; it shows how much user i was interested in page d of the server.
To help determine whether incoming users are indicative of a DDoS attack, we supplement the two basic attributes with two other attributes that characterize the web browsing patterns of a user. The first supplementary attribute is defined as the proportion of all server pages requested by user i, i.e.,
$$\tau_i = \frac{1}{D}\sum_{d=1}^{D} b_{d,i},$$
where $b_{d,i}$ is an indicator that equals 1 if $\eta_{d,i} > 0$ and 0 otherwise. This attribute represents the breadth of user i's interest.
The next supplementary attribute shows the intensity of interest in the user's page of greatest interest. This attribute for user i can be defined by using $q_i = \arg\max_d \{\eta_{d,i}\}$ to denote the most frequently requested page. Thus, the intensity of the user's interest in the page of greatest interest can be represented by the ratio of the number of requests for that page to the total number of page requests:
$$g_i = \frac{\eta_{q_i,i}}{L_i}.$$
From these attributes, we denote the (D + 3)-dimensional attribute vector of user i as
$$\mathbf{w}_i = [a_i, h_{1,i}, \ldots, h_{D,i}, \tau_i, g_i]^{T}.$$
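The attribute extraction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `extract_attributes`, the representation of pages as integer indices, and the requirement that every sequence is non-empty are assumptions made here for the example.

```python
from collections import Counter

def extract_attributes(sequences, num_pages):
    """Build one order-independent attribute vector per request sequence.

    sequences: list of non-empty lists of page indices in range(num_pages).
    Returns vectors [activeness, interest per page..., breadth, intensity],
    each of length num_pages + 3.
    """
    lengths = [len(s) for s in sequences]            # requests per user
    mean_length = sum(lengths) / len(lengths)        # average over all users
    vectors = []
    for seq, length in zip(sequences, lengths):
        counts = Counter(seq)                        # requests per page for this user
        activeness = length / mean_length
        interest = [counts.get(d, 0) / length for d in range(num_pages)]
        breadth = len(counts) / num_pages            # share of distinct pages visited
        intensity = max(counts.values()) / length    # share of the favorite page
        vectors.append([activeness] + interest + [breadth, intensity])
    return vectors

# The first two users request the same pages in a different order,
# so their attribute vectors are identical.
seqs = [[0, 1, 1, 2], [1, 2, 0, 1], [3, 3, 3, 3, 3, 3, 3, 3]]
w = extract_attributes(seqs, num_pages=4)
print(w[0] == w[1])
```

Because only request counts enter the computation, any permutation of a sequence maps to the same vector, which is the property the method relies on.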

PCA for web browsing behaviors
PCA is the simplest statistical method for transforming given data to new coordinates called principal components. By removing less important components, it can reduce the number of dimensions required to explain the given data. The reduced subspace best represents the given data in a least-squares sense.
We use PCA to model web browsing patterns; the modeling is based solely on normal users' attributes. To model the web browsing patterns, we first denote a mean vector, $\boldsymbol{\mu}_0$, and a covariance matrix, $\mathbf{C}$, as
$$\boldsymbol{\mu}_0 = \frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i, \qquad \mathbf{C} = \frac{1}{N}\mathbf{X}\mathbf{X}^{T},$$
where $\mathbf{X} = [\mathbf{x}_1 \ldots \mathbf{x}_N]$, $\mathbf{x}_i = \mathbf{w}_i - \boldsymbol{\mu}_0$, and $i = 1, \ldots, N$. We then compute the eigenvectors and eigenvalues by applying singular value decomposition to the covariance matrix.
If we let $\mathbf{u}_j$ be the jth most significant eigenvector of covariance matrix $\mathbf{C}$, then the significant principal components are denoted by $\tilde{\mathbf{U}} = [\mathbf{u}_1 \ldots \mathbf{u}_P]$, where P (≪ D) is the number of significant principal components. Since the remaining eigenvectors $[\mathbf{u}_{P+1} \ldots \mathbf{u}_{D+3}]$ are less significant, we can reduce the dimensions of the data without significant loss when these eigenvectors are discarded.
If the attribute vectors are projected into the subspace spanned by the P significant principal components, then we can represent the attributes of web user i in terms of the following P-dimensional coefficient vector:
$$\mathbf{a}_i = \tilde{\mathbf{U}}^{T}\mathbf{x}_i.$$
This coefficient vector indicates how much each principal component contributes to the representation of the given attributes.
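As an illustration of this PCA step, the following NumPy sketch centers the attribute vectors, applies singular value decomposition to the covariance matrix, and projects the data onto the leading components. The function names, the 1/N covariance normalization, and the random demonstration data are assumptions made for this example only.

```python
import numpy as np

def fit_pca(W, num_components):
    """Fit PCA on attribute vectors stored as the rows of W (shape N x M)."""
    mu0 = W.mean(axis=0)                  # mean vector
    X = W - mu0                           # centered vectors, one per row
    C = X.T @ X / W.shape[0]              # covariance matrix (1/N convention assumed)
    # For a symmetric positive semidefinite matrix, the left singular vectors
    # are its eigenvectors and the singular values are its eigenvalues,
    # sorted in decreasing order.
    U, eigvals, _ = np.linalg.svd(C)
    return mu0, U[:, :num_components], eigvals

def project(W, mu0, U_tilde):
    """Return the P-dimensional coefficient vectors, one per row."""
    return (W - mu0) @ U_tilde

# Synthetic data whose variance is concentrated in the first two columns.
rng = np.random.default_rng(0)
W = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
mu0, U_tilde, eigvals = fit_pca(W, num_components=2)
A = project(W, mu0, U_tilde)
print(A.shape, eigvals[:2])
```

The variance of each coefficient equals the corresponding eigenvalue, so the leading coefficients carry most of the spread in the data.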

Multiple PCA model
Describing real traffic via a single PCA model is difficult because the traffic data usually include many patterns, variations, and different types of noise. We therefore propose to use a multiple form of PCA for effective modeling of real traffic. For the multiple PCA model, we use the k-means clustering method to partition the given data into several clusters [12]. k-means clustering is a well-known algorithm for unsupervised clustering, but it is inappropriate for sparse or concave-shaped data [13,14]. With our attributes, it is frequently the case that a particular data element may remain zero because some web pages may not be requested for a long period. Accordingly, our attribute matrix may have a high degree of sparsity. To overcome this problem, we perform k-means clustering on the values of the PCA coefficients, $\mathbf{a}_i$, instead of the values of the raw data, $\mathbf{w}_i$. Because the $\mathbf{a}_i$ values are low-dimensional and not sparse, we can easily partition the given attributes into several clusters with the coefficients.
Next, we build a PCA model on each cluster. Let $\mathbf{w}_i^{(k)}$ be user i's attribute vector that belongs to cluster k. For each cluster, we first normalize the attribute vector element-wise by $\mathbf{x}_i^{(k)} = (\mathbf{w}_i^{(k)} - \boldsymbol{\mu}^{(k)})/(\boldsymbol{\sigma}^{(k)})^2$, where $\boldsymbol{\mu}^{(k)}$ and $(\boldsymbol{\sigma}^{(k)})^2$ are the mean and variance vectors for cluster k, respectively. We then compute the P significant principal components, $\mathbf{U}^{(k)}$, for cluster k as described in the previous section. Once the principal components of cluster k are computed, we can reconstruct the original attribute vector with only P principal components. The reconstructed data, $\hat{\mathbf{x}}_i^{(k)}$, and their errors, $\varepsilon_i$, are denoted as follows:
$$\hat{\mathbf{x}}_i^{(k)} = \mathbf{U}^{(k)}\mathbf{U}^{(k)T}\mathbf{x}_i^{(k)}, \qquad \varepsilon_i = \lVert \mathbf{x}_i^{(k)} - \hat{\mathbf{x}}_i^{(k)} \rVert^2.$$
Designed exclusively for normal traffic, our PCA model produces a good representation of the attributes of normal traffic but a poor representation of the attributes of unseen traffic. As a result, the reconstruction error is low for normal behavior but high for abnormal behavior.
We regard the high reconstruction errors of the PCA as statistical outliers. Hence, we choose a threshold, $\delta^{(k)}$, for cluster k and regard the given web browsing behavior as normal if $\varepsilon_i \le \delta^{(k)}$, with
$$\delta^{(k)} = E[\varepsilon] + \beta\sqrt{E[(\varepsilon - E[\varepsilon])^2]},$$
where $E[\varepsilon]$ and $E[(\varepsilon - E[\varepsilon])^2]$ are the mean and variance of the reconstruction errors for model k, respectively. The β value is a scale factor for defining the outlier range. According to studies on outlier detection, the outlier range should deviate from the mean by more than two or three standard deviations [15].
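The training procedure for the multiple PCA model can be sketched as follows. This is a hedged illustration, not the authors' code: the hand-written Lloyd's k-means, the function names, the small `eps` added to the variance to avoid division by zero, the use of the squared Euclidean distance as the reconstruction error, and the synthetic two-group data are all assumptions made here.

```python
import numpy as np

def top_components(X, p):
    """Return the p leading eigenvectors of the covariance of centered rows X."""
    U, _, _ = np.linalg.svd(X.T @ X / len(X))
    return U[:, :p]

def kmeans(A, k, iters=50, seed=0):
    """Plain Lloyd's k-means on the rows of A; returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = A[rng.choice(len(A), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):        # keep the old center if a cluster empties
                centers[j] = A[labels == j].mean(axis=0)
    dist = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), centers

def reconstruction_error(X, U):
    """Squared distance between each row of X and its projection onto span(U)."""
    return ((X - X @ U @ U.T) ** 2).sum(axis=1)

def train_multiple_pca(W, k, p_init, p, beta=2.5, eps=1e-12):
    mu0 = W.mean(axis=0)
    U0 = top_components(W - mu0, p_init)
    A = (W - mu0) @ U0                     # cluster on dense PCA coefficients
    labels, centers = kmeans(A, k)
    models = []
    for j in range(k):
        Wk = W[labels == j]
        if len(Wk) == 0:
            continue
        mu, var = Wk.mean(axis=0), Wk.var(axis=0) + eps
        Xk = (Wk - mu) / var               # normalization as stated in the text
        Uk = top_components(Xk, p)
        err = reconstruction_error(Xk, Uk)
        delta = err.mean() + beta * err.std()   # threshold: mean + beta * std
        models.append({"center": centers[j], "mu": mu, "var": var,
                       "U": Uk, "delta": delta, "train_err": err})
    return {"mu0": mu0, "U0": U0, "models": models}

# Two synthetic groups of "normal" users, each close to a 2-D subspace.
rng = np.random.default_rng(1)
g1 = rng.normal(size=(150, 2)) @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(150, 6))
g2 = 10.0 + rng.normal(size=(150, 2)) @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(150, 6))
model = train_multiple_pca(np.vstack([g1, g2]), k=2, p_init=2, p=2)
for m in model["models"]:
    print(len(m["train_err"]), float(m["delta"]))
```

Clustering on the low-dimensional coefficients rather than on the raw attribute vectors mirrors the motivation given above: the coefficients are dense, whereas the raw vectors may be highly sparse.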

Detection method
If a new web user, t, requests a web page from the server, then we first form the attribute vector $\mathbf{w}_t$, compute its PCA coefficient vector $\mathbf{a}_t$, and determine the best fitting model as follows:
$$\pi = \arg\min_{k} \lVert \mathbf{a}_t - \mathbf{m}_k \rVert,$$
where k = 1,...,K. The value K is the total number of clusters, and $\mathbf{m}_k$ is the mean of the PCA coefficients for cluster k. After selecting the best fitting model, π, we normalize $\mathbf{w}_t$ using $\boldsymbol{\mu}^{(\pi)}$ and $\boldsymbol{\sigma}^{(\pi)}$. Finally, the model compares the reconstruction error, $\varepsilon_t$, with the error threshold, $\delta^{(\pi)}$. If $\varepsilon_t > \delta^{(\pi)}$, then the model regards the current user as an App-DDoS attack. Figures 2 and 3 show the pseudocode of the proposed method, which includes model training and testing.
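The detection step can be illustrated with a small NumPy sketch. The dictionary layout of the model, the function name `detect`, the squared Euclidean reconstruction error, and the hand-made three-attribute toy model are assumptions chosen so that the arithmetic can be followed by hand; they are not taken from the paper's experiments.

```python
import numpy as np

def detect(w_t, model):
    """Return (is_attack, cluster index, reconstruction error) for one attribute vector."""
    a_t = (w_t - model["mu0"]) @ model["U0"]              # initial PCA coefficients
    centers = np.array([m["center"] for m in model["models"]])
    pi = int(np.argmin(((centers - a_t) ** 2).sum(axis=1)))  # best-fitting cluster
    m = model["models"][pi]
    x = (w_t - m["mu"]) / m["var"]                        # normalize with cluster statistics
    x_hat = m["U"] @ (m["U"].T @ x)                       # reconstruct from the kept components
    err = float(((x - x_hat) ** 2).sum())
    return err > m["delta"], pi, err

# Hand-made toy model: clusters are separated along the first attribute, and
# each cluster's normal variation lies along the second attribute.
e1 = np.array([[1.0], [0.0], [0.0]])
e2 = np.array([[0.0], [1.0], [0.0]])
model = {
    "mu0": np.zeros(3),
    "U0": e1,
    "models": [
        {"center": np.array([-5.0]), "mu": np.array([-5.0, 0.0, 0.0]),
         "var": np.ones(3), "U": e2, "delta": 1.0},
        {"center": np.array([5.0]), "mu": np.array([5.0, 0.0, 0.0]),
         "var": np.ones(3), "U": e2, "delta": 1.0},
    ],
}
normal_user = detect(np.array([5.2, 3.0, 0.1]), model)    # varies along the modeled direction
attack_user = detect(np.array([-4.0, 0.0, 2.0]), model)   # deviates off the modeled subspace
print(normal_user, attack_user)
```

In the toy model, the first user is assigned to the second cluster and reconstructed with an error of about 0.05, below the threshold of 1.0, whereas the second user is assigned to the first cluster and has an error of 5.0, so only the second is flagged.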

Datasets
To validate our App-DDoS attack defense method, we used the web logs from three real websites: an educational website, a community website, and an online shopping website. On the educational website, students frequently request a subset of webpages that contain educational information. On the community website, users repeatedly request the same webpage, which can show different content through server scripts such as PHP and ASP. On the online shopping website, customers search widely for items of interest and wander aimlessly with multiple browsers. We extracted host IPs, request times, and requested webpages from the web logs, and then constructed sequence datasets that include 30,140, 26,727, and 99,018 sequences, respectively. The characteristics of the datasets are shown in Table 1. We randomly selected half of the normal sequences to train the model, and for testing, we used the remainder of the normal sequences as well as the attack sequences. To reliably validate our detection model, we repeated all our experiments with 20 different random seeds.

App-DDoS attacks
To validate our detection model, we performed experiments with three types of attacks. Most studies on App-DDoS attacks restrict their experiments to random page attacks, which involve requests for randomly selected pages from the web server. Random page attacks can be easily detected by defense methods because their patterns are quite different from the behavior of a normal user. However, as mentioned, DDoS attackers have become more sophisticated and now tend to mimic the normal user behavior. Our experiments therefore include additional types of attacks that mimic the patterns of normal users, particularly main page attacks and dominant page attacks. A main page attack repeatedly requests the main page, which is the page most frequently requested by users; a dominant page attack randomly requests any of the pages frequently requested by most users.
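The three attack types can be mimicked with simple sequence generators, sketched below. This is only an illustration under stated assumptions: the function names, the integer page indices, and the `top` parameter (how many of the most requested pages count as dominant) are hypothetical, and the attack traffic in the experiments was produced with a real attack tool rather than with code like this.

```python
import random
from collections import Counter

def page_popularity(normal_sequences):
    """Count how often each page is requested across normal sequences."""
    return Counter(page for seq in normal_sequences for page in seq)

def main_page_attack(popularity, length):
    """Repeatedly request the single most requested page."""
    main_page = popularity.most_common(1)[0][0]
    return [main_page] * length

def random_page_attack(num_pages, length, rng):
    """Request pages chosen uniformly at random from the whole site."""
    return [rng.randrange(num_pages) for _ in range(length)]

def dominant_page_attack(popularity, length, rng, top=3):
    """Request pages chosen at random among the `top` most requested pages."""
    dominant = [page for page, _ in popularity.most_common(top)]
    return [rng.choice(dominant) for _ in range(length)]

rng = random.Random(0)
normal = [[0, 0, 1], [0, 2, 1], [0, 3]]
pop = page_popularity(normal)
main_seq = main_page_attack(pop, length=5)
rand_seq = random_page_attack(num_pages=4, length=5, rng=rng)
dom_seq = dominant_page_attack(pop, length=5, rng=rng, top=2)
print(main_seq, rand_seq, dom_seq)
```

Main page and dominant page sequences stay within pages that normal users also request often, which is why they are harder to separate from normal traffic than random page sequences.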
For the App-DDoS attack datasets, we initiated App-DDoS attacks on a web server featured in other studies [4,5,8]. The attack traffic was generated in a network of the Korea Advanced Institute of Science and Technology (KAIST) via a modification of BlackEnergy, which is a well-known DDoS attack tool. For security reasons, we constructed copies of the websites in the KAIST network as target systems of the attacks. During the course of the App-DDoS attacks, we collected web logs from the web server and extracted web page request sequences. The collected sequences include 9,492 main page attacks, 14,038 random page attacks, and 9,489 dominant page attacks.

Results and analysis
To construct our App-DDoS defense model, we first projected the proposed attribute matrix into a subspace spanned by the initial principal components, $\tilde{\mathbf{U}}$, because it was a highly sparse matrix. Then, we partitioned the data into several groups by k-means clustering based on the initial PCA coefficients. The parameter k was initially set to a sufficiently large number (k = 100), and similar clusters were merged. Then, we constructed the multiple PCA models. We used seven principal components in consideration of complexity and overfitting problems [15]; however, this number could be adjusted as needed.
In this section, the performance of App-DDoS defense methods is illustrated with several sets of results. We first confirm whether the proposed attributes provide useful information for discriminating App-DDoS attacks. Figure 4 shows the measured values of $a_i$, $\tau_i$, and $g_i$. In this figure, the measured values for normal traffic are shown in the first row, and those for attacks are shown in the second row. In Figure 5, the values of $\mathbf{h}_i$ are also plotted. Note that dimension reduction is applied to visualize the multi-dimensional vector $\mathbf{h}_i$. We can see the differences between normal traffic and attacks in terms of these values. Hence, we regard these values as important attributes even though they discard sequence-order information.
The proposed method discriminates App-DDoS attacks based on the reconstruction errors of the PCA. Figure 6 therefore shows the reconstruction errors of a model, k. Here, the dots and the circles indicate the reconstruction errors of normal traffic for training data and validation data, and the axes indicate those of attack traffic for test data. Since our principal components are obtained to represent only normal traffic, the reconstruction errors for normal traffic are naturally low. In contrast, the reconstruction errors for attack traffic are much higher than those of normal traffic. Note that some high values in this figure are clipped to 70 for visualization.
We compared our method with a sequence-order-based defense method on standard metrics such as the detection rate (DR), false positive rate (FPR), and receiver operating characteristic (ROC) curves. For this purpose, we constructed a sequence-order-based defense method, which is modeled by the HsMM [8]. Webpage sequences for the HsMM were extracted from our web logs. The best parameters were selected after experiments with a variety of parameters were conducted; initial parameters such as prior probabilities, transition matrix, and observation matrix were randomly set. For the purpose of comparison, we first conducted experiments on various decision thresholds. The DR and the FPR were measured on each dataset and averaged. Table 2 shows the results of the experiment. As the threshold became large, the models yielded a relatively low DR and a low FPR, and vice versa. Our method tended to detect more App-DDoS attacks on all thresholds, but its FPR was slightly higher than that of the HsMM. Figure 7 shows the ROC curves comparing our method and the HsMM. In this figure, we can see that the DR of our method is higher than that of the HsMM at most FPRs. When we carefully take into account the DR, the FPR, and ROC curves, the threshold can be set between μ + 2σ and μ + 3σ. This result agrees with studies based on outlier detection: the thresholds are usually set to two or three standard deviations from the mean [16]. In this experiment, our detection model has a DR of 86.7% and an FPR of 4.5% for a threshold of μ + 2.5σ, and the HsMM has a DR of 74.3% and an FPR of 5.4% for a threshold of μ + 2σ.
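The trade-off between DR and FPR as the threshold moves can be made concrete with a short sketch. The function name `rates` and all error values below are made up for illustration; they are not the paper's measurements.

```python
from statistics import mean, pstdev

def rates(train_err, normal_err, attack_err, beta):
    """Return (threshold, detection rate, false positive rate) for mean + beta * std."""
    delta = mean(train_err) + beta * pstdev(train_err)
    dr = sum(e > delta for e in attack_err) / len(attack_err)    # attacks flagged
    fpr = sum(e > delta for e in normal_err) / len(normal_err)   # normal users flagged
    return delta, dr, fpr

# Illustrative reconstruction errors (hypothetical numbers).
train_err = [1.0, 2.0, 3.0, 2.0, 2.0]       # normal traffic used for training
normal_err = [1.5, 2.5, 3.5, 2.0]           # held-out normal traffic
attack_err = [3.0, 3.6, 5.0, 9.0]           # attack traffic

for beta in (2.0, 3.0):
    delta, dr, fpr = rates(train_err, normal_err, attack_err, beta)
    print(f"beta={beta}: threshold={delta:.3f}, DR={dr:.2f}, FPR={fpr:.2f}")
```

With these toy numbers, raising β from 2 to 3 lowers the FPR from 0.25 to 0 but also lowers the DR from 0.75 to 0.5, which is the same qualitative behavior reported for larger thresholds.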
As previously mentioned, we performed experiments with three types of App-DDoS attacks. The DRs for each type of attack are shown in Table 3. Performance evaluations were conducted on each dataset because browsing behaviors can be different for different websites. As we expected, random page attacks tend to be detected well by defensive methods, and dominant page attacks tend to be detected less often. We can see that our method outperforms the HsMM in most cases. The HsMM shows good performance for random page attacks but poor performance for main page and dominant page attacks. This indicates that random page attacks generate disordered sequences, while the other attacks generate natural sequences by mimicking normal behaviors. In addition, as shown for dataset3 in this table, sequence orders can be broken in complicated websites that have many hyperlinks. On the other hand, our method can detect the attacks better than the HsMM. One of the reasons is that we utilize specific numerical values for detecting App-DDoS attacks and employ a divide-and-conquer approach.
One of the advantages of our detection method is that additional information can easily be inserted into the model. Our method can use any type of numerical attribute, while the HsMM accepts only ordered data. Thus, network experts' opinions can easily be applied to our method to improve the detection performance. For example, it was difficult to detect main page attacks on the community website because the same page was also repeatedly requested by normal users. In this case, we can utilize information about the HTTP status codes and parameters of server scripts. Thus, we extracted this information from the web logs and simply attached two additional values to the attribute vector: $v_1 = e_i / L_i$ and $v_2 = p_i / L_i$, where $e_i$ is the number of requests in sequence i whose status code indicates an error (e.g., 404 Not Found), and $p_i$ is the number of pages in sequence i that include the same script parameters (e.g., "id = 1"). Figure 8 shows the results of measurement for normal traffic and attacks. The main page attack includes three types of script parameter patterns: (1) no script parameter, (2) random script parameters, and (3) increasing script parameters. Adding new attributes helps improve the DR for the main page attacks. For main page attacks on dataset2, the DR is originally 49.1%, but we can achieve a DR of 83.7%, as shown in Table 3, after attaching the two new attributes to the attribute vector.
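A possible way to compute the two additional values from parsed log entries is sketched below. Several choices here are assumptions rather than details from the source: each request is represented as a hypothetical (status code, script parameter string) pair, status codes of 400 and above are treated as errors, and the count of pages with the same script parameters is interpreted as the number of requests that share the most frequent non-empty parameter string.

```python
from collections import Counter

def extra_attributes(requests):
    """Compute (error ratio, repeated-parameter ratio) for one request sequence.

    requests: non-empty list of (status_code, script_params) tuples, where
    script_params is "" when the request carries no script parameter.
    """
    length = len(requests)
    errors = sum(1 for status, _ in requests if status >= 400)   # assumed error rule
    params = Counter(p for _, p in requests if p)
    repeated = params.most_common(1)[0][1] if params else 0      # assumed reading of "same parameters"
    return errors / length, repeated / length

sequence = [(200, "id=1"), (200, "id=1"), (404, "id=7"), (200, "id=1")]
v1, v2 = extra_attributes(sequence)
no_params = extra_attributes([(200, ""), (200, "")])
print(v1, v2, no_params)
```

For the example sequence, one of four requests fails and three of four share the parameter "id=1", giving ratios of 0.25 and 0.75; a sequence without script parameters yields 0 for both values.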

Early warning system
The proposed method can be used as an early warning system. A detection method that utilizes sequences should determine the sequence length at which the algorithm is launched. The general principle for determining the sequence length is as follows: the shorter the sequence, the earlier the response, albeit with less reliability. Figure 9 shows the average reconstruction errors as the sequence length varies. The reconstruction error for legitimate users remains at a certain level, whereas the error for attacks becomes large as the sequence length varies from short to long. Therefore, if the DDoS detection model is launched with an appropriately short sequence length, the model can act as an early warning system. Although the proposed model may make incorrect decisions at first, it gradually refines its decisions as more requests are obtained.
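The idea of re-scoring a growing prefix of the request sequence can be sketched as a short loop. The scoring function used here is a deliberately simple, hypothetical stand-in (the number of requests for the most requested page); in the proposed method, the score would be the reconstruction error of the multiple PCA model. The function names, the starting length, and the threshold are likewise assumptions for illustration.

```python
from collections import Counter

def early_warning(sequence, score, threshold, start=5, step=1):
    """Return the shortest prefix length whose score exceeds the threshold, else None."""
    for length in range(start, len(sequence) + 1, step):
        if score(sequence[:length]) > threshold:
            return length
    return None

def stand_in_score(prefix):
    # Hypothetical stand-in for the reconstruction error:
    # the number of requests for the most requested page in the prefix.
    return Counter(prefix).most_common(1)[0][1]

attack = [0] * 20                 # keeps requesting the same page
normal = [0, 1, 2, 3] * 5         # spreads requests over several pages
attack_alarm = early_warning(attack, stand_in_score, threshold=8)
normal_alarm = early_warning(normal, stand_in_score, threshold=8)
print(attack_alarm, normal_alarm)
```

Here the attack sequence raises an alarm after nine requests, while the normal sequence never crosses the threshold. A smaller starting length gives an earlier response at the cost of less reliable early decisions.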

Conclusion
The focus of this article is on detecting App-DDoS attacks. We proposed a new model that utilizes sequence-order-independent attributes rather than the web page sequence order. The model consists of multiple PCAs so that complex browsing behaviors are given maximal consideration. Requiring only the web log and simple computations, the proposed method is practical and efficient for detecting App-DDoS attacks in real environments. To reliably validate our model, we generated three types of App-DDoS attacks. Our method detects App-DDoS attacks with an averaged DR of 86.7% and an averaged FPR of 4.5% when the error threshold is set at μ + 2.5σ. These values demonstrate that the proposed method can effectively describe a web user's browsing behavior and detect App-DDoS attacks.
In addition, the proposed model has the capability of acting as an early warning system. Future research will focus on efficient updates and automatic learning algorithms, so that we can develop a defense method with online updates.