Intrusion detection in Internet of Things using supervised machine learning based on application- and transport-layer features of the UNSW-NB15 data-set

Internet of Things (IoT) devices are well connected; they generate and consume data, which involves transmitting data back and forth among various devices. Ensuring the security of this data is a critical challenge for IoT. Since IoT devices are inherently low-power and lack the compute capacity to run heavyweight security software, a Network Intrusion Detection System (NIDS) is typically employed to detect malicious packets and prevent them from entering the network. In this context, we propose feature clusters in terms of Flow, Message Queuing Telemetry Transport (MQTT) and Transmission Control Protocol (TCP) using features from the UNSW-NB15 data-set. We eliminate problems like over-fitting, the curse of dimensionality and imbalance in the data-set. We apply supervised Machine Learning (ML) algorithms, i.e., Random Forest (RF), Support Vector Machine (SVM) and Artificial Neural Networks (ANN), to the clusters. Using RF, we achieve 98.67% and 97.37% accuracy in binary and multi-class classification, respectively. With the cluster-based techniques, we achieve 96.96%, 91.4% and 97.54% classification accuracy using RF on the Flow & MQTT features, the TCP features and the top features from both clusters, respectively. Moreover, we show that the proposed feature clusters provide higher accuracy and require less training time than other state-of-the-art supervised ML-based approaches.

IoT is a collection of linked, interconnected digital devices, mechanical equipment, objects, creatures or individuals, each equipped with a unique identifier and the capacity to transfer data over a network. This involves the ability to send information and commands over a typically wireless connection without requiring human-to-computer or human-to-human interaction. With every passing day, hackers are becoming smarter and more aggressive. According to Threatpost (a leading news site for information-technology-related news), almost 98% of IoT device traffic is transmitted in plain text and more than 50% of IoT devices are vulnerable to high- or medium-severity attacks [3]. Given the scale, pace and complexity of today's threat landscape, we must be capable of responding to the dangers posed by such attacks in a timely and effective manner.
By detecting malicious traffic in IoT, it is ensured that IoT devices stay connected at a high level of connectivity without interruption. Moreover, IoT devices must also stay safe from hackers, which can be achieved by keeping malicious traffic away from them. Data transmission should also be kept secure and consistent, without corruption. Efficient detection of Denial of Service (DoS) attacks is critical for ensuring reliable communication between IoT devices. We have considered all of these problems and propose an efficient solution for detecting malicious traffic before it reaches IoT devices.
There are many publicly available data-sets for research on Intrusion Detection Systems (IDS). The most widely used among them are DARPA 98 [4], KDD Cup 99 [5], NSL-KDD [6] and UNSW-NB15 [7]. DARPA 98 was first made available in February 1998. It contains several weeks of network data and audit logs, but it does not depict real-world traffic. KDD Cup 99, made public in 1999, is based on an improved version of DARPA 98; however, it contains problems such as duplicate and redundant records. It is the data-set most widely used for IDS research. NSL-KDD, made public in 2009, is a refined form of the KDD 99 data-set in which the problems of KDD 99 are removed. UNSW-NB15 is the latest NIDS data-set, released in 2015, and contains the most comprehensive attack scenarios.
The UNSW-NB15 data-set was created using the IXIA PerfectStorm device (used to test the security of devices) in the Cyber Range Lab of the Australian Centre for Cyber Security, which generated modern, realistic normal and attack traffic. Besides normal traffic, it contains nine attack scenarios seen in today's networks. Approximately 2.5 million packets were captured and are publicly available. The data-set is available in BroIDS, csv, pcap and argus formats. Besides the full traffic, the authors also created csv files containing approximately 10% of the data. This is the latest benchmark data-set for the NIDS scenario.
Many researchers have used the UNSW-NB15 data-set for evaluating IDS. Lopez-Martin et al. [8] applied a kernel approximation technique in SVM to approximate the Radial Basis Function (RBF) and used a combination of the UNSW-NB15, NSL-KDD and Moore data-sets. Moustafa et al. [9] used the UNSW-NB15 and Network Information Management and Security Group (NIMS) data-sets to extract features relevant to Domain Name System (DNS), Hyper Text Transfer Protocol (HTTP) and MQTT attacks, and applied three ML techniques: Decision Trees (DT), Naive Bayes (NB) and ANN. Zhou et al. [10] used a Deep Feature Embedding Learning (DFEL) technique to extract high-level features from the data-set and then applied ML techniques (Gradient Boosting Tree (GBT), K-Nearest Neighbors (KNN), DT, Logistic Regression (LR), Gaussian Naive Bayes (GNB), SVM) on those features for evaluation. Kumar et al. [11] created different numbers of clusters from the UNSW-NB15 data-set, evaluated their efficiency using the Silhouette measure (used to check the consistency of data clusters) and then evaluated them using several DT-based ML techniques.
Against this backdrop, this work proposes a framework for defending IoT-based networks from cyber-attacks using a NIDS. The proposed model uses conventional supervised ML techniques including RF, SVM and ANN. The estimators are trained using a set of features, and the trained model is used to detect malicious traffic. Moreover, various subsets of the feature set are also used to identify malicious traffic, enabling a reduced set of features for detection. We also calculate mathematical properties by combining traffic flows through a study of flow parameters selected from the MQTT and TCP parameters. After removing many discrepancies from the data-set, we identify features from the flow/MQTT and TCP protocols, discarding the features that cause over-fitting. The top contributing features in the flow/MQTT and TCP clusters are then used for classification. These clusters are selected after removing critical issues from the data-set, such as over-fitting, class imbalance, the curse of dimensionality, datatype incompatibility and null values.
A NIDS is capable of identifying malicious network traffic using a set of characteristics derived from the application, transport and network layers. When positioned at a specific point inside a network to track data into and out of all machines on the network, the IDS can analyse forwarded traffic by comparing it against an ML model trained on well-known and well-documented threats. Once an intrusion is detected, warning messages can be sent to alert the administrator. Developing and implementing an efficient NIDS requires a data source that includes a collection of appropriate parameters for categorizing normal and malicious traffic instances using a decision-making tool. This work specifically addresses this problem by proposing an efficient network intrusion detection system for IoT devices.
We structured our research so that we first identified the most relevant data-set for IDS in IoT, i.e., the UNSW-NB15 data-set. The data-set is then brought into algorithm-executable form through pre-processing. To classify network traffic in IoT, our main focus remained on the features in the data-set related to the flow/MQTT protocols and the TCP protocol. For comparison with the features in our identified clusters, we also classify network traffic using the full feature set and using only the top contributing features from the flow/MQTT and TCP clusters. All four clusters are then used for classification by the ML algorithms, and the accuracy results are evaluated.
Key contributions of this work are as follows:
1. Missing value imputation using three different techniques, i.e., mean, linear regression and multiple imputation.
2. Binary and multi-class classification of malicious and normal packets using the full feature set (37) with three different supervised learning classifiers: RF, SVM and ANN.
3. Binary and multi-class classification of malicious and normal packets using the TCP features (18) with the same three classifiers.
4. Binary and multi-class classification of malicious and normal packets using the Flow and MQTT features (13) with the same three classifiers.
5. Binary and multi-class classification of malicious and normal packets using the top contributing features selected from the TCP and Flow & MQTT feature sets (11) with the same three classifiers.
The remainder of the paper is organized as follows. Section 2 covers background and previous work in this field, Sect. 3 details the proposed solution, and Sect. 4 presents the results of the experiments performed using the methodology of Sect. 3, compares them with the state-of-the-art and analyses the outcomes. Section 5 concludes this work.

Related literature
The primary focus of this work is to develop an effective NIDS that can detect malicious traffic in order to prevent attempts to manipulate IoT operations and resources. A considerable amount of literature has been published on NIDS [12][13][14]. These studies are motivated by the issues arising from handling large amounts of network data and its dynamic nature.

NIDS
Industrial NIDS mainly employ either quantitative measurements or derived specifications over feature collections such as packet size, inter-arrival time and stream length, as well as other network data parameters, to make predictions within a fixed time frame [15]. They suffer from both high false-positive and high false-negative rates. A significant false-negative rate means that the NIDS will frequently miss threats, while a high false-positive rate means that the NIDS will raise alerts when there is no actual attack. These industrial approaches are thus inadequate for modern threats. Auto-learning is among the most powerful ways of coping with present-day attacks. It utilizes supervised, semi-supervised and unsupervised ML techniques to identify the patterns of normal and hostile behavior from a large repository of both normal and attack traffic, including events at the host. While the research literature includes numerous ML-based solutions, their transfer to commercial products is still in its initial stages [16]. The latest ML-focused approaches produce high false-positive rates with heavy processing costs [17]. This is because these ML techniques learn only local features of basic TCP/Internet Protocol (IP) functions.
A wide body of academic research has analysed the de facto standard baseline data-set, KDDCup 99, to enhance Intrusion Detection (ID) effectiveness. KDDCup 99 was generated from tcpdump data of the 1998 DARPA ID evaluation framework. The objective was to construct a prediction model that would divide connection records into Normal or Attack classes. Attacks are classified into the classes DoS, Probe, Remote to Local (R2L) and User to Root (U2R). In the KDDCup 99 competition, Mining Audit Data for ID Automated Models (MADAMID) was used as the feature construction framework [18]. There are 41 features in MADAMID: 9 packet features, 13 content features, 9 traffic-related features and 10 host-based features. Two variants of the data-set are available: complete and 10%.
The comprehensive evaluation findings of the KDDCup 98 and 99 contests have been released in [17]. There were a total of 24 entries in KDDCup 98, and only three winning entries, all using variations of DT, showed marked statistical significance in their results. The 9th-placed entry used a 1-nearest-neighbor classifier. The first significant difference in output was observed between the 17th and 18th entries, leading [17] to regard the first 17 entries as statistically equivalent. The task of the Third International Knowledge Discovery and Data Mining Tools Competition remains a benchmark, resulting in the identification of several ML approaches. In most of the published results, only the 10% training and evaluation data were used and, in a few cases, personalized data-sets were built. A detailed literature review of ML-based ID using the KDDCup 99 data-set has recently been carried out [19].
Following the competition, most of the published findings on KDDCup 99 employed feature construction methods to reduce dimensionality [19]. Although a few researchers used custom data-sets, the bulk employed the same data-set for newly developed ML techniques [19]. Such published results are partly comparable with the KDDCup 99 contest results. The classification method employed in [20] comprises P-rules and N-rules for predicting the presence and absence of a class, respectively. It performed competitively against the earlier KDDCup 99 findings, apart from the U2R class.
The importance of feature credibility assessment for IDS was explored in [21] using KDDCup 99, one of the most commonly deployed data-sets. In [22], RF variations are addressed: they detect misuse by identifying intrusion patterns, detect anomalies by identifying outliers, and combine both techniques into a hybrid approach. Compared with previously published unsupervised anomaly-identification techniques, their misuse-based anomaly identification improves even on the results of the top contestants of the KDDCup 99 challenge. Combining misuse and anomaly detection in a hybrid system gives the advantage of improved performance [23][24][25].
A weak classifier of decision stumps with the AdaBoost method was employed as an ID technique [26]. With low complexity, a low false-alarm rate and higher detection, the proposed system performed better than previously published research. However, incremental learning was not adopted, which is a drawback. The study in [27] reported that a model based on shared nearest neighbor (SNN) gave the best performance, with a high detection rate. Compared with K-means, SNN showed better performance on the U2R class when experiments were performed on a reduced data-set; however, the full data-set was not utilized in their study.
Bayesian networks for ID were explored through NB networks, with the root and leaf nodes representing the class and the features of a connection, respectively. A variety of experiments on ID using NB showed better performance of Bayesian networks than the top competitors of the KDDCup 99 challenge in the Probe and U2R categories [28]. Parzen-window estimation employing Gaussian kernels with a normal distribution, which makes the estimation nonparametric, was also conducted [29]. Their model, together with an ensemble of DT, compared favourably with the prevailing winning entries on intrusion data.
A NIDS based on a genetic algorithm was proposed for identifying complex anomalous behavior, which simplifies the modeling of spatial and temporal information [30]. In [31], an overview of ID techniques based on ensemble learning can be found, and swarm-intelligence approaches employing ant-colony optimization and clustering, as well as system optimization based on particle swarms, have been conducted [32]. A comparison of research works in related fields shows a predominant use of descriptive statistics.

ML-based NIDS observation
To develop NIDS, numerous ML/Deep Learning (DL) techniques have been used, including RF, SVM, NB, Self-Organizing Maps (SOM) and ANN [12]. In [33], a Restricted Boltzmann Machine (RBM) is used for feature reduction and SVM for classification to implement a NIDS, giving approximately 87% model accuracy. In combination with generative models, a discriminative RBM is used for classification and achieves good accuracy [34].
For network traffic classification, variants of tree-based techniques are employed, eight in total [35]. For selecting relevant features from the NSL-KDD data-set, a DT algorithm is used, and for classification, the RF algorithm is applied. Principal Component Analysis (PCA) is used for selecting relevant features, and SVM is utilized for selecting a subset of optimal features [36]. A sparse encoder is developed for feature reduction in the NSL-KDD data-set in conjunction with self-taught learning, in order to develop a flexible NIDS [12]. They evaluated their methodology on the data-set with a soft-max regression classifier and achieved an accuracy of 92.48%. The authors of [34] concluded after experimentation that performance decreases if classification is performed on different training data. The work in [35] claimed, after performing experiments, that accuracy increases and the false-alarm rate decreases when classification is performed using the random tree technique.

UNSW-NB15 data-set
UNSW-NB15 is a publicly available data-set published in [7]; its authors provided detailed traffic data and also published 10% of the traffic associated with the data-set. A large proportion of the existing literature relates to research conducted on this 10% portion. For instance, in [8], the authors apply ML algorithms such as RF, SVM, Multi-Layer Perceptrons (MLP) and 1D Convolutional Neural Networks (CNN-1D) and carry out binary as well as multi-class classification. Their results show accuracies of 89.8% and 77.8% using CNN-1D in binary and multi-class classification, respectively.
In [37], the authors used this data-set to detect DoS attacks by reducing the number of features. They lowered the UNSW-NB15 training set to 24,596 packets and the test set to 68,264 packets of internet traffic, including normal and DoS attack traffic. The feature set is reduced from 47 to 27 by removing content features, general-purpose features, time fields and nominal-type features in the flow and basic field classes. They proposed Deep Radial Intelligence with Cumulative Incarnation (DeeRaI with CuI) to detect DoS traffic. In DeeRaI, an RBF with multiple abstraction levels is used to extract the intelligence; the weights are then optimized using CuI, and the extracted information is passed to the next level. Using DeeRaI with CuI, they achieved a best accuracy of 96.15%. UNSW-NB15 is by far the most comprehensive data-set available for testing on malicious traffic, and for this reason we also use it for training, testing and evaluation of our proposed solution. Data parameters are given in Table 1. Each record consists of 49 features. For a detailed study of the features and data attributes, we encourage the reader to consult [7]. We use both types of data for training and evaluation purposes.

Feature selection techniques in UNSW-NB15
The basic distribution given for UNSW-NB15 [7] is based on packet and flow features. Six categories have been defined, i.e., basic, time, content, flow, additional generated and labels. The additionally generated features are calculated from the flow, basic, content and time features. It has been shown that features can also be selected based on application-layer protocols such as HTTP, DNS and MQTT [9]. The authors use two additional data-sets, i.e., NIMS Botnet and an IoT simulation, and collect features relevant to these services from all three data-sets. ANN, NB and DT are applied along with AdaBoost as an ensemble method. On the UNSW-NB15 data-set, the accuracy achieved on DNS records is 99.54% and on HTTP records it is 98.97%.
The authors in [38] employ an Association Rule Mining (ARM) methodology to define a relation between two or more features in order to select the highest-ranked features. The features are selected by comparing UNSW-NB15 and KDD 99; the UNSW-NB15 features are computed on a portion of the KDD 99 data-set. They suggest useful attributes for most types of threats, although there are certain overlapping features. Among the most repetitive attributes across all attack types are: (1) time to live of packets from source to destination (sttl), (2) number of rows in 100 records where srcip and dstip are the same (ct_dst_src_ltm), (3) packet count from source to destination (spkts), (4) bits per second of the destination (load), (5) dropped or re-transmitted packets of the source (sloss), (6) dropped or re-transmitted packets of the destination (dloss), (7) number of rows with the same srcip in 100 rows (ct_src_ltm), and (8) rows with the same dstip and service in 100 rows (ct_srv_dst).
The authors in [39] take the features derived in [38] as a reference and employ several feature selection techniques on the UNSW-NB15 data-set, including CfsSubsetEval with the GreedyStepwise approach and InfoGainAttributeEval with the Ranker procedure, to determine an optimal range of features. The recommended features are then extracted from the UNSW-NB15 data-set, and the RF algorithm is applied using Weka [40]. The following subset of features is used:
• type of service (e.g., web, ftp, smtp, etc.) (service)
• number of bytes from source to destination (sbytes)
• time to live from source to destination (sttl)
• mean packet size transmitted by the source (smean)
• number of rows with the same sport and dstip in 100 rows (ct_dst_sport_ltm)
In a similar work [41], the authors selected the ten highest-ranking features from UNSW-NB15 by employing the Information Gain (IG) Ranking Filter pre-selected in [7]. The following features were selected:

Comparison with state-of-the-art
Four classification techniques were applied to these features, i.e., DT, ARM, ANN and NB, to determine and discover the roots of botnets; the highest classification accuracy of 93.23% was achieved using DT, at the cost of high computational power, the curse of dimensionality and over-fitting [8]. Zhou et al. [10] used the DFEL technique to extract high-level features from the data-set. The fundamental idea of DFEL is to use a huge amount of data to generate high-level features and apply the model to boost the detection speed of traditional ML algorithms. They applied ML techniques (GBT, KNN, DT, LR, GNB, SVM) on those features for evaluation and achieved a best accuracy of 93.13% using GBT in binary classification. Kumar et al. [11] created different numbers of clusters from the UNSW-NB15 data-set and evaluated their efficiency using the Silhouette measure. The silhouette value measures how similar an object is to its cluster. They evaluated those clusters using several DT-based ML techniques. Four variants of DT (C5, CHAID, CART and QUEST) were used, and a best accuracy of 89.96% was achieved using the C5 variant with 22 features.
In most of the recent literature, researchers use only a small portion of the UNSW-NB15 data-set, whereas in this work we use the full data-set. Moreover, this is the first time that feature selection according to network layer has been performed for the UNSW-NB15 data-set. The contributions of this work include imputation of missing values and binary and multi-class classification of malicious and normal packets using different combinations of TCP (18) and flow and MQTT (13) features. We also perform binary and multi-class classification using the top contributing features selected from the TCP and flow & MQTT feature sets (11), employing three different supervised learning classifiers: RF, SVM and ANN.

Method
In this section, we discuss our proposed solution along with the data-set used and the ML algorithms employed for detecting malicious network traffic. Figure 1 shows an overview of the steps followed in this work. At the start, we selected the UNSW-NB15 data-set and used its '.csv' files. These files had many issues, such as the imbalanced nature of the data-set, datatype mismatches with the classification algorithms and many missing/null values. We resolved all these problems in the data pre-processing step. Once the data was cleaned, the next step was creating different clusters according to network layers. Classification algorithms are then applied to those clusters to predict whether network traffic is normal or malicious. Details of all the steps are as follows:

Data pre-processing
Pre-processing corresponds to cleaning the data. It involves removing redundant features and features that do not render a high IG, and adding derived features, i.e., features computed from other features in the data. Certain ML models require data in a given format; for example, RF implementations do not allow null values, so records with null values must be removed or replaced with substitute values. This issue can be resolved using imputation. Moreover, some ML algorithms cannot process data types other than integers and floats. This compatibility issue can be overcome by typecasting the values or by removing the non-compliant features altogether. Another important dimension of data pre-processing is that the data should be compatible with more than one algorithm, for consistency and for reducing computational complexity. The pre-processing carried out on the UNSW-NB15 data-set is as follows:

Imbalanced data-set
Imbalance refers to an unfair class allocation within the data-set [42]. Data imbalance causes the classification to be biased. In the UNSW-NB15 data-set, this problem is apparent: class distribution percentages are shown in Table 2, and normal packets comprise over 87% of the total traffic. We use a technique similar to undersampling to overcome this problem: we reduced the number of normal packets by 50% but kept the original number of packets of the other classes. The remaining data amount to 60% of the original data.
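The undersampling step above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the column name `label` and the normal-class value `0` are assumptions about the csv layout.

```python
import pandas as pd

def undersample_normal(df, label_col="label", normal_value=0, frac=0.5, seed=42):
    """Keep all attack records but only a fraction of the normal (majority) class."""
    normal = df[df[label_col] == normal_value].sample(frac=frac, random_state=seed)
    attacks = df[df[label_col] != normal_value]
    # Re-shuffle so normal and attack records are interleaved again
    return pd.concat([normal, attacks]).sample(frac=1.0, random_state=seed)
```

With 87% normal traffic, halving only the normal class leaves roughly 60% of the original records, matching the reduction described above.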

Datatype resolution
Among the 49 features, there are 5 features in the UNSW-NB15 data-set whose data type is nominal (i.e., neither integer nor float), as shown in Table 3. We removed these features from the original data-set and are left with 44 features. The second-last and last features (the 43rd and 44th) are the binary and multi-class labels, respectively. The multi-class label is also of nominal type, but during algorithm execution it is converted to integer type using factorization.
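The factorization of the nominal multi-class label can be done with pandas, roughly as below; the example label values are illustrative, not drawn from the actual data.

```python
import pandas as pd

# Illustrative nominal labels; in practice these come from the multi-class label column.
attack_cat = pd.Series(["Normal", "DoS", "Exploits", "DoS", "Normal"])

# factorize assigns one integer code per distinct category, in order of appearance
codes, categories = pd.factorize(attack_cat)
```

The integer `codes` array can then be fed to classifiers that cannot handle nominal types, while `categories` retains the mapping back to the original class names.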

Imputation of missing values
We observed that the data-set has missing values, as shown in Fig. 2. Missing data can introduce significant bias, making the management and analysis of the data more difficult in addition to reducing accuracy. The features containing missing values are given in Table 4. It is evident from Fig. 2 that missing values occur predominantly in three features, i.e., ct_flw_http_mthd, is_ftp_login and ct_ftp_cmd. Also, records that have missing values in one of these features usually have missing values in one or more of the others; this overlap can be seen in Fig. 2. We had two options: remove these samples or carry out imputation. We chose imputation because removing the samples would negatively impact the accuracy of the solution.
In imputation, the substituted values can come from various techniques. In our solution, we apply three imputation techniques, i.e., mean, linear regression and multiple imputation, to overcome the problem of missing values, and we compare the results achieved by each.
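As a rough sketch of two of these techniques (mean and linear-regression imputation) using only pandas/NumPy; the column names are placeholders, and a production pipeline would more likely use a library imputer:

```python
import numpy as np
import pandas as pd

def mean_impute(df, cols):
    """Replace NaNs in each given column with that column's mean."""
    out = df.copy()
    for c in cols:
        out[c] = out[c].fillna(out[c].mean())
    return out

def regression_impute(df, target, predictor):
    """Fill NaNs in `target` from a linear fit on `predictor` (assumed complete)."""
    out = df.copy()
    observed = out[target].notna()
    slope, intercept = np.polyfit(out.loc[observed, predictor],
                                  out.loc[observed, target], deg=1)
    out.loc[~observed, target] = slope * out.loc[~observed, predictor] + intercept
    return out
```

Multiple imputation repeats a stochastic variant of the regression step several times and pools the results; it is omitted here for brevity.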

Feature selection and extraction
Feature extraction is one of the core concepts in ML and has an enormous influence on prediction accuracy. The data features used to train ML models largely determine the results. By using feature extraction, we reduce over-fitting, improve accuracy and also reduce training time. In our model, we use the feature importance technique of RF. Five features were removed during datatype resolution, and the last two are the labels (binary and multi-class). From the graph in Fig. 3a, it is observed that a few features contribute most toward classification, which can cause over-fitting. To overcome this, we remove the top five features whose feature importance is above 0.05, i.e., sbytes, sttl, Sload, smean and ct_state_ttl. Together, these five features account for almost 70% of the stored variance. The final feature importance graph for the remaining 37 features is shown in Fig. 3b. The flow/MQTT and TCP features are listed in Table 5. We applied the imputation techniques and recalculated the feature importance; the top features after applying the three imputation techniques separately are the same.
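The importance ranking described above can be reproduced in sketch form with scikit-learn's RandomForestClassifier; the synthetic data below stands in for the UNSW-NB15 features, and the 0.05 threshold mirrors the one used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 2] > 0).astype(int)           # only feature 2 carries signal

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = rf.feature_importances_    # non-negative, sums to 1 across features

# Features above the 0.05 threshold would be inspected as over-fitting candidates
dominant = [i for i, imp in enumerate(importances) if imp > 0.05]
```

On the real data-set, the five dominant features (sbytes, sttl, Sload, smean, ct_state_ttl) would surface in `dominant` and be dropped before clustering.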

Feature clusters
The remaining features are then clustered according to protocols. Clustering can be done according to packet- and flow-based features [7] or based on single-layer services [9]. We made six clusters, i.e., flow, DNS, HTTP, File Transfer Protocol (FTP), MQTT and TCP. MQTT was developed as a basic messaging protocol for equipment with restricted bandwidth and is therefore ideal for IoT applications. MQTT allows transmission of instructions through sensor nodes and monitors performance; as a result, connectivity among various devices is simple to achieve. The flow and MQTT features, shown in Table 5, have been combined into a single cluster. The TCP features are also shown in Table 5. A third cluster of top features was extracted from the flow/MQTT and TCP clusters; its features are likewise shown in Table 5. Feature importance was calculated, and the top five and top six features were kept from flow/MQTT and TCP, respectively. The feature importance graphs of flow/MQTT, TCP and the top features are shown in Fig. 4a-c, respectively.
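Column-wise cluster selection can be sketched as below; the feature names used here are placeholders, as the authoritative cluster membership lists are in Table 5.

```python
import pandas as pd

# Placeholder cluster membership; Table 5 gives the actual feature lists.
CLUSTERS = {
    "flow_mqtt": ["dur", "proto_code", "sloss", "dloss"],
    "tcp": ["swin", "dwin", "stcpb", "dtcpb"],
}

def select_cluster(df, name):
    """Project the data-set onto one feature cluster (ignoring absent columns)."""
    cols = [c for c in CLUSTERS[name] if c in df.columns]
    return df[cols]
```

Each projected frame can then be passed independently to the classifiers, which is what enables the per-cluster accuracy comparison in the results.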

Classification algorithms
We use supervised learning in this work, performing binary and multi-class classification on the data-set with RF, SVM and ANN. In RF, multiple DTs are used to make predictions on the data; each tree predicts a class, and the class predicted by the most trees becomes the model's prediction. SVM is a discriminative classifier characterized by a separating hyperplane. It provides an efficient way to classify new data instances based on their location relative to the division (or split). The motivation for ANN is biological, arising from the structure of the nervous system: it is built along the lines of brain functionality, with the workings of the neuronal system in the mammalian cerebral cortex as its structural basis, albeit at a much reduced scale.
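The majority-vote rule that turns per-tree predictions into an RF prediction can be illustrated with a few lines; this is a conceptual sketch, not part of the evaluated pipeline.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```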

Parameter tuning
For RF and SVM, we applied GridSearchCV over the important hyper-parameters. For RF, we applied GridSearchCV over n_estimators, criterion, max_features and min_samples_split, and found n_estimators = 100, criterion = gini, max_features = auto and min_samples_split = 5 to be optimal. For SVM, we found penalty = l1, loss = squared_hinge, tol = 10^-12 and C = 10^10 to be the most optimal hyper-parameters. We use an Intel(R) Xeon(R) CPU E3-1285 v6 @ 4.10 GHz with 8 CPUs, 4 cores per CPU and 2 threads per core. The system runs Ubuntu 18.04.1 LTS and has 64 GB of RAM. We also use Google Colaboratory (Colab), a free platform with pre-installed libraries, for executing the DL algorithms. By default, it provides 12 GB of RAM and a GPU.
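The tuning step can be sketched with scikit-learn's GridSearchCV. The toy data and reduced grid below are assumptions for illustration only; the grid used in this work covers n_estimators, criterion, max_features and min_samples_split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # separable toy target

# Reduced grid for illustration; each combination is scored by cross-validation
param_grid = {"n_estimators": [10, 50], "min_samples_split": [2, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)
best = grid.best_params_   # dict of the winning hyper-parameter values
```

GridSearchCV exhaustively cross-validates every grid point, which is why the grids are kept small relative to the hardware described above.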

Results
As mentioned earlier, we performed both binary and multi-class classification. We used the reduced data-set, i.e., the flow and MQTT features, the TCP features and the top features from the flow and TCP clusters. These experiments were conducted on data with the three imputations discussed earlier. With the help of parameter tuning, the optimal parameters were computed and used for binary and multi-class classification. We applied binary classification only on the full data-set, and multi-class classification on the full data as well as on the layer clusters.

Binary classification
In binary classification, we applied RF, SVM and ANN with the optimal hyper-parameters found during the parameter tuning phase. The accuracy results are shown in Fig. 7a.
Here, we can see that the best accuracy score, 98.67%, is obtained by applying RF with mean imputation. In SVM, the best accuracy score is 97.69%, also achieved with mean imputation. In ANN, the best accuracy score is 94.78%, achieved with multiple imputation. Confusion matrices are shown in Fig. 5. Overall, the accuracy achieved using the various imputation techniques is not significantly different. The mis-classification is more pronounced in Fig. 5g-i; one reason is the simplicity of the model used, which is corroborated by the confusion matrices of RF and SVM. Another reason is that some samples contain too much noise, and a weak classifier finds it difficult to classify them correctly. The confusion matrices show heat maps of the accuracy achieved by each classifier. From Fig. 5, we can see that the highest accuracy in terms of true positives and true negatives is achieved by RF, followed by SVM and ANN. In ANN, the percentage of malicious traffic mis-classified as normal traffic is around 15%.
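As a concrete illustration of how the binary confusion matrices and accuracy scores behind Fig. 5 are computed (with hypothetical labels, 0 = normal and 1 = malicious):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical ground truth
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]   # hypothetical model output

cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted class
acc = accuracy_score(y_true, y_pred)
print(cm)    # [[4 0]
             #  [1 3]]
print(acc)   # 0.875
```

Here the off-diagonal entry cm[1, 0] counts malicious packets mis-classified as normal, the failure mode noted for ANN above.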

Multi-class classification
In multi-class classification, we performed experiments on the reduced data-set with full features as well as with cluster-based features, using the same imputation techniques as in binary classification. All three ML algorithms are used for evaluation. Multi-class classification results using the reduced data-set are shown in Fig. 7b. We achieved the highest accuracy of 97.37% by applying RF on the regression-imputed data-set. With SVM, we achieved 95.67% accuracy, and with ANN, 91.67%. The confusion matrices for these experiments are shown in Fig. 6. Overall, from Fig. 7 we can see that RF outperformed SVM and ANN in both categories.
From the confusion matrices, we can see that the general trend of classification along the diagonal is very good, except for a few classes where the mis-classification rate is somewhat higher. In particular, DoS packets have been mis-classified as exploits by all three algorithms, and the same holds for the backdoor and analysis classes. This mis-classification is due to the imbalanced nature of the data-set. We have tried to mitigate this imbalance, but since the number of available samples varies widely across classes, the problem persists for the classes with fewer samples even after re-balancing. This causes a certain degree of over-fitting, which affects the classification of those classes.
The comparison graphs of the accuracy of all ML algorithms using feature clusters are shown in Fig. 8. We achieved the highest accuracies of 96.96%, 91.4% and 97.54% on the flow/MQTT, TCP and top features, respectively, using RF.
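Building a feature cluster amounts to selecting a column subset of the data-set. A minimal pandas sketch, with illustrative column names rather than the exact UNSW-NB15 cluster assignments of Table 5:

```python
import pandas as pd

# Tiny toy frame; real UNSW-NB15 has ~49 features and millions of rows.
df = pd.DataFrame({
    "proto": ["tcp", "udp"], "dur": [0.1, 0.2],   # flow-style features
    "swin": [255, 0], "tcprtt": [0.01, 0.0],      # TCP-style features
    "label": [0, 1],
})

flow_cluster = ["proto", "dur"]     # hypothetical flow/MQTT cluster
tcp_cluster  = ["swin", "tcprtt"]   # hypothetical TCP cluster
top_features = flow_cluster[:1] + tcp_cluster[:1]  # top picks from both

X_flow = df[flow_cluster]
X_tcp  = df[tcp_cluster]
X_top  = df[top_features]
print(X_top.columns.tolist())  # ['proto', 'swin']
```

Each subset is then fed to the same classifiers, which is how the per-cluster accuracies in Fig. 8 arise.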

Comparison with state-of-the-art
The most closely related research in the domain is presented in [8]. The authors applied various ML algorithms, including RF, SVM and MLP, and conducted both binary and multi-class classification, achieving highest accuracies of 89.8% and 78.2%, respectively, using CNN-1D.
The authors in [9] created DNS and HTTP feature clusters and applied various classification algorithms. They considered the complete data-set and extracted only the packets relevant to each cluster's feature set. Applying an ensemble technique, they achieved highest accuracies of 99.54% on DNS and 98.97% on HTTP. Similarly, DFEL [10] used six different algorithms and achieved a best accuracy of 93.13% using GBT in binary classification.
Four variants of DT (C5, CHAID, CART and QUEST) have been employed in [11]. The authors reduced the feature set using Information Gain (IG) and classified using varying numbers of features. They achieved their highest accuracy of 89.86% using 22 features with the C5 variant.
It is evident from our results that our proposed solution achieved better accuracy in both classifications. The comparison is presented in Table 6.

Discussion
In multi-class classification on the reduced data-set with full features, three classes, i.e., DoS, backdoor and analysis, show a high false alarm rate, and all three are mis-classified as exploits, as shown in Fig. 6. There are two main reasons behind this anomaly. First, the feature values of these classes differ only marginally, which makes it very difficult for an ML algorithm to classify them correctly. Second, even with the classical definitions of these four classes in hand, it is difficult for a human to tell them apart. Nevertheless, the majority of packets of these classes are classified correctly. This trend can generally be observed across all ML algorithms.
Results from Fig. 8 show that the variation in accuracy score comes not from changing the imputation technique but from changing the feature cluster. The reason is that the most relevant and important features in any packet belong to the flow category. When the top features from both categories are combined, the accuracy increases. This trend may also be due to the fact that the two features with missing values (Table 4) that require imputation are in the TCP cluster (Table 5), while none of them are in the flow cluster (Table 5). Confusion matrices for RF are shown in Fig. 9.

[Fig. 5 caption (displaced): In C_ij, C is the class, i is the label (0 or 1) and j indicates 0 = normal (N) and 1 = malicious (M). In RF and SVM the percentage of false alarms is very small; in ANN it is slightly higher, mostly due to malicious traffic being classified as normal.]

[Fig. 6 caption (displaced): Multi-class classification confusion matrices. a-c: RF with mean, multiple and regression imputation; d-f: SVM with mean, multiple and regression imputation; g-i: ANN with mean, multiple and regression imputation. In C_ij, C is the class, i is the label (0-8) and j indicates 0 = normal, 1 = exploits (E), 2 = reconnaissance (R), 3 = DoS (D), 4 = generic (G), 5 = shellcode (S), 6 = fuzzers (F), 7 = backdoor (B) and 8 = analysis (A).]
In our solution, we remove the imbalance in the data-set; to preserve the training quality of the algorithms, we have not reduced the malicious packets but have instead reduced the number of normal packets to remove the biased classification. Generally, from Figs. 6 and 9 it can be seen that accuracy increases with more training examples of a class. The diagonal values show promising accuracy in most cases, but using transport features only, the three classes DoS, backdoor and analysis are mis-classified as exploits. This may be because the features characterizing these types of attacks fall in the flow category. We have also addressed issues pertaining to the full data-set, namely the curse of dimensionality, over-fitting and imbalanced data, by removing a few features and by reducing the data-set. We used various imputation techniques to substitute missing values in the data. The effects of using the remaining features and of using clusters are shown in terms of accuracy by evaluating the results with multiple ML algorithms. Overall, in binary classification on the full data-set we achieved the highest accuracy score of 98.67% with RF using mean imputation. In multi-class classification, the highest accuracy score using the full data-set is 97.37% with RF using linear regression imputation. In cluster-based classification, RF outperformed the other ML algorithms, achieving 96.96% on flow features, 91.4% on TCP features and 97.54% on the top features from the flow and TCP clusters.
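The re-balancing step described above, keeping all malicious packets and down-sampling only the over-represented normal class, can be sketched as follows, assuming a pandas DataFrame with a hypothetical 'label' column where 0 denotes normal traffic:

```python
import pandas as pd

# Toy imbalanced data: 7 normal packets (label 0) vs. 3 attack packets.
df = pd.DataFrame({"feat": range(10),
                   "label": [0] * 7 + [1] * 2 + [2] * 1})

attack = df[df["label"] != 0]                 # keep every malicious packet
n_keep = attack["label"].value_counts().max() # size of the largest attack class
normal = df[df["label"] == 0].sample(n=n_keep, random_state=0)

# Recombine and shuffle so training batches mix the classes.
balanced = pd.concat([normal, attack]).sample(frac=1, random_state=0)
print(balanced["label"].value_counts().to_dict())
```

Capping the normal class at the size of the largest attack class is one plausible policy; the paper does not specify the exact ratio used.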
In the future, in order to increase the profiling accuracy of the patterns adopted by malicious traffic, more focus will be placed on the collection of appropriate features related to other IoT protocols. Using the suggested methodology together with the relevant collected features, the detection accuracy of known and unknown attacks will increase. Moreover, other data-sets will also be analyzed using these techniques.