IDSDL: a sensitive intrusion detection system based on deep learning

Device-free passive (DfP) intrusion detection system is a system that can detect moving entities without attaching any device to the entities. To achieve good performance, the existing algorithms require proper access point (AP) deployment. It limits the applying scenario of those algorithms. We propose an intrusion detection system based on deep learning (IDSDL) with finer-grained channel state information (CSI) to free the AP position. A CSI phase propagation components decomposition algorithm is applied to obtain blurred components of CSI phase on several paths as a more sensitive detection signal. Convolutional neuron network (CNN) of deep learning is used to enable the computer to learn and detect intrusion without extracting numerical features. We prototype IDSDL to verify its performance and the experimental results indicate that IDSDL is effective and reliable.

interference and less sensitive to different motions of human body [8], passive detection of moving humans with dynamic speed (PADS) [9] and phase mode (PM) [10] use finergrained CSI in passive intrusion detection. In contrast, CSI is stable to static environments and sensitive to dynamic environments, so it is a better detection signal.
Previous passive detection algorithms extract features such as mean and variance [11,12] from RSS and CSI data to accomplish the function. Their core is to determine whether there is a moving human in the monitoring area or not. Although the algorithms work effectively and accurately [10], they usually detect human motion that occurs on line-of-sight (LOS). Therefore, those algorithms require the AP to be placed near the doorway or other places where human motion will directly influence CSI near LOS path. This makes the effectiveness of the whole system compromised when the intruder does not pass through the LOS path.
To decrease the high requirement for AP deployment, we use CSI phase as the source signal, extract propagation components on independent paths as the detection signal, and design a CNN to construct IDSDL. Then the performance of this system is verified on commodity WiFi devices. The result demonstrates that IDSDL is able to detect human motion on non-line-of-sight (NLOS) path effectively.
The rest of this paper is organized as follows. We review related work in Sect. 2. Preliminaries about CSI and convolutional neuron network (CNN) are mentioned in Sect. 3. We introduce the methodology in detail in Sect. 4. Experiment settings and performance evaluation are elaborated in Sect. 5. Conclusions are drawn in Sect. 6.

Related work
The idea of passive intrusion detection is to judge the monitoring area is intruded or not by analyzing human's impacts on wireless signal.
When the wireless signal passes through human body, received signal strength (RSS) undergoes certain changes and fluctuations so that it could be used in passive detection. Youssef and Moussa [6,13] introduced the concept of device-free detection and proposed a maximum likelihood estimation-based algorithm (MLE) to improve the performance of device-free passive (DfP) system in real-world environments. The most well-known RSS-based motion detection, Radio Tomography Imaging (RTI) [11], which deployed a wireless sensor network in the target area to measure RSS to obtain an image of the moving people was proposed. On the foundation of RTI, variance-based radio tomography imaging (VRTI) [12] used the changes in RSS variance caused by people motion for motion detection. RASID [14] system used RSS standard deviation to improve detection accuracy. Since RSS is too sensitive to tiny changes in the environment, a device-free localization (DfL) [15] algorithm based on differential RSS, was proposed to overcome RSS's negative impact on environment.
Due to the coarse granularity and limited precision of RSS, people have cast their sights on fine-grained CSI. CSI was firstly applied in location technology. Pilot [16] tried device-free localization, using CSI and time correlation to detect anomalies in the channel and then locate the object. Apart from the location technology, CSI start to be utilized in motion detection. Fine-grained Device-free Motion Detection (FIMD) [17] achieved the accurate detection of sudden motion by using the time stability of CSI in static environment. Since most studies only make effective use of CSI amplitude, PADS [9] used a method that sanitizes CSI phase, then uses both CSI amplitude and phase in detection. PM [10] utilized a method based on PADS to extract the phase difference between adjacent antenna pairs as the detection signal for passive intrusion detection.
Previous researches extract certain numerical features from CSI signals for passive detection and require the AP to be arranged in proper places. Therefore, when human motion occurs on NLOS paths, their effectiveness would be reduced. As a result, if there is a person intrude from the door and the APs are placed away from the door, detection effect is not satisfactory. To overcome this disadvantage, we propose IDSDL, an algorithm that is able to sensitively detect human motion on NLOS.

Preliminaries
In this section, we briefly introduce necessary information about CSI and analyze the impact of multi-path effect on CSI.

Channel state information
In the wireless communication system, CSI describes the channel properties of the communication link in the subcarrier level and represents what a signal undergoes while passing through the subcarriers. In the orthogonal frequency-division multiplexing (OFDM) system, after passing through the multi-path channel, the received signal can be represented as: X and Y are signal vectors of the transmitting terminal and the receiving terminal, respectively, N represents additive Gaussian white noise in the channel and H represents a channel matrix. H is shown as follows: where − → H (f n ), n ∈ [1, N ] is the channel frequency response (CFR) in the frequency domain on each subcarrier and N is the total number of subcarriers.
The channel frequency response can also be expressed as where a i and θ i + 2π f τ i are the CSI amplitude and phase on the i th subcarrier, respectively, f is frequency, N is the total number of subcarriers and τ is the delay.

Difference between human motion on LOS and NLOS
The received CSI is a multipath [18] synthesized signal, and the CSI will undergo different changes when the signals on different paths are interfered. Assuming that the wireless signal propagation path in space is shown in Fig. 1, the intrusion from the door will only affect a small number of CSI signals on NLOS paths, while the LOS path is not affected. Numerical features used in previous researches such as mean, variance and covariance would be large when human motion happens on LOS. But these values would not be that (3) obvious when human motion happens on NLOS, especially on paths far away from LOS. Therefore, simple numeral features would not be effective. As shown in Fig. 2, we calculate the largest six eigenvalues of covariance matrix extracted from CSI phase, which is the feature utilized in detection in the literature [9]. It is obvious that the values obtained when human motion occurs on NLOS far from LOS are not much larger than those got when the monitoring area is clear, while the values of human motion on LOS is an order of magnitude larger. To solve this problem, we need a new kind of detection feature. We first use path decomposition to decompose the CSI signal to analyze the changes of the CSI signal on a certain independent path. The path-decomposed CSI signal can be used as a sensitive signal for intrusion detection because it can sensitively detect the change of the channel when the LOS signal does not change so much. Then we convert the signal into feature image instead of extracting numerical features because feature image retains not only the waveform characteristics of signals but also the timing characteristics of signal changes if we convert the signals of multiple consecutive moments into one feature image. This allows CNN to judge the occurrence of intrusion behavior based on signal changes at multiple times.

Convolutional neuron network
Deep learning is being used more and more in wireless communication networks, and it is commonly used to solve optimization problems in wireless communication networks. Chen et al. [19] formulates the joint learning, wireless resource allocation, and user selection problem as an optimization problem whose goal is to minimize an federated learning (FL) loss function that captures the performance of the FL algorithm. Yang et al. [20] investigates the problem of energy efficient transmission and computation resource allocation for FL over wireless communication networks. Wang et al. [21] studies the problem of optimizing the deployment of unmanned aerial vehicles (UAVs) equipped with visible light communication (VLC) capabilities.
In addition to optimization problem, deep learning is also used to solve classification problem. CNN is a neural network model that is useful for classifying images. CNN evolved from multi-layer perceptron (MLP) [22]. LeCun et al. [23] combined the idea The using process of CNN is as follows: (1) Create a dataset consisting of annotated images, or use an existing dataset. Annotations can be image categories (for classification problems), bounding boxes and classes (for object detection problems), or pixel-level segmentation for each object  of interest in the image (for instance segmentation problems). In our manuscript, the annotation is image category because we consider the intrusion detection as a classification problem.
(2) Features related to the current task are extracted from each image. This is the key point of modeling. For example, the features used to identify human faces are significantly different from those used to identify tourist attractions. Fortunately, the feature selection and extraction are learned automatically during the training process. (3) Train a CNN model. Training means providing many images to a CNN model, which will help CNN model learn how to solve the task at hand by using the features extracted from training images. (4) Evaluate the model by utilizing images not used during the training phase. By doing so, we test the effectiveness of the trained model.
Due to its characteristics of local connection and weight sharing, CNN is more suitable for processing image data. Weight sharing enables information to be shared between neurons and reduces the number of parameters significantly. Local connection is different from full connection. Local connection is that the neurons in layer n-1 only have connections with some neurons in layer n, instead of all of them, which reduces the probability of overfitting.
CNN consists of convolutional layers, pooling layers, and fully connected (dense) layers. Each convolutional layer contains multiple feature maps. Each feature map is a "plane" composed of multiple neurons, which extracts features of the input through a convolution filter. The sampling layer is also called the pooling layer, and its role is to perform subsampling based on the principle of local correlation, so as to reduce the amount of data while retaining useful information. By applying the convolutional layers and the sampling layers, CNN maps the original image into a feature vector and finally completes the recognition task in fully connected layers and the output layer. CNN's construction method of multi-hidden layer stacking, that makes each layer of CNN processes the output of the previous layer, can be regarded as that CNN processes the input signal layer by layer. The initial input representation that is not closely related to the output target is changed into a representation that is more closely related to the output target, making it possible to perform tasks that were difficult to complete based only on the last layer of output mapping previously.
CNN is widely used for feature extraction of pictures due to its special network construction and excellent performance. It can not only automatically extract features, but also have strong generalization capabilities.

Methodology
IDSDL mainly includes four parts: data preprocessing, path decomposition of CSI signal, CNN classification and motion detection, shown as Fig. 3. The system monitors CSI in the WiFi wireless network space, uses path components to reflect CSI changes sensitively and classifies the data by utilizing CNN to carry out motion detection. The system contains offline training and online detection. The CNN classifier is trained in the offline phase and new CSI data is detected in the online phase.

Data preprocessing
(1) Phase sanitization Due to the excessive offset error [24,25] included in CSI phase, phase sanitization is required to effectively utilize CSI phase [9]. The measured phase P ci can be linearly transformed to obtain raw phase P zi : Compared with P ci , P zi eliminates the random phase offset and has effective features necessary for intrusion detection in true phase simultaneously.
(2) Outlier removal Due to the influence of random noise, outliers appear in the measured CSI data. In order not to affect the performance of the system, filter is used to remove outliers.
Hampel filter [26]: All the data that do not belong to the interval [ µ − γ σ , µ + γ σ ] are regarded as outliers supposed to be removed, where µ is the median, σ is the absolute deviation of the median data sequence, γ is a parameter usually set to 3. Kalman filter [27]: Kalman filtering is a state equation of a linear system. Its basic idea is to use the input and output of the system to estimate the state of the entire linear system. Since the noise signal is doped in the measured data, the process of this optimal estimation is also regarded as a filtering process.
No matter which filtering method is adopted, outlier removal will improve system performance by reducing some unexpected errors. Figure 4 shows the filter results obtained by using Hampel filter and Kalman filter.
a Outlier removal with Hampel filter b Outlier removal with Kalman filter Fig. 4 Outlier removal

Path decomposition of CSI phase
According to the analysis in Preliminary, human motions on LOS and NLOS cause different changes in CSI. To extract a detection feature that is not overly affected by the LOS signal, we propose a path decomposition algorithm. The path decomposition is used to decompose the CSI signal for analyzing the changes of the CSI signal on a certain independent path. The path-decomposed CSI signal can sensitively detect the change of the channel when the LOS signal does not change so much. Therefore, it can be used as a sensitive signal for intrusion detection. Let CSI k denote the CSI on k th subcarrier, CSI 0 denote the CSI on the zeroth subcarrier, which is a virtual subcarrier whose frequency is f 0 . For path l, we introduce two symbols S l 0 and l k : where α l is the amplitude of CSI k on path l, f 0 is the frequency of CSI 0 , k f is the frequency difference between f 0 and f k , and τ l is the delay of CSI on the path l . Then, CSI k can be expressed as: In the 802.11n standard, when BW = 40 MHz and Ng = 4 (Grouping), the index of the sampled subcarriers ranges from − 58 to 58 where the interval of subcarrier index is 4. For convenience of presentation, suppose that there are five paths, we can construct a Hankel matrix: Combined with formula (5) and (6), X could be expressed as: where The problem of solving path components has turned into a mathematical problem. We define = VS ′ , where (5) then X can be expressed as: since S ′ = S ′ T , X is also presented as: where It is easy to prove: Obviously, V is a Vandermonde matrix. Therefore, the problem transforms into a problem of finding a Vandermonde decomposition of X , of which the solution 1 is components of CSI phase on 5 paths. It is well known that the number of paths in multipath effects is actually very large. Here we only assume that there are five paths and solve the CSI phase components on each path. We summarize the overall in Algorithm 1. We apply the algorithm on P zi , and the path-decomposed CSI phase is shown in Fig. 5.
We convert the path-decomposed CSI signal into feature image as the input of CNN classifier by converting the data of the CSI matrix into gray pixels. Feature image retains not only the waveform characteristics of signals but also the timing characteristics of signal changes if we convert the signals of multiple consecutive moments into one feature 1  1  1  1  1   1  4  2  4  3  4  4  4  5  4  1  8  2  8  3  8  4  8  5  8  1  12  2  12  3  12  4  12  5  12  1  16  2  16  3  16  4  16  5 image. Using these image inputs helps CNN detect the occurrence of intrusion behavior based on signal changes at multiple times.

CNN classifier
CNN is a classic deep learning model with good effect in the field of image processing and computer vision [28][29][30], which works pretty well on image classification tasks. An advantage of CNN lies in feature extraction [31]. Because the feature detection layer of CNN learns features from training data directly, it avoids explicit feature extraction, thus reducing the time cost on finding suitable numerical features. Using this CNNbased method and feature image improves the detection system. On the one hand, the feature image converted from the path-decomposed CSI signal retains the waveform characteristics of signals and the timing characteristics of signal changes. On the other hand, CNN effectively solves the problem that suitable numerical features for sensitively detecting human motion on NLOS are laborious to construct due to its ability to learn features from training data directly.
A CNN is designed as Fig. 6. The network uses three layers of convolution layer to extract the features of the input image, two max-pooling layers to scale down the size of feature maps and three dense layers to make classification.
The function of convolution operation is to determine a feature in the image, condense original image pixels of the convolution kernel size into one pixel, and extract features in the image after convolution operation. Large-sized convolution kernels can provide a larger receptive field, but meanwhile the amount of parameters is also larger, leading to more calculations. So multiple smaller convolution kernels are used in the second and third convolutional layers to ensure large receptive field while reducing the number of parameters. The max-pooling operation highlights the largest feature of 2 × 2 pixels, which is able to provide strong robustness and reduce the number of parameters. The softmax function is used in the third dense layer. The softmax function is a gradient logarithm normalization of a finite item discrete probability distribution which is suitable for multi-classification tasks.
In detail, the first convolutional layer has sixteen 5 × 5 convolution kernels with convolution step of 1, and due to zero padding, it outputs L × 30 × 16 feature map. The second convolutional layer owns 32 3 × 3 convolution kernels, and convolution step is also 1. The parameters of the third convolutional layer are consistent with those of the second one. Two max-pooling layers are after the second convolutional layer and the third convolutional layer, respectively. At this point, a flatten layer is performed to flatten 2-dimensional feature maps into 1-dimensional ones, connecting the convolutional layers and the dense layers. After three dense layers, the output is the prediction result we need.
Because CNN has a more complicated learning mechanism than conventional learning methods, our method does not have much advantage in terms of time complexity. For a single convolution layer, the time complexity is O( Where M is the size of output feature map, K is kernel size, Cin is number of input channels and Count is number of output channels. The time complexity of the entire neural D is the number of convolutional layers of the neural network. l represents the lth convolutional layer. C l indicates the number of output channels of this layer. C l−1 represents the number of output channels of the previous layer, that is, the number of input channels of this layer.
We have to train the CNN in the offline phase to make the model learn to identify different samples of static state and dynamic state. After CNN training, the network obtains the ability to classify CSI data correctly. We use the trained CNN to predict new data in the online phase.

Motion detection
Since the input of CNN is generally images and feature extraction is not required, we directly convert path-decomposed CSI phase data into images, which are also known as feature maps. At this point, after preparing a classifier and its input, the structure of our system is complete. We use the CNN to classify the input into data of static state and data of dynamic state. In motion detection, sample selection would make an affect on the performance of the trained network. To eliminate the contingency of sample selection, fivefold cross-validation is necessarily implemented in motion detection. Firstly all the feature maps are divided into five parts; secondly, one of them is selected as the validation set each time, while the other four are used as training set; thirdly, we conduct experiments with different validation set five times. After doing so, the five results are averaged to obtain one final experimental result. Take scenario 1 as an example, the five results are shown in Fig. 7 and specific detection accuracies are shown in Table 1.
The average TN rate with no human and TP rate with human walking are 97.18% and 97.79%, respectively.

Experimental scenario
To evaluate the performance of IDSDL, experiments are conducted on commodity WiFi devices. Specifically, we use a TP-LINK TL-WR886N router as the transmitting terminal, as well as a computer equipped with the Intel WiFi Link 5300 NIC [32,33] and CSITool as the receiving terminal. Both the router and the receiving antenna are set at 1.2 m high. We collect CSI data in a laboratory with multiple propagation paths, in which a variety of electronic devices and multiple sets of tables and chairs are placed.
In the experiments, CSI data of dynamic state is collected with an adult intruding from the door in the single participant scenario, and three adults intruding and moving in the area in the multiple participants scenario. The data of static state is collected with human being absent in the area. To verify our system possesses good effect on detecting both human motion on LOS and NLOS, experiments of four different RX-TX position scenarios are implemented.  Each scenario is shown in Fig. 8. The main moving area of experiment participants is around the conference table. Table 2 presents the parameters of experiment.

Evaluation metrics
We mainly use the two metrics for evaluating the performance of IDSDL. ① True negative rate (TNR) the probability that no human presence is correctly identified. ② True positive rate (TPR) the probability that a moving human presence is correctly classified. ③ Detection accuracy the probability of identifying the environmental changes in different scenarios.

Overall performance
Firstly, we conduct our experiments in different scenarios to examine the overall performance of our system. Figures 9 and 10 show the performance of motion detection with four RX-TX positions, where Fig. 9 is the detection results of one participant participating in the experiment, as well as Fig. 10 shows the detection results of multiple participants participating in the experiment. From these two figures, we can see that the TN rate and TP rate of scenario 2 and scenario 4 are higher because their RX-TX positions make the intrusion happen on paths that are close to LOS. TN rate and TP rate are a little lower because intrusion occurs on NLOS paths which are pretty far away from LOS in scenario 1 and scenario 3. Another conclusion we can draw from the detection results is that motion detection is more accurate when multiple participants appear in the experiments than when there is only one participant. The reason is that compared with single person participating, human motion of multiple participants will have a greater impact on CSI by affecting CSI on multiple paths simultaneously. The detection results obtained by using PADS and IDSDL in four experimental scenarios are shown in Fig. 11. The results indicate that IDSDL is more sensitive than PADS in scenarios in which human motion occurs on NLOS paths. The detection accuracy results obtained by using CNN, support vector machine (SVM), random forest (RF), naive Bayes (NB), and k nearest neighbor (KNN) in four experimental scenarios are shown in Fig. 12. Detection with conventional learning methods works when human motion occurs near LOS but cannot achieve good detection effects when human motion occurs on NLOS paths. However, the results indicate that intrusion detection with the CNN-based method is more sensitive when human motion occurs on NLOS paths that are far away from LOS path.
In summary, IDSDL can not only obtain good detection effect when human motion occurs on LOS path, but also works well when human motion occurs on NLOS paths. Thus, IDSDL is a sensitive intrusion detection system.

Impact of sliding window size
Intuitively, in terms of normal walking speed, an intrusion process lasts 5-6 s from the time the door is opened. When the size of sliding window is large enough to contain a complete intrusion process, the detection accuracy is better. Experiments are carried out by setting the window size to 5, 10, 20, 30, 40, and 50. Figure 13 shows the change of TN rate and TP rate with different window sizes. The detection accuracy is improved with the sliding window size getting larger because the CSI data intercepted by larger sliding window contains more CSI changing features. However, it is not a panacea. When the size increases to a certain threshold, the rising detection accuracy tend is stalled.

Impact of window sliding step
Window sliding step affects the number of times the image features repeated on the image samples, and indirectly affects system performance. This impact is mainly Since the sliding step is short, the same CSI sequence data can generate more feature map samples, which leads to more comprehensive acceptable features, so that the highest TP rate and TN rate reach 99.51% and 99.43%, respectively. However, from Table 3 we realize that a larger number of samples leads to a larger training time consumption as well as increases the maximum detection accuracy. So we can be certain that if computer performance is good enough and time is not limited, short sliding steps are better choices, and if it is to consume shorter time while pursuing good detection performance, the sliding step of 5 is better. Taking the time consumption with the sliding step of 5 as an example, the total average time consumption during training phase is about 2.5 s per sample, while the processing time for a single test sample is about 10 ms, which is a short response time.

Impact of intrusion distance
We define some intrusion distances to verify the sensitivity of the system to human motion on the NLOS path. In the experiment, intrusion distance of 0.2, 0.4, 0.6, 0.8, and 1.0 m are set. An adult opens the door and enters for those preset distances. The  experimental result is presented in Fig. 15. We can see from the figure that our system is sensitive to this kind of human motion far away from the LOS path. Since we use propagation components of CSI phase as detection signal and our algorithm can only obtain components on five paths, detection accuracy is not getting higher as the intrusion distance is larger.

Results and discussion
In this paper, we design and implement a WiFi sensing system for detecting intrusion by using CSI in the physical layer of WiFi network as detection signal. In this paper, we utilize a path decomposition algorithm and CNN to improve the sensitivity of passive intrusion detection system. The path-decomposed CSI phase can more accurately reflect the human motion occurring on NLOS paths. The application of CNN improves the performance of the detection system as well as simplifies construction of feature engineering. The proposed system solves the problem that human motion happens on NLOS paths is not well detected. Multiple experiments are conducted to verify the performance of IDSDL. The average detection accuracy in four different scenarios is 98.69% with single participant, and 98.91% with multiple participants. Compared with previous algorithms, IDSDL can sensitively detect human motion on NLOS paths and improve system reliability.