This section reviews background information related to CS, sparse recovery, and ML approaches.

### Compressive sensing and sparse recovery

Using a random sensing matrix, CS merges data measurement and compression into a single operation. CS applies to compressible signals, i.e., signals that are either explicitly sparse or admit a sparse representation in a certain domain [26].

Let us assume a signal vector \(\boldsymbol {y} \in \mathbb {C}^{N}\). A compressed version of *y* can be obtained by applying a measurement matrix \(\boldsymbol {\Phi } \in \mathbb {C}^{M\times N}\) as *y*_{c}=*Φ**y*, where *M*≪*N*. Hence, a reduction in dimensionality from *N* to *M* is achieved. A high-dimensional version of the original signal can be reconstructed from this low-dimensional measurement via sparse recovery [26].
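As a minimal numerical sketch of this dimensionality reduction (the dimensions *N*=256 and *M*=64 and the Gaussian choice of *Φ* are illustrative assumptions, not values taken from this work):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 256, 64  # original and compressed dimensions (illustrative)

# Random Gaussian measurement matrix Phi (M x N), a common CS choice
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

y = rng.standard_normal(N)   # signal vector of dimension N
y_c = Phi @ y                # compressed measurement: dimension N -> M
```

Here `y_c` has only *M* entries, so measurement and compression happen in the single matrix-vector product.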

Generally speaking, let us assume that a signal *y* admits sparse coding over a dictionary \(\left (\boldsymbol {D} \in \mathbb {C}^{N \times K}\right)\). The signal can be represented in terms of *D* as *y*≈*Dw*, where \(\boldsymbol {w}\in \mathbb {C}^{K}\) is a sparse coefficient vector. The calculation of *w* can be cast as follows.

$$ {\arg\min\limits}_{\boldsymbol{w}}{\left\|\boldsymbol{y}-\boldsymbol{Dw}\right\|}_{2}^{2} ~ s.t. ~ {\left\|\boldsymbol{w}\right\|}_{0} <S, $$

(1)

where *S* denotes the sparsity level of the signal. Sparse recovery is an NP-hard problem; however, sparse recovery methods offer efficient approximate solutions. As shown in (1), the *ℓ*_{0} pseudo-norm exactly quantifies the sparsity level, but its minimization is mathematically intractable and highly complex. Therefore, only approximate solutions to *ℓ*_{0} minimization exist, such as the matching pursuit and orthogonal matching pursuit (OMP) approaches. Alternatively, the problem can be addressed by relaxing the *ℓ*_{0} minimization to *ℓ*_{1}-norm minimization; although the *ℓ*_{1} norm is only a loose bound on sparsity, the relaxed problem is convex and can be solved by linear programming. Thus, replacing *ℓ*_{0} minimization with *ℓ*_{1} minimization significantly reduces the computational complexity of sparse coding. However, *ℓ*_{1} minimization requires information about the noise level of the signal being recovered. Therefore, in this work, we adopt approximate *ℓ*_{0} minimization through the OMP algorithm.^{Footnote 1}
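The *ℓ*_{1} relaxation mentioned above can be cast as a linear program by splitting *w* into nonnegative parts. The following sketch (dimensions, seed, and the noiseless setting are illustrative assumptions) solves the basis-pursuit form min∥*w*∥_{1} s.t. *Dw*=*y* with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
M, K, S = 20, 40, 3                     # measurements, atoms, sparsity (illustrative)

D = rng.standard_normal((M, K))
w_true = np.zeros(K)
idx = rng.choice(K, size=S, replace=False)
w_true[idx] = rng.standard_normal(S)
y = D @ w_true                          # noiseless observation

# Basis pursuit as an LP: write w = u - v with u, v >= 0,
# then min sum(u + v) subject to D(u - v) = y
c = np.ones(2 * K)
A_eq = np.hstack([D, -D])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
w_hat = res.x[:K] - res.x[K:]
```

Since the true sparse vector is feasible for this LP, the solver's optimum can never have a larger *ℓ*_{1} norm, which is the sense in which *ℓ*_{1} serves as a convex surrogate for sparsity.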

The intrinsic sparsity of the signal can be revealed by a dictionary. Such a dictionary can be formed of fixed basis functions such as the Fourier basis, Gabor functions, wavelets, or contourlets. Alternatively, it can be a learned dictionary, obtained by training over a set of training signals \(\boldsymbol {Y} \in \mathbb {C}^{N\times L}\) [28]. The dictionary learning process can be formulated as

$$ {\arg\min\limits}_{\boldsymbol{W, D}} {\left\|\boldsymbol{W}_{i}\right\|}_{0} ~ s.t. ~ {\left\|\boldsymbol{Y}_{i}- \boldsymbol{DW}_{i}\right\|}_{2}^{2} < \epsilon ~\forall~ ~i, $$

(2)

where *ϵ* represents the error tolerance. Since the problem is intractable and non-convex, most dictionary learning algorithms perform the learning by iteratively alternating between a sparse representation stage and a dictionary update stage. The K-SVD algorithm [28], for example, is one of the most widely used dictionary learning algorithms.
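The alternating structure can be sketched as follows. For brevity this uses a MOD-style least-squares dictionary update (Method of Optimal Directions) rather than K-SVD's per-atom SVD update; all dimensions and the number of outer iterations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, L, S = 16, 24, 200, 3                  # signal dim, atoms, training signals, sparsity

Y = rng.standard_normal((N, L))              # training data
D = rng.standard_normal((N, K))
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms

def sparse_code(D, Y, S):
    """Sparse representation stage: S-sparse OMP coding of every column of Y."""
    W = np.zeros((D.shape[1], Y.shape[1]))
    for i in range(Y.shape[1]):
        r, idx = Y[:, i].copy(), []
        for _ in range(S):
            idx.append(int(np.argmax(np.abs(D.T @ r))))       # atom selection
            w, *_ = np.linalg.lstsq(D[:, idx], Y[:, i], rcond=None)
            r = Y[:, i] - D[:, idx] @ w                        # residual update
        W[idx, i] = w
    return W

for _ in range(5):                           # alternate the two stages
    W = sparse_code(D, Y, S)                 # 1) sparse representation stage
    D = Y @ np.linalg.pinv(W)                # 2) dictionary update (MOD step)
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
```

Each outer iteration fixes one unknown (*D* or *W*) and solves for the other, which is the standard workaround for the joint problem being non-convex.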

The abovementioned dictionary learning is a computationally demanding process. Therefore, efficient alternatives to the classical dictionary learning approach are needed for CR-related applications [21]. In this context, sampled dictionaries are an efficient alternative: a sampled dictionary is obtained by picking a set of randomly selected data vectors that serve for sparse coding without applying an expensive learning process. This trades a tolerable loss in the representational power of the dictionary for a reduction in computational complexity. In [22], the use of sampled dictionaries is justified by their role in representing data points of a specific class, which share a general similarity. Similarly, sampled dictionaries are used in this work to represent signals.
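Constructing such a sampled dictionary amounts to a random column selection (the pool size *L*=500 and dictionary size *K*=64 below are illustrative assumptions; normalizing the atoms is common practice, not a requirement stated in the text):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, K = 32, 500, 64                      # signal dim, data pool size, dictionary size

Y = rng.standard_normal((N, L))            # pool of available data vectors

# Sampled dictionary: pick K data vectors at random -- no learning involved
cols = rng.choice(L, size=K, replace=False)
D = Y[:, cols]
D = D / np.linalg.norm(D, axis=0)          # unit-norm atoms
```

The entire "training" cost is a single random draw, which is the complexity advantage over iterative learning.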

### Residual components in pursuit sparse coding

A widely used sparse representation algorithm is OMP. This algorithm iteratively obtains the coefficients of the sparse coefficient vector (*w*). In particular, each iteration identifies the location and adjusts the value of one nonzero element of *w*. This is achieved by selecting one atom (column) from the dictionary *D* and adjusting its respective weight.

To implement the atom selection and coefficient update processes described above, algorithms such as OMP define a so-called residual signal *r*. Conceptually, *r* represents the signal portions that have not yet been represented by the selected dictionary atoms. Hence, sparse coding initializes the residual with the signal itself, as *r*←*y*. In the first iteration, the sparse representation algorithm loops through all dictionary atoms and selects the one most similar to the current residual *r*. Once this atom is selected, its weight is calculated, and the residual is updated by subtracting the resulting one-atom sparse approximation from the current residual. The updated residual is then treated as a new signal for which another dictionary atom is selected and another coefficient is calculated, and the process continues until a certain halting condition is met.

The key point in the above sparse coding approach is that the energy of the residual should decrease dramatically as sparse coding progresses. Intuitively, this is because more atoms are selected, and thus more signal portions are excluded from the residual.
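This residual-energy decay can be observed directly in a small OMP loop (all dimensions and the random data are illustrative assumptions; the least-squares refit over the growing support guarantees the residual norm never increases):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, S = 64, 128, 8                     # signal dim, atoms, iterations (illustrative)

D = rng.standard_normal((N, K))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
y = rng.standard_normal(N)               # signal to be sparse-coded

r, support, res_norms = y.copy(), [], [np.linalg.norm(y)]
for _ in range(S):
    # atom selection: column most correlated with the current residual
    support.append(int(np.argmax(np.abs(D.T @ r))))
    # coefficient update: least-squares fit over all selected atoms
    w, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
    r = y - D[:, support] @ w            # residual = unrepresented signal portion
    res_norms.append(np.linalg.norm(r))
```

After the loop, `res_norms` is a non-increasing sequence: each added atom removes another portion of the signal from the residual.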

### Machine learning for classification

The success of ML algorithms in application areas such as computer vision, fingerprint identification, image processing, and speech recognition has made them appealing for wireless communication as well [29]. ML algorithms fall into three categories: supervised, unsupervised, and reinforcement learning. Supervised learning algorithms are widely used for classification problems in which the number of classes is known and the class labels of the training samples are available.

Among the many supervised learning algorithms, the feed-forward neural network has received growing interest for classification problems since it can recognize classes accurately and quickly [30]. Such a network can be single-layer or multi-layer. Although single-layer networks are computationally inexpensive, they can only handle simple problems. Multi-layer networks, which include one or more hidden layers, increase the computational complexity but can solve more complex problems. Besides the number of layers, the number of neurons in the hidden layers also affects the accuracy and complexity. Therefore, setting these hyper-parameters properly is quite important. Moreover, both complexity and accuracy can be improved by feature extraction (using domain knowledge). Along this line, CS is used in this work to extract features with the aim of improving the ML performance.
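As a structural sketch of such a multi-layer feed-forward classifier, the forward pass below uses one hidden layer with a softmax output; the layer sizes, ReLU activation, and random (untrained) weights are all illustrative assumptions rather than the network configuration used in this work:

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hidden, n_classes = 20, 16, 3           # hypothetical layer sizes

# Randomly initialized weights stand in for a trained network
W1, b1 = 0.1 * rng.standard_normal((n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = 0.1 * rng.standard_normal((n_classes, n_hidden)), np.zeros(n_classes)

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)            # hidden layer (ReLU)
    z = W2 @ h + b2                             # output layer
    p = np.exp(z - z.max())
    return p / p.sum()                          # softmax class probabilities

x = rng.standard_normal(n_in)                   # e.g., a CS-based feature vector
probs = forward(x)
pred = int(np.argmax(probs))                    # predicted class index
```

The hidden-layer width `n_hidden` is exactly the kind of hyper-parameter whose choice trades accuracy against complexity, as noted above.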

### System model

The system model is intended to characterize the coexistence of legitimate and illegitimate source nodes. Thus, it consists of a PU node, an SU node, and an illegitimate node, as presented in Fig. 1. In this setting, the SU opportunistically exploits the spectrum in the presence of an illegitimate node that can launch either a PUEA or a jamming attack. A jammer transmits a random signal, whereas the PU transmits a structured signal and the PUE transmits a structured signal that mimics that of the legitimate PU.

We can represent the transmitted signal as *x*=*A**s*, where *A* is an *N*×*N* coefficient matrix with entries *a*_{i,j}, *i*,*j*=1,…,*N*, and *s*=[*s*_{1}(*t*),…,*s*_{N}(*t*)]^{T} is the transmitted data vector. Each coordinate of *s* is given as \(s_{i}(t)= \sum ^{\infty }_{k=-\infty } d_{k} u(t-k T_{s}) e^{j2\pi f_{c,o}t}\), where *T*_{s} is the symbol duration, *f*_{c,o} is the center frequency, *d*_{k} denotes the digitally modulated data symbols, *u*(*t*) is the pulse shaping filter, and *o*=1,2,…,*N*.

The signal at the receiver sent by any node can be written as

$$ \boldsymbol{y}=\boldsymbol{h}\boldsymbol{x} +\boldsymbol{n}, $$

(3)

where *h* is a multipath Rayleigh fading channel between any transmitter-receiver pair and *n* is additive white Gaussian noise. Owing to spatial decorrelation, the channels between different transmitter-receiver pairs are assumed to be distinct [31].
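A simulation of model (3) can be sketched as follows. For simplicity this assumes a single flat-fading complex gain per transmitter-receiver pair (so |*h*| is Rayleigh distributed) rather than a full multipath channel, and the sample count and SNR are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 128                                   # number of samples (illustrative)

# Transmitted signal x (placeholder complex baseband samples)
x = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

# Flat-fading simplification: one zero-mean complex Gaussian gain,
# so its magnitude |h| follows a Rayleigh distribution
h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)

# AWGN scaled to a target receive SNR
snr_db = 10.0
noise_var = (np.abs(h) ** 2 * np.mean(np.abs(x) ** 2)) / 10 ** (snr_db / 10)
n = np.sqrt(noise_var / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

y = h * x + n                             # received signal, as in (3)
```

Drawing an independent `h` for each transmitter-receiver pair reproduces the spatial-decorrelation assumption stated above.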