In this paper, we propose a neural-network-based hyperspectral target detection model, SEHyp, whose network structure is shown in Fig. 3. SEHyp is a convolutional neural network built on a first-order target detection framework and contains three major modules: backbone, neck, and head. The backbone module consists of six ResSEblock blocks that extract feature information from the input data. The neck module consists of SPP blocks and a pyramidal convolutional structure; it takes the feature information output at different stages of the backbone as input and fuses it so that targets of different sizes can be detected. Finally, the head module is the output module of the network. It outputs the predicted values: the coordinates of the target frame, the target category, and the confidence level, which indicates whether the target frame contains a target.
SEHyp backbone
As noted in the introduction, the spectral characteristics of hyperspectral images require modifications to the feature extraction module. This paper therefore proposes the attention residual module (ResSEblock), whose network structure is shown in Fig. 4. The module is built on a residual network, and an attention mechanism is added at its final output stage so that feature information between channels can be extracted.
ResSEblock contains N Resblocks, where N is the multiplier shown next to Resblock in Fig. 4. A convolution first halves the number of channels of the input feature map, and the two branches are then fused. This reduces both the number of parameters and the amount of computation while keeping the output channel count unchanged. Finally, the output of the ResSEblock passes through the SE block, which selectively emphasizes informative features through an attention mechanism and thereby extracts nonlinear features between spectral bands effectively.
Resblock mainly consists of 1×1 and 3×3 convolution kernels. The 1×1 convolution kernels compress the number of channels, which reduces both the number of parameters and the amount of computation of the model. The residual structure effectively prevents exploding and vanishing gradients as the network becomes deeper. It works through skip connections: the input feature map is added to the output feature map, and the result is passed to the next layer. Finally, nonlinearity is introduced with the Mish activation function, given in formula (1).
$$\mathrm{Mish}(x) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right)$$
(1)
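As an illustration, formula (1) can be evaluated for a single scalar as in the following minimal sketch. The helper name `softplus` and the overflow-safe rewriting of ln(1 + e^x) are choices made for this example and are not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Return ln(1 + e^x) without overflowing for large |x|.

    Uses the identity ln(1 + e^x) = max(x, 0) + ln(1 + e^(-|x|)).
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation as written in formula (1): x * tanh(ln(1 + e^x))."""
    return x * math.tanh(softplus(x))
```

For large positive inputs the function approaches the identity, and for large negative inputs it approaches zero, which is why it can be used as a smooth alternative to ReLU-style activations.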
The SE block (Squeeze-and-Excitation block) was proposed by Hu et al. [30]. It is an implementation of the attention mechanism that improves the response of channel features. The SE module adaptively recalibrates the feature channels: it learns to use global information to selectively enhance informative channel features and suppress less useful ones.
The structure of the SE module is shown in Fig. 5. The module operates in three steps. First, global average pooling is performed on U, producing channel descriptors of size 1 × 1 × C. Second, two 1 × 1 convolutions are applied to these descriptors. The first convolution compresses the C channels into C/r channels, where r is the compression ratio, and is followed by the ReLU activation function to add nonlinearity; the second convolution restores the C channels and is followed by the sigmoid activation function. Third, the resulting 1 × 1 × C weights are multiplied with the corresponding channels of the input feature map, and the product serves as the input feature map of the next level. The mathematical formula is as follows:
$$\mathrm{Out} = X \cdot \mathrm{Sigmoid}\left(F_{2}\left(\mathrm{ReLU}\left(F_{1}\left(F_{sq}\left(X\right), W_{1}\right)\right), W_{2}\right)\right)$$
(2)
Here \(W_{1}\) and \(W_{2}\) are the parameters of the two convolution operations \(F_{1}(\cdot,\cdot)\) and \(F_{2}(\cdot,\cdot)\), and \(F_{sq}\) denotes the global average pooling step. By learning adaptive channel weights, the SE module enhances important channels and weakens unimportant ones, which improves the feature representation. Adding this attention mechanism to the feature extraction network therefore lets the network highlight important features and can improve its accuracy.
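The three steps of formula (2) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the feature map is a nested list of C channels, the two 1 × 1 convolutions are written as weight matrices acting on the pooled vector (which is equivalent for a 1 × 1 × C input), biases are omitted, and the function name `se_block` is hypothetical.

```python
import math


def se_block(x, w1, w2):
    """Apply squeeze-and-excitation to a feature map.

    x  : list of C channels, each an H x W list of lists (assumed layout)
    w1 : (C/r) x C weights of the first 1x1 convolution (bias omitted)
    w2 : C x (C/r) weights of the second 1x1 convolution (bias omitted)
    Returns the recalibrated feature map and the C channel weights.
    """
    # Squeeze (F_sq): global average pooling gives one value per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation, part 1 (F_1 + ReLU): compress C channels to C/r.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Excitation, part 2 (F_2 + sigmoid): restore C channels, weights in (0, 1).
    scale = [
        1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
        for row in w2
    ]
    # Recalibration: multiply every value of a channel by that channel's weight.
    out = [[[s * v for v in row] for row in ch] for ch, s in zip(x, scale)]
    return out, scale
```

Because the sigmoid keeps every weight strictly between 0 and 1, the block can only attenuate channels relative to one another; it never changes the spatial layout of the feature map.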
SEHyp neck
The neck of the SEHyp model consists of an SPP module and pyramid-structured convolutional layers. Feature information is fused through pooling operations of different sizes together with upsampling or downsampling, and the fused feature \(S_{i}\) is output. The SPP module applies max pooling with 1 × 1, 3 × 3, 5 × 5 and 13 × 13 kernels. Pooling at different sizes gives the feature values different receptive fields, so different feature information is obtained, which improves the network's ability to detect small targets and its localization accuracy.
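The multi-kernel pooling can be sketched as below. The text does not state the stride, the border handling, or how the pooled maps are combined, so this sketch assumes stride 1, windows clipped at the border (so the spatial size is preserved), and concatenation along the channel axis; the function names are hypothetical.

```python
def max_pool_same(channel, k):
    """Stride-1 max pooling with a k x k window on one H x W channel.

    The window is clipped at the image border, so the output keeps the
    input's spatial size (an assumption of this sketch).
    """
    h, w = len(channel), len(channel[0])
    r = k // 2
    return [
        [
            max(
                channel[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            )
            for j in range(w)
        ]
        for i in range(h)
    ]


def spp(feature_map, kernel_sizes=(1, 3, 5, 13)):
    """Pool every channel with each kernel size and stack the results.

    feature_map : list of C channels, each an H x W list of lists
    Returns a list of len(kernel_sizes) * C channels.
    """
    out = []
    for k in kernel_sizes:
        out.extend(max_pool_same(ch, k) for ch in feature_map)
    return out
```

With a 1 × 1 kernel the pooled map equals the input, so the original features are carried through alongside the maps with larger receptive fields.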
The role of the neck module of the SEHyp model is to fuse the feature information extracted by the backbone module so that the network can better detect targets of different sizes. As shown in Fig. 6, the ith ResSEblock in the backbone module is represented by \(\mu_{i}(x_{i}, w_{i})\), where \(x_{i}\) denotes the input data of the ith block and \(w_{i}\) denotes the network parameters of the block. The neck module is represented by \(Q(a, w_{Q})\), where a is the input data of the neck module and \(w_{Q}\) is its network parameters. The backbone outputs \(\mu_{5}(x_{5}, w_{5})\), \(\mu_{6}(x_{6}, w_{6})\) and \(\mu_{7}(x_{7}, w_{7})\) are used as the neck input a. Convolution layers at different stages capture different feature information, and the receptive field of each pixel grows with network depth. Feeding features with different receptive fields into the neck therefore makes feature fusion more effective and improves the detection accuracy for targets of different sizes.
SEHyp head
The SEHyp head module is the output module of the model. To address the coupling between tasks in the output module, two separate branches are used to predict the target class and the target coordinates, as shown in Fig. 7.
Object classification is a classification problem, while object location prediction is a regression problem. If both predictions share the same convolution operation, their information is mixed, which can make the regression more difficult. Moreover, the spectral information unique to hyperspectral images plays a crucial role in detection. Two parallel convolutional branches are therefore used, one for classification and one for coordinate prediction, so that each branch handles its own task and the regression becomes easier.
The head module is denoted by \(H(S_{i}, w_{H_{i}})\), where \(S_{i}\) is the input of the head module, i.e., the feature information output by the neck module. Its outputs are \(Class_{i}\), \(Box_{i}\) and \(Prediction_{i}\), where \(Class_{i}\) is the probability of each category, \(Box_{i}\) contains the center-point coordinates of the target box together with its width and height, and \(Prediction_{i}\) is the confidence level, which indicates whether a target exists in the box. The classification branch produces outputs of size (52, 52, ClassNum × 3), (26, 26, ClassNum × 3) and (13, 13, ClassNum × 3). Here, ClassNum is the number of classes the model can distinguish, and 3 is the number of prediction frames that the model predicts for each grid cell. The coordinate branch produces outputs of size (52, 52, 15), (26, 26, 15) and (13, 13, 15), where 15 = 3 × (4 + 1): four coordinate values and one confidence value for each of the three boxes.
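The output sizes quoted above follow directly from the grid size, the number of classes, and the three boxes per cell. The short sketch below recomputes them; the function name `head_output_shapes` and its parameter names are chosen for this example only.

```python
def head_output_shapes(class_num, grid_sizes=(52, 26, 13), boxes_per_cell=3):
    """Return (classification_shape, coordinate_shape) for each output scale.

    Each box carries class_num class scores in the classification branch and
    4 coordinate values plus 1 confidence value in the coordinate branch.
    """
    shapes = []
    for k in grid_sizes:
        cls_shape = (k, k, class_num * boxes_per_cell)
        box_shape = (k, k, (4 + 1) * boxes_per_cell)
        shapes.append((cls_shape, box_shape))
    return shapes


# Example with a hypothetical 4-class problem.
for cls_shape, box_shape in head_output_shapes(class_num=4):
    print(cls_shape, box_shape)
```

Taken together, the two branches provide (ClassNum + 5) × 3 values per grid cell at each scale.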
Loss function
Training the model by backpropagation requires a loss function that measures the difference between the predicted and true values. The final predicted output has size K × K × ((ClassNum + 5) × 3), where K is the side length of the grid, so the grid contains K × K cells in total. Each cell predicts three prediction frames, and each predicted frame requires the predicted class, the coordinates of the frame, and the confidence. The loss function is shown in formula (3).
$$\begin{aligned}
\mathrm{loss}\left(\mathrm{object}\right) & = \lambda_{\mathrm{coord}} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{obj}} \left(2 - w_{i} \times h_{i}\right)\left[1 - CIOU\right] \\
& \quad - \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{obj}} \left[\hat{C}_{i} \log\left(C_{i}\right) + \left(1 - \hat{C}_{i}\right)\log\left(1 - C_{i}\right)\right] \\
& \quad - \lambda_{\mathrm{noobj}} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} \left(1 - I_{ij}^{\mathrm{obj}}\right) \left[\hat{C}_{i} \log\left(C_{i}\right) + \left(1 - \hat{C}_{i}\right)\log\left(1 - C_{i}\right)\right] \\
& \quad - \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left[\hat{p}_{i}\left(c\right)\log\left(p_{i}\left(c\right)\right) + \left(1 - \hat{p}_{i}\left(c\right)\right)\log\left(1 - p_{i}\left(c\right)\right)\right]
\end{aligned}$$
(3)
In formula (3), the indicator \({I}_{ij}^{\mathrm{obj}}\) states whether there is a target in the jth prediction box of the ith grid: it is 1 when there is a target and 0 when there is not. Therefore, when there is no target in the grid, only the confidence loss weighted by \(\lambda_{\mathrm{noobj}}\) is calculated. The first term of the loss function is the loss for the predicted frame, for which the CIOU algorithm [15, 31] is used. With the traditional IOU loss function [32], the IOU is 0 whenever the two boxes do not intersect, so the loss cannot reflect the distance between the two boxes. Nor does the IOU accurately reflect how the two boxes overlap. The CIOU loss function solves the first problem by including the Euclidean distance between the center points of the two frames, and it adds a scale loss for the predicted frame, which improves the accuracy of the regression. \(\lambda_{\mathrm{coord}}\) is the weight coefficient, K is the side length of the grid, M is the number of predicted frames per grid, w and h are the width and height of the predicted frame, and x and y are the coordinates of the center point of the predicted frame. The next two terms are the confidence loss, which is calculated using cross-entropy. The confidence loss is still calculated when there are no objects in the grid, but its share of the total loss is controlled by the weight \(\lambda_{\mathrm{noobj}}\). The last term is the class loss, which also uses the cross-entropy loss function and is only calculated when there is a target in the grid.
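To make the role of the center-distance and scale terms concrete, the sketch below computes CIOU for two boxes using the commonly published definition: IOU minus the squared center distance divided by the squared diagonal of the smallest enclosing box, minus a weighted aspect-ratio term. The box format (center x, center y, width, height), the function name `ciou`, and the assumption of strictly positive widths and heights are choices made for this example; the paper's own implementation may differ in detail.

```python
import math


def ciou(box_a, box_b):
    """CIOU of two boxes given as (cx, cy, w, h) with w > 0 and h > 0."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Convert center format to corner coordinates.
    a_x1, a_y1, a_x2, a_y2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
    b_x1, b_y1, b_x2, b_y2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
    # Plain IOU: intersection area over union area.
    inter_w = max(0.0, min(a_x2, b_x2) - max(a_x1, b_x1))
    inter_h = max(0.0, min(a_y2, b_y2) - max(a_y1, b_y1))
    inter = inter_w * inter_h
    iou = inter / (aw * ah + bw * bh - inter)
    # Squared Euclidean distance between the two center points.
    rho2 = (ax - bx) ** 2 + (ay - by) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(a_x2, b_x2) - min(a_x1, b_x1)) ** 2 + (
        max(a_y2, b_y2) - min(a_y1, b_y1)
    ) ** 2
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(bw / bh) - math.atan(aw / ah)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return iou - rho2 / c2 - alpha * v


def ciou_loss(box_a, box_b):
    """The bracketed factor [1 - CIOU] used in the first term of formula (3)."""
    return 1.0 - ciou(box_a, box_b)
```

For two non-overlapping boxes the plain IOU is 0 no matter how far apart they are, whereas `ciou_loss` keeps growing with the center distance, so it still tells the optimizer in which direction to move the predicted frame.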
In summary, the learning process of the hyperspectral target detection network model is as follows.

1. Perform data processing on the training data.
2. Input the data into the network model.
3. Extract features with the backbone module.
4. Fuse the features extracted by the backbone using the neck module.
5. Input the fused features into the head module to produce the final predictions.
6. Calculate the loss from the corresponding image labels and the predicted values, and update the network parameters.
7. Repeat steps (2)–(6) until the network converges or the specified number of training iterations is completed.