ABOS: an attention-based one-stage framework for person search

Person search is of great significance to public safety research, with applications in crime surveillance, video surveillance, and security. It aims to locate and identify a queried person in a set of complete scene images. The main cause of false recalls and missed detections in person search is person occlusion in the images. To improve the accuracy of person search when the queried person is occluded, this paper proposes an attention-based one-stage framework for person search (ABOS) that uses an anchor-free model as its baseline. The method uses a channel attention module to express different forms of occlusion and a spatial attention module to highlight the target regions of occluded pedestrians. These attention modules integrate deep and shallow features to guide the network to attend to the visible area of the occluded target and to extract the semantic information of the pedestrians. Experimental results on the CUHK-SYSU and PRW datasets show that the proposed attention-based person search method performs better than existing methods, achieving 93.7% mAP on the CUHK-SYSU dataset and 46.4% mAP on the PRW dataset.

Person search methods can solve such problems well, and they play a vital role in crime retrieval, cross-camera personnel tracking, and personnel activity analysis.
Person search frameworks are mainly divided into three types: (1) two-step frameworks, which treat person re-identification and pedestrian detection as two independent tasks (as shown in Fig. 1a); (2) one-step two-stage frameworks, which use the ROI Align operation of two-stage detectors to achieve end-to-end training for detection and re-identification (as shown in Fig. 1b); and (3) one-step one-stage frameworks, which implement the re-id task on top of an anchor-free model. The one-step one-stage framework currently achieves the best results in the person search domain, and representative work based on the anchor-free model is presented in [7,8]. In this paper, the jointly optimized one-step one-stage framework is chosen to design a person search network model (as shown in Fig. 1c).
The first anchor-free person search model was proposed to solve the misalignment problem at three different levels (scale, region, and task) through a feature alignment aggregation module [7]. A previous study shows that the channels of convolutional features have different degrees of activation response for different parts of the pedestrian [9]. The form of target occlusion can therefore be described by different feature channels, so that different occlusion problems can be handled effectively. CBAM [10] is an efficient attention module composed of a channel attention module and a spatial attention module. Visualization analysis reveals that a network with the CBAM module embedded attends more accurately to the correct object to be classified during inference, achieving better detection results.
Inspired by the above results, an anchor-free model (ABOS) based on the spatial attention module and the channel attention module is proposed in this paper. The TOIM loss, which combines the advantages of the triplet loss function and the online instance matching (OIM) loss function, is chosen as the training objective. Our method emphasizes the importance of hard samples and simplifies the batch construction process of the triplet loss function, greatly speeding up convergence and effectively improving the precision of person re-identification. To validate the accuracy of the ABOS model, experiments are performed on CUHK-SYSU and PRW, two datasets of complete scene images annotated with identity information and bounding boxes. The main contributions of this paper are as follows.
• An anchor-free model (ABOS) that fuses the spatial attention module and the channel attention module is used to express different forms of occlusion and highlight the target regions of occluded pedestrians.
• By fusing the deep and shallow features of the attention modules, the network is guided to focus on the visible regions of the occluded targets and to extract the semantic information of pedestrians, thus improving the accuracy of person search when the queried person is occluded.
• The attention mechanism helps the proposed model learn better features and adaptively select information in the scene.
Extensive experimental results show that the proposed method is not only fast and efficient, but also addresses the low accuracy of one-stage networks compared with existing methods. Under the original evaluation protocols provided with the CUHK-SYSU and PRW datasets, the mean average precision (mAP) achieved is 93.7% and 46.4%, respectively.
The rest of this paper is organized as follows. The related work on person search, pedestrian detection, and person re-identification is presented in Sect. 2. The attention-based one-stage framework for person search (ABOS) is discussed in detail in Sect. 3. Experimental results are analyzed in Sect. 4. Finally, the work of the full paper is summarized and future plans are discussed in Sect. 5.

Related work
This section reviews previous research on person search, pedestrian detection, and person re-identification, on which the proposed attention-based person search method (ABOS) is built.

Person search
The person search problem was first introduced by Xu et al. [11], who proposed a sliding-window strategy that combines person re-identification and pedestrian detection to model the common and unique characteristics of pedestrians, but its efficiency is low.
Since the release of the PRW and CUHK-SYSU datasets, the field of person search has attracted extensive attention, and a number of methods have been proposed to improve its effectiveness. Zheng et al. [12] proposed a cascaded fine-tuning strategy and a confidence-weighted similarity metric to implement a two-step person search framework. Xiao et al. [1] combined the OIM loss function with a single Faster R-CNN network for joint training, which enhanced the discriminative ability of the features and realized a one-step two-stage end-to-end person search framework. Many improved algorithms were derived from this method, such as the multi-loss algorithm [13], which fuses pre-training and multiple losses; the IEL algorithm [14], which refines the loss and makes the learned features more robust to inaccurate pedestrian recognition; and the IAN algorithm [15], which uses a deeper network and adds a center loss (CL) function. Some researchers approach person search from a new perspective. Liu et al. [16] recast the person search task as a detection-free process and used convolutional long short-term memory networks to recursively correct the position of pedestrian boxes and match them to obtain accurate pedestrian locations. Chang et al. [17] proposed the relational context-aware agents (RCAA) algorithm, which introduced deep reinforcement learning into the person search framework for the first time. Lan et al. [18] proposed a cross-level semantic alignment (CLSA) approach for multi-scale pedestrian matching. Yan et al. [7] introduced the first anchor-free model, which improves the efficiency and simplicity of the one-step model and achieves the best results.
We improve its network structure and increase the precision of person search based on the anchor-free model in this paper.

Pedestrian detection
Early pedestrian detection methods are mainly based on linear classifiers and handcrafted features, such as integral channel features (ICF) [19], aggregated channel features (ACF) [20], and the deformable part model (DPM) [21]. With the rapid development of deep learning methods in the field of object detection, methods based on convolutional neural networks can extract more discriminative features. Ouyang et al. [22] proposed a joint deep model that deals with the occlusion problem by jointly learning the visibility and features of different body parts. Tian et al. [23] proposed the StrongParts model to deal with the occlusion problem. Zhang et al. [24] applied the general-purpose object detection method Faster R-CNN to pedestrian detection, but its speed was not satisfactory. The methods in [25,26] achieved better results through various adjustments. Cai et al. [27] introduced the CompACT algorithm to learn CNN detector cascades.
The methods introduced above are mainly two-stage detectors. Among one-stage detectors, the RetinaNet model was proposed by Lin et al. [28] with a focal loss to address the class imbalance problem. The fused DNN detector proposed by Du et al. [29] uses SSD detectors as pedestrian region candidate networks. Multiple parallel networks are used to refine the candidate regions, and semantic segmentation information is also incorporated into the detection process, effectively addressing the small-scale and occlusion problems. Yan et al. [7] developed a one-stage anchor-free detector, achieving the best results.
Furthermore, the attention mechanism can also be introduced to solve the occlusion problem. Zhang et al. [30] proposed to express different forms of occlusion using the inter-channel attention mechanism, each of which can be expressed as a weighted combination between different channels.
In this paper, we adopt an anchor-free detector combined with an attention mechanism to enhance feature extraction, train the network to focus on the distinctive parts of the pedestrian, and mitigate the accuracy degradation caused by occlusion.

Person re-identification
Person re-identification aims to retrieve a queried pedestrian from a set of pedestrian candidates. Early person re-identification methods are mainly based on metric learning and hand-designed features. Currently, CNN-based re-identification methods fall into two main categories: classification models and siamese models. These two types of models are often trained as feature extractors with siamese loss [31,32], triplet loss [33], and cross-entropy loss [34,35]. Cheng et al. [36] trained CNN models with triplet samples to maximize the feature distances between different pedestrians and minimize the feature distances between images of the same pedestrian. Xiao et al. [37] classify identities to learn features while using a dual or triplet loss function. The classification model proposed by Zheng et al. [12] achieves higher accuracy than the siamese model. Han et al. [38] proposed a novel proxy triplet loss, which solves the problem that a standard triplet loss function cannot be constructed in person search.
In recent years, attention [39,40] mechanisms have been employed to learn better pedestrian recognition features. The PDC model [41] enhances pedestrian features with pose normalized images and re-weighting using channel attention. The HydraPlus-Net model [42] aggregates multiple feature layers within a spatial attention region. Xu et al. [31] solved the problems of person re-identification and pose estimation in a joint framework, where pose estimation results generate spatial attention maps and visibility scores.
To address the regional misalignment to which re-identification is prone in the anchor-free framework, this paper introduces both the channel attention mechanism and the spatial attention mechanism into the anchor-free framework and follows the "re-id first" principle to generate more discriminative and robust feature embeddings, making person search more accurate.

Methods
This paper designs an attention-based one-stage framework (ABOS) for person search, which integrates a new anchor-free pedestrian detector and TOIM loss re-identification technology. The steps are as follows: (1) The anchor-free pedestrian detector in [43] is improved to integrate the attention module, which can fully guarantee the performance of the detection task; (2) TOIM loss [7] is integrated with the improved anchor-free pedestrian detector in this paper and follows the "re-id first" principle to effectively solve the problem of re-identification task in person search without generating additional re-id features; and (3) training and testing are performed on the CUHK-SYSU and PRW datasets, and the experimental results demonstrate the feasibility and advancement of the ABOS model.

Model structure
The ABOS framework is designed as an improvement of the FCOS network [43]. As shown in Fig. 2, for the input image $I \in \mathbb{R}^{3 \times H \times W}$, we extract a set of features $\{C_1, C_2, C_3, C_4\}$ from ResNet-50. According to [44], the low-level features in the shallow convolutional layers retain the spatial information used to construct edges. The first layer of the feature pyramid aggregates features of the multi-level feature maps in the backbone. For re-id, we only need to learn from the largest output feature map $P_1$.

Attention module
In this paper, a channel attention module and a spatial attention module are used to obtain more discriminative feature representations. Specifically, the feature layer $F_1$, obtained by fusing $\{C_1, C_2\}$ through upsampling and concatenation, is fed into the spatial attention module, while $F_3$, obtained by fusing $\{C_3, C_4\}$, is fed into the channel attention module.

Channel attention module (CA)
The channel attention module is mainly concerned with "what" is meaningful in a given input feature map $F$. Each channel of the feature map is treated as a feature detector, and the module generates a descriptive vector $V \in \mathbb{R}^{C \times 1 \times 1}$ that expresses the importance of each channel. As shown in Fig. 3, for the input feature map $F_3$, both average pooling and max pooling are used to aggregate the global information of each channel, which generates the average pooling feature description vector $V_{avg} \in \mathbb{R}^{C \times 1 \times 1}$ and the maximum pooling feature description vector $V_{max} \in \mathbb{R}^{C \times 1 \times 1}$, respectively. Then, the two description vectors are fed into a shared multilayer perceptron (MLP) with one hidden layer to generate the channel attention vector $V_c$. The calculation formula is as follows:
$$V_c = \sigma\big(W_1\,\delta(W_0 V_{avg}) + W_1\,\delta(W_0 V_{max})\big)$$
where $\delta$ denotes the ReLU activation function, $\sigma$ denotes the sigmoid function, and $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the weights of the two fully connected layers of the MLP, where $r$ denotes the dimensionality reduction ratio. $F_{chn}$ is obtained by weighting $F_3$ with $V_c$. The calculation formula is as follows:
$$F_{chn} = V_c \otimes F_3$$
where $\otimes$ denotes channel-by-channel multiplication.
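
To make the data flow of the channel attention module concrete, the following is a minimal pure-Python sketch on nested lists. It is an illustration under stated assumptions, not the authors' implementation: the function name `channel_attention`, the helper names, and the absence of bias terms in the MLP are all hypothetical choices, and a real model would use tensor operations in PyTorch.

```python
import math

def relu(v):
    # Element-wise ReLU (the delta in the text).
    return [max(0.0, x) for x in v]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Matrix-vector product for a weight matrix stored as a list of rows.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def channel_attention(F, W0, W1):
    """F: C x H x W nested lists; W0: (C/r) x C; W1: C x (C/r).

    Returns the channel attention vector V_c and the reweighted map F_chn.
    Bias-free MLP layers are an assumption of this sketch.
    """
    # Global average and max pooling per channel give two length-C descriptors.
    v_avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in F]
    v_max = [max(max(row) for row in ch) for ch in F]

    # The same two-layer MLP is applied to both descriptors.
    def mlp(v):
        return matvec(W1, relu(matvec(W0, v)))

    # Sum the two MLP outputs and squash to (0, 1).
    v_c = [sigmoid(a + b) for a, b in zip(mlp(v_avg), mlp(v_max))]
    # Channel-by-channel multiplication of the input by its attention weight.
    F_chn = [[[v_c[c] * x for x in row] for row in ch] for c, ch in enumerate(F)]
    return v_c, F_chn
```

With C = 2 and r = 2, `W0` is a 1 x 2 matrix and `W1` is a 2 x 1 matrix; every entry of the returned `V_c` lies strictly between 0 and 1, so each channel is scaled down rather than amplified.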

Spatial attention module (SA)
The spatial attention module is mainly concerned with "where" the informative part is, and spatial attention maps are generated from the spatial relationships between features. In this paper, the spatial attention map is used to reactivate the input features so that the model focuses on the occluded target pedestrians and suppresses the interference of the background. As shown in Fig. 4, for the input feature map $F_1$, average pooling and max pooling are first applied to generate the spatial description feature maps $F_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F_{max} \in \mathbb{R}^{1 \times H \times W}$, respectively, which are then concatenated. The spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$ is then generated by a 3×3 convolutional layer. The calculation formula is as follows:
$$M_s = \sigma\big(f^{3 \times 3}([F_{avg}; F_{max}])\big)$$
where $f^{3 \times 3}$ denotes the 3×3 convolutional layer and $\sigma$ denotes the sigmoid function. The final feature map $F_{sp}$ is obtained by reactivating the input feature map $F_1$, which is calculated as follows:
$$F_{sp} = M_s \odot F_1$$
where $\odot$ denotes the element-by-element multiplication of the feature map.
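
The spatial attention computation can be sketched in the same way. The snippet below is a hedged, pure-Python illustration, not the authors' code: the function name `spatial_attention`, the `kernel` layout (one 3×3 filter for the average map and one for the max map), the optional `bias`, and the use of zero padding to keep the H × W size are assumptions made for this sketch.

```python
import math

def spatial_attention(F, kernel, bias=0.0):
    """F: C x H x W nested lists; kernel: 2 x 3 x 3 convolution weights.

    kernel[0] is applied to the channel-average map and kernel[1] to the
    channel-max map. Returns the attention map M_s (H x W) and F_sp.
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    # Pool across the channel axis to get two 1 x H x W descriptors.
    f_avg = [[sum(F[c][i][j] for c in range(C)) / C for j in range(W)]
             for i in range(H)]
    f_max = [[max(F[c][i][j] for c in range(C)) for j in range(W)]
             for i in range(H)]
    maps = [f_avg, f_max]  # the concatenated two-channel input

    M = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = bias
            for m in range(2):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        y, x = i + di, j + dj
                        # Out-of-range positions contribute zero (zero padding).
                        if 0 <= y < H and 0 <= x < W:
                            acc += kernel[m][di + 1][dj + 1] * maps[m][y][x]
            M[i][j] = 1.0 / (1.0 + math.exp(-acc))  # sigmoid

    # Element-by-element reweighting, broadcast over all channels.
    F_sp = [[[M[i][j] * F[c][i][j] for j in range(W)] for i in range(H)]
            for c in range(C)]
    return M, F_sp
```

An all-zero kernel yields a uniform attention map of 0.5, which halves every feature value; a trained kernel would instead raise the weights at locations where the pooled responses indicate a visible pedestrian region.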

Optimizer
In person search, occlusion causes inaccurate regression and missed detections, so it is necessary to narrow the feature differences of the same pedestrian instance and expand the feature differences between different pedestrian instances. Since person search datasets have a large number of training IDs with a limited number of samples per ID, the triplet-aided online instance matching (TOIM) loss function is used for re-id, which combines the OIM loss function and the triplet loss function. In OIM, a lookup table (LUT) $V \in \mathbb{R}^{D \times L}$ is defined to store the features of all labeled identities, where $L$ denotes the table size and $D$ denotes the feature dimension. A circular queue $U \in \mathbb{R}^{D \times Q}$ is also defined to store the features of unlabeled identities, where $Q$ denotes the size of the queue. Given a label $i$ and an input feature $x$, the probability that $x$ belongs to the $i$-th labeled identity, according to these two data structures, is:
$$p_i = \frac{\exp(v_i^T x / \tau)}{\sum_{j=1}^{L} \exp(v_j^T x / \tau) + \sum_{k=1}^{Q} \exp(u_k^T x / \tau)}$$

Fig. 4 Spatial attention module
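where $v_i^T x$ denotes the cosine similarity between $x$ and the $i$-th labeled identity, $u_k^T x$ denotes the cosine similarity between $x$ and the $k$-th unlabeled identity, and $\tau$ denotes the hyperparameter that controls the softness of the probability distribution. The goal of OIM is to maximize the expected log-likelihood of feature $x$:
$$\mathcal{L} = E_x[\log p_t]$$
where $t$ denotes the label of the labeled identity, and its gradient with respect to $x$ can be derived as:
$$\frac{\partial \mathcal{L}}{\partial x} = \frac{1}{\tau}\Big[(1 - p_t)\,v_t - \sum_{j=1,\, j \neq t}^{L} p_j v_j - \sum_{k=1}^{Q} q_k u_k\Big]$$
where $q_k$ is the probability, defined analogously to $p_i$, that $x$ matches the $k$-th unlabeled identity in the queue. Maximizing $\mathcal{L}$ is equivalent to minimizing the OIM loss $L_{OIM} = -\mathcal{L}$. For the triplet loss, the sets of candidate features with identity labels $n$ and $m$ are defined as $X_n = \{x_{n,1}, \ldots, x_{n,S}, v_n\}$ and $X_m = \{x_{m,1}, \ldots, x_{m,S}, v_m\}$, where $S$ denotes the number of features sampled for a pedestrian, $v_i$ denotes the $i$-th feature in the LUT, and $x_{i,j}$ denotes the $j$-th feature of the $i$-th pedestrian. The triplet loss function is calculated as follows:
$$L_{tri} = \sum_{pos,\, neg} \max\big(0,\; M + D_{pos} - D_{neg}\big)$$
where $M$ denotes the margin between negative and positive samples, $D_{neg}$ denotes the distance between negative sample pairs, and $D_{pos}$ denotes the distance between positive sample pairs. In summary, the TOIM loss function is calculated as follows:
$$L_{TOIM} = L_{OIM} + L_{tri}$$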
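
The pieces of the TOIM loss described above can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, features are assumed to be L2-normalized so that a dot product equals cosine similarity, the triplet term is written as a hinge over Euclidean distances (an assumption), and the OIM term is taken as the negative log-likelihood so that both terms are minimized.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def oim_probs(x, lut, queue, tau=0.1):
    """Softmax over LUT (labeled) and queue (unlabeled) similarities.

    Returns (p, q): p[i] for the i-th labeled identity and q[k] for the
    k-th unlabeled identity. tau = 0.1 follows the setting in the text.
    """
    logits = [dot(v, x) / tau for v in lut] + [dot(u, x) / tau for u in queue]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return probs[:len(lut)], probs[len(lut):]

def oim_loss(x, t, lut, queue, tau=0.1):
    # Negative log-likelihood of the true labeled identity t.
    p, _ = oim_probs(x, lut, queue, tau)
    return -math.log(p[t])

def triplet_loss(anchor, positive, negative, margin):
    # Hinge form: zero once the negative is farther than positive + margin.
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)

def toim_loss(x, t, lut, queue, triplets, margin, tau=0.1):
    """triplets: list of (anchor, positive, negative) feature tuples."""
    tri = sum(triplet_loss(a, p, n, margin) for a, p, n in triplets)
    return oim_loss(x, t, lut, queue, tau) + tri
```

The small temperature makes the softmax sharp: a feature that aligns with its own LUT entry receives a probability close to 1, so the OIM term is near zero and the triplet term dominates only when a negative pair is closer than a positive pair plus the margin.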

Experiments and results analysis
This section describes the details of the experiment and the experimental environment.

Datasets
CUHK-SYSU [1] is a scene-diverse, large-scale person search dataset containing 18184 images, 8432 identities, and 96143 annotated bounding boxes (96131 after deduplication). The dataset is mainly derived from street photography and movies. It is divided into a training set and a test set: the training set includes 11206 images, 5532 identities, and 55272 annotated bounding boxes (55260 after deduplication), and the test set contains 6978 images, 2900 identities, and 40871 annotated bounding boxes. A set of protocols with gallery sizes ranging from 50 to 4000 is defined on the test set. PRW [12] is a large-scale person search dataset containing 11816 images, 932 identities, and 34304 annotated bounding boxes. The dataset was acquired by six different cameras on a university campus. It is divided into a training set and a test set: the test set contains 6112 images and 450 identities, and the training set contains 5704 images and 482 identities.

Evaluation metrics
This paper uses the mean average precision (mAP) and top-1 accuracy of the original evaluation protocol provided with each dataset to evaluate the person search performance of the ABOS model. Average precision (AP) summarizes the precision-recall curve for a category as the weighted average of the precision obtained at each confidence threshold, where the weight is the increase in recall relative to the previous confidence threshold. It is calculated as follows:
$$AP = \sum_{n} (R_n - R_{n-1})\, P_n$$
where $R_n$ and $P_n$ represent the recall and precision at the $n$-th confidence threshold, respectively. mAP is the average AP over all categories, and top-1 is the fraction of cases in which the highest-ranked prediction matches the actual result.
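
As a worked illustration of these metrics, the following minimal Python sketch computes AP as the recall-weighted sum of precisions, then mAP and top-1. The function names and the input layout (per-threshold recall and precision lists, and per-query ranked match flags) are assumptions of this sketch, not part of the original evaluation code; recall before the first threshold is taken to be 0.

```python
def average_precision(recalls, precisions):
    """AP as the sum over thresholds of (R_n - R_{n-1}) * P_n, with R_0 = 0."""
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(ap_values):
    # Mean of the per-category (per-query) AP values.
    return sum(ap_values) / len(ap_values)

def top1_accuracy(ranked_matches):
    """ranked_matches: one list of booleans per query, best-ranked first."""
    hits = sum(1 for matches in ranked_matches if matches and matches[0])
    return hits / len(ranked_matches)
```

For example, with recalls [0.5, 1.0] and precisions [1.0, 0.5], the first threshold contributes 0.5 × 1.0 and the second 0.5 × 0.5, giving AP = 0.75.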

Experimental details
In this paper, PyTorch and MMDetection are used to implement the attention-based one-stage framework for person search (ABOS). The backbone network is a ResNet-50 pre-trained on ImageNet, from which 4 layers $\{C_1, C_2, C_3, C_4\}$ are taken. The attention network consists of $\{C_1, C_2\}$ + SA, $\{C_2, C_3\}$, and $\{C_3, C_4\}$ + CA, and detailed information is shown in Table 1. Training is run on an NVIDIA Corporation GP104GL GPU with 8 GB of memory, with the learning rate set to 0.0001. $\tau$ in the OIM loss function is set to 0.1. The network is optimized with stochastic gradient descent (SGD) using a batch size of 1 and a weight decay of 0.0001. The training set includes the ground-truth bounding boxes and the detection bounding boxes of the training images. During training, the long edge of each image is randomly resized to a value between 667 and 2000, and during testing the images in the test set are resized to 1500×900.
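
The multi-scale training resize described above can be illustrated with a short Python sketch. It assumes that the aspect ratio is preserved when the long edge is rescaled, which the text does not state; the function name `random_long_edge_size` and the `rng` parameter are hypothetical, and the real pipeline would rely on the resizing transforms of the detection toolbox.

```python
import random

# Fixed size used at test time, as stated in the text.
TEST_SIZE = (1500, 900)

def random_long_edge_size(width, height, lo=667, hi=2000, rng=random):
    """Pick a random long-edge length in [lo, hi] and scale both sides.

    Assumption of this sketch: the aspect ratio is kept unchanged.
    Returns the new (width, height) as integers.
    """
    long_edge = rng.randint(lo, hi)          # inclusive on both ends
    scale = long_edge / max(width, height)   # factor that maps the long edge
    return round(width * scale), round(height * scale)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled sizes reproducible across runs.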

Comparison with other attention models
Table 2 compares the effectiveness of recent person search methods that incorporate attention networks. IDE+ATT-part [9] proposed an attention-guided self-matching speed learning method to balance different occlusion levels. DHFF [45] treats the shallow network as an attention network and achieved 90.2% mAP and 91.7% top-1 on the CUHK-SYSU dataset, and 41.1% mAP and 70.1% top-1 on the PRW dataset. QEEPS [4] proposed an improved QSSE-Net based on the Squeeze-and-Excitation attention network, which achieved 88.9% mAP and 89.1% top-1 on the CUHK-SYSU dataset, and 37.1% mAP and 76.7% top-1 on the PRW dataset. Compared with these methods, the ABOS model, which fuses both attention networks, achieves the best mAP on both datasets (93.7% and 46.4%).

Comparison with prior art
The anchor-free model is used as the baseline for comparison experiments on the CUHK-SYSU and PRW datasets to verify the effectiveness of the attention network designed in this paper. The workstation graphics card is an NVIDIA Corporation GP104GL with 8 GB of memory. Due to the limited hardware conditions in the laboratory, the project [7] is reproduced with a reduced parameter configuration: the learning rate and weight are reduced by a factor of 10, and the batch size is reduced to 1.
Comparison of results on the CUHK-SYSU dataset: The mAP and top-1 obtained by reproducing the project [7] with the reduced parameter configuration are 92.5% and 93.0%, respectively. The proposed network model is implemented on this basis. As shown in Table 3, the ABOS model improves the results on the CUHK-SYSU dataset, reaching 93.7% mAP and 94.3% top-1. Compared with the first anchor-free person search model AlignPS [7], the proposed model improves mAP by 1.2 percentage points and top-1 by 1.3 percentage points.
Comparison of results on the PRW dataset: The mAP and top-1 obtained by reproducing the project [7] with the reduced parameter configuration are 45.4% and 83.3%, respectively. As shown in Table 3, the proposed model improves the mAP metric and top-1 metric for person search on the PRW dataset, where mAP is 46.4% and  Fig. 5. Person search becomes more challenging as the gallery size increases, and performance is best when the gallery size is equal to 100. It can be concluded that our method outperforms most state-of-the-art person search methods. Therefore, the proposed model is promising for practical applications.

Results and discussion
To address the pedestrian occlusion problem commonly faced in person search, this paper proposes an attention-based one-stage framework for person search (ABOS). A spatial attention mechanism and a channel attention mechanism are introduced into the multi-scale feature fusion: the channel attention module establishes the correlation between channels in the deep feature map, while the spatial attention module extracts the spatial information of the shallow feature map to highlight the occluded pedestrian targets. The TOIM loss is chosen in the re-identification module to further enhance the network's ability to learn pedestrian similarity and to distinguish between pedestrians. In this paper, the two tasks of person re-identification and pedestrian detection are fused in one network model for joint optimization, thus improving the detection accuracy and recognition rate when the queried person is occluded. To verify the effectiveness of the proposed method, training and testing are carried out on the CUHK-SYSU and PRW datasets. The experiments show that the ABOS model outperforms most existing person search models and achieves high accuracy. Due to the complexity of surveillance scenarios, person search is still a long way from practical deployment. In the future, we will work on the following four areas.
• Due to the limited hardware conditions in the laboratory, the proposed model does not reach its best possible results. We will use a graphics card with a higher configuration and search for the optimal parameters through further experiments.
• Although the person search model proposed in this paper achieves good results in general scenarios, misjudgments still occur in some special scenarios. For example, the search results do not match the target pedestrians on rainy or foggy days with low visibility. We will further investigate person search methods in different environments [49].
• Our proposed method applies only to images. Searching for target pedestrians in video requires further refinement of the model and further testing.
• Current person search methods have generally not been widely used in fields such as security, and a great deal of theoretical research and equipment debugging is still needed [50][51][52].
Abbreviations
ABOS: An attention-based one-stage framework for person search