Research on image recognition of intangible cultural heritage based on CNN and wireless network

Images of national costumes are a main way of presenting intangible cultural heritage in digital form and provide important resources for educational informatization. How to use modern information technology to efficiently retrieve images of national costumes has become a hot research topic. Because ethnic costumes have diverse styles and rich colors, it is difficult to accurately describe and extract their visual features. In view of these problems, this paper proposes an image recognition model of intangible cultural heritage based on CNN and wireless network. First, clothing images from ethnic settlements and ethnic museums are acquired through wireless network transmission in order to construct an image material library of intangible cultural heritage. Second, the CNN algorithm is used to train and optimize on the national costume image samples, extract the high-level semantic features of the national costume images and finally realize efficient retrieval of national costume image educational resources. This model lays a good foundation for the informatization of national costumes and contributes to the inheritance and protection of intangible cultural heritage.

How to inherit and protect the traditional culture of national costumes has become a hot spot in contemporary social research.
At present, many scholars have studied the inheritance and protection of the traditional costume culture of ethnic minorities [4]. Song Peng et al. discussed the application of digital technology to the modernization of ethnic minority clothing from the aspects of digital design, digital production, digital promotion and the establishment of a national clothing material database [5]. Qi Yali took Zhuang nationality clothing as the construction object and, through the establishment of a special database, provided useful experience and demonstrations for the rescue and protection, in-depth development and utilization of the traditional costume culture of the Chinese nation and for the construction of a digital project of national costume culture [6]. However, most of these studies are still conducted in the traditional way, and computer vision technology is rarely used in inheriting and protecting national costume culture [7,8]. On the one hand, most scholars study the inheritance and protection of national costume culture from the perspective of social science and make little use of information technology [9]. On the other hand, each ethnic minority has its own unique costume culture, so the workload of informatization based on a national costume image database is very large [10].
Therefore, on the basis of a standardized image library covering a small number of ethnic groups, automatically extracting and analyzing ethnic clothing elements by digital means has important research significance and application value. This paper therefore proposes a research model of intangible cultural heritage image recognition based on wireless sensors and CNN. Combining wireless sensor hardware design with CNN and other data processing and optimization techniques, a technical scheme for national clothing recognition is studied.

Wireless image transmission based on intangible cultural heritage
The wireless sensor network (WSN) is composed of a large number of low-cost, small and low-power miniature sensor nodes. Due to its flexible deployment, strong scalability and good economy, it has broad application prospects in the fields of environmental monitoring, medical care and intelligent transportation [11]. In the process of image recognition research on intangible cultural heritage, wireless sensor networks are used to transmit and obtain clothing images from ethnic residential areas and ethnic museums in order to build an intangible cultural heritage image resource library.

Overall design
The sensor nodes deployed in the wireless sensor network have the characteristics of small size, low power consumption, low cost and strong functions. The typical architecture of a wireless sensor network is shown in Fig. 1. It can be seen from the figure that the sensor nodes join the network through short-range wireless communication in a self-organizing, adaptive, self-recovering and multihop manner. They are usually deployed in a specific target area and form a self-organized multihop wireless network in which data are relayed hop by hop through other nodes to the receiver node [12]. The power supply provides the energy required for the daily work of the sensor node. The program and data of the sensor node are stored in the memory part. Sensing components are used to sense and obtain attribute information of the physical world [13].
With the help of WSNs, people can observe and understand the objective world more conveniently. Because WSN nodes are low in cost, carry their own power supply and need no supporting communication infrastructure, they can self-organize into a network even in complex target environments and monitor the target area effectively.

Collaborative communication in wireless sensor networks
Due to the multipath effect in wireless communication systems, the signal changes during transmission. With MIMO technology, multiple antennas are installed at both the sending end and the receiving end to transmit and receive data, and each transmitting antenna sends signals simultaneously in the same frequency spectrum [14]. MIMO systems use multiantenna technology to multiply the channel capacity and spectrum utilization of the system and convert the original multipath effect into a favorable factor for wireless communication. Multiple antennas in a MIMO system correspond to different spatial positions. The signal coding can be extended to a two-dimensional space-time setting, and the transmitted and received signals are optimized over the space and time dimensions [15]. In this way, higher transmission performance can be obtained. The wireless sensor network cooperative communication MIMO technology is shown in Fig. 2.
The MIMO system model is shown in Fig. 2. Assuming that the MIMO system has M transmit antennas and N receive antennas, the sender uses space-time block coding (STBC) to map the transmission signal to M different information symbols. The signal is then transmitted through the M transmitting antennas at the same frequency. The receiving end uses its N antennas to simultaneously receive the parallel information symbols arriving over different paths from the sending end and then processes the received information symbols to restore the original transmission signal [16].
The signal vector received by the receiving end of the MIMO system can be expressed as:

$$y = Hx + n$$

Here, $H$ is the $N \times M$ channel matrix, $x$ represents the transmission signal and $n$ is the channel noise. The channels used here are assumed to be independent fading channels, for which the channel capacity is approximately:

$$C = \min(M, N)\, B \log_2\left(\frac{\rho}{2}\right)$$

Here, $\min(M, N)$ is the smaller value of $M$ and $N$, $B$ represents the signal bandwidth and $\rho$ represents the average signal-to-noise ratio at the receiving end. It can be seen from the above formula that when the power and bandwidth are fixed, the channel capacity $C$ of the MIMO system is related not only to the signal bandwidth and the received signal-to-noise ratio, but also to the number of antennas at the receiving end and the transmitting end. By increasing the number of transmitting antennas at the transmitting end and the number of receiving antennas at the receiving end, MIMO channel characteristics can be fully utilized.
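The dependence of capacity on the number of antennas can be checked numerically. The following is a minimal sketch, not the authors' implementation: it simulates the received-signal model with a hypothetical helper `mimo_receive` and evaluates capacity with the standard log-determinant expression $B \log_2 \det(I + \frac{\rho}{M} H H^{H})$ for equal power per transmit antenna, which is swapped in here in place of the closed-form approximation discussed in the text. The Rayleigh fading channel, the 20 dB signal-to-noise ratio and the antenna configurations are assumptions chosen only for illustration.

```python
import numpy as np

def mimo_receive(H, x, noise_std, rng):
    """Return y = H x + n, with n drawn as complex Gaussian noise (assumed noise model)."""
    N = H.shape[0]
    n = noise_std * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    return H @ x + n

def mimo_capacity(H, snr, bandwidth=1.0):
    """Capacity B * log2 det(I + (snr / M) H H^H), equal power on each transmit antenna."""
    N, M = H.shape
    gram = np.eye(N) + (snr / M) * (H @ H.conj().T)
    _, logdet = np.linalg.slogdet(gram)  # natural log of the determinant
    return bandwidth * logdet / np.log(2)

rng = np.random.default_rng(0)
snr = 100.0  # 20 dB, assumed for illustration
for M, N in [(1, 1), (2, 2), (4, 4)]:
    capacities = []
    for _ in range(500):
        # Independent Rayleigh fading coefficients with unit average power (assumption).
        H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
        capacities.append(mimo_capacity(H, snr))
    print(f"M={M}, N={N}: mean capacity {np.mean(capacities):.2f} bit/s/Hz")
```

Averaging over many random channels gives an estimate of the ergodic capacity per unit bandwidth for each antenna configuration, so the effect of adding antennas at both ends can be compared directly.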

CNN-based image recognition of intangible cultural heritage
The construction of a digital ethnic clothing library is important for protecting and spreading ethnic clothing culture, but such construction is still at the stage of conception and framework design. Although some research has achieved database design and construction, it is mainly aimed at a limited set of elements and patterns of ethnic clothing or builds a database for the clothing of a single ethnic group [17]. Therefore, on the basis of establishing a standard minority clothing image library, it is of great research significance and application value to automatically carry out digital research on the extraction and analysis of ethnic clothing elements.

Image feature extraction
Image feature extraction is an important topic in computer vision. Its purpose is to express the image content so that the image can be identified. In image feature extraction, a computer extracts the main pixels that make up the image and then mathematically analyzes these coherent pixels to determine which feature they belong to. Color characteristics are the most basic visual characteristics in human understanding of the world [18].

Color characteristics
Compared with texture features and shape features, color features have better stability. Because the extraction and calculation methods of color features are relatively simple and can express the visual information of the image well, they are widely used in content-based image retrieval. To process color images by computer, color characteristics must be described quantitatively [19]. The choice of color space and the quantization of color information are the key determinants of the accuracy of image description in color feature extraction methods. The schematic diagram of the HSV color space is shown in Fig. 3.
HSV color space is a color space model based on visual perception, designed by Alvy Ray Smith to improve the color selection interface of computer graphics software. From the perspective of psychology and vision, HSV color space describes color by the three elements of human color perception: hue, saturation and value. Hue refers to the color of light and is mainly related to the dominant wavelength of light in the mixed spectrum; red, orange, yellow, green, cyan, blue, purple and other colors represent different hues [20].
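As an illustration of how a color space and a quantization scheme turn an image into a color feature, the following minimal sketch builds a quantized HSV histogram and compares two histograms by histogram intersection. The 8 × 3 × 3 bin layout, the function names and the use of histogram intersection as the similarity measure are assumptions made for this example and are not taken from the paper.

```python
import colorsys

def hsv_histogram(pixels, h_bins=8, s_bins=3, v_bins=3):
    """Quantized HSV histogram of RGB pixels (0-255 per channel), normalized to sum to 1."""
    hist = [0.0] * (h_bins * s_bins * v_bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a component equal to 1.0 falls into the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1.0
    total = len(pixels)
    return [count / total for count in hist]

def histogram_intersection(hist_a, hist_b):
    """Similarity in [0, 1]; 1 means the two normalized histograms are identical."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))

red_patch = [(255, 0, 0)] * 10
mixed_patch = [(255, 0, 0)] * 5 + [(0, 0, 255)] * 5
print(histogram_intersection(hsv_histogram(red_patch), hsv_histogram(mixed_patch)))
```

Coarser bins make the feature more tolerant of lighting changes but less discriminative, which is why the choice of quantization affects retrieval accuracy.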

Space color characteristics
All the above-mentioned global color features have obvious defects, that is, the loss of the spatial position information of the color in the color image. Therefore, when the above-mentioned color feature description method is used for content-based image retrieval, it will bring a great error to the retrieval result. The image area division method is shown in Fig. 4.
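The loss of spatial information can be seen in a small sketch. The two toy images below, made up for this example, contain the same colors in mirrored positions, yet a global color histogram cannot tell them apart.

```python
from collections import Counter

def global_color_histogram(image):
    """Count colors over the whole image, ignoring where each pixel sits."""
    return Counter(pixel for row in image for pixel in row)

RED, BLUE = (255, 0, 0), (0, 0, 255)
# Two 4 x 4 toy images: red on the left half versus red on the right half.
left_red = [[RED, RED, BLUE, BLUE] for _ in range(4)]
left_blue = [[BLUE, BLUE, RED, RED] for _ in range(4)]

print(left_red == left_blue)  # False: the images differ
print(global_color_histogram(left_red) == global_color_histogram(left_blue))  # True
```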
Local color features are another method of describing color features considering the color space structure of the image [21,22]. This method divides the image into

Convolutional neural network algorithm
Deep learning is a data representation learning method based on artificial neural networks (ANN) and is part of machine learning. An artificial neural network is a computing network that simulates a biological neural network of the brain. Its basic unit is the artificial neuron. Each neuron performs a weighted summation of multiple inputs and then applies a nonlinear function to obtain the output and pass it to other neurons [23]. In the 1990s, LeCun proposed the convolutional neural network (CNN) and successfully applied it to the recognition of handwritten digits, marking a milestone in the history of deep learning. Compared with traditional neural networks, convolutional neural networks use a more reasonable connection pattern between layers. Convolutional neural networks are formed by stacking multiple different network layers. These layers generate the corresponding output from real-valued input through some basic mathematical operations and differentiable functions [24].
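The behavior of a single artificial neuron can be sketched in a few lines. The sigmoid activation and the optional bias term are assumptions for this example; the text only states that a weighted sum is followed by some nonlinear function.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a sigmoid nonlinearity (assumed activation)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# The weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 = 0, so the sigmoid output is 0.5.
print(neuron([1.0, 2.0], [0.5, -0.25]))
```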

Convolutional layer
The convolutional layer is the core module in CNN and contains a series of convolution kernels. These convolution kernels can also be called filters, and each convolution kernel has a relatively small receptive field on the input image. During the convolution operation, the convolution kernel moves over the input image with a certain step size. After each movement, the weight values in the convolution kernel are multiplied element by element with the pixel values of the image area corresponding to the receptive field. The products are then summed to obtain the value at the corresponding position on the output feature map [25]. Generally, after performing a convolution operation on the input image, the size of the output feature map is smaller than that of the input image.
In the expression for the convolution operation, $x_j^{l}$ represents the jth feature map of layer $l$ and $x_j^{l+1}$ represents the jth feature map of layer $l + 1$.
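The sliding-window computation described above can be sketched as follows. This is a minimal single-channel version without padding, bias or activation, all of which are simplifying assumptions; the function name `conv2d` is hypothetical. With an input of side n, a kernel of side k and step size s, the output side is (n - k) / s + 1 (rounded down), which is why the output feature map is usually smaller than the input.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid (no padding) 2-D convolution as used in CNNs: slide, multiply, sum."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Image area currently covered by the kernel (its receptive field).
            patch = image[r * stride:r * stride + kh, c * stride:c * stride + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))
print(conv2d(image, kernel))  # a 2 x 2 feature map from a 4 x 4 input
```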

Transposed convolution layer
Sometimes, in order to visualize the features of the network or to output a larger image through the network, it is necessary to use a transposed convolution layer. Unlike general image upsampling operations, transposed convolution does not use a fixed interpolation method; as a layer of a convolutional neural network, it has learnable parameters [26]. The schematic diagram of the convolution operation and the transposed convolution operation is shown in Fig. 5.
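Taking the convolution shown in Fig. 5 as an example, a convolution kernel with a size of 3 × 3 and a step size of 1 is used to convolve a matrix of size 4 × 4, and the resulting output is:

$$Y_{4 \times 1} = C_{4 \times 16}\, X_{16 \times 1}$$

$$C = \begin{pmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & 0 & w_{2,1} & w_{2,2} & w_{2,3} & 0 & w_{3,1} & w_{3,2} & w_{3,3} & 0 & 0 & 0 & 0 & 0 \\
0 & w_{1,1} & w_{1,2} & w_{1,3} & 0 & w_{2,1} & w_{2,2} & w_{2,3} & 0 & w_{3,1} & w_{3,2} & w_{3,3} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & w_{1,1} & w_{1,2} & w_{1,3} & 0 & w_{2,1} & w_{2,2} & w_{2,3} & 0 & w_{3,1} & w_{3,2} & w_{3,3} & 0 \\
0 & 0 & 0 & 0 & 0 & w_{1,1} & w_{1,2} & w_{1,3} & 0 & w_{2,1} & w_{2,2} & w_{2,3} & 0 & w_{3,1} & w_{3,2} & w_{3,3}
\end{pmatrix}$$

Here, $C$ is the coefficient matrix representation of the convolution kernel, and the element in the ith row and jth column of the convolution kernel is $w_{i,j}$.
$X_{16 \times 1}$ is the input matrix stretched row by row, from left to right and from top to bottom, into a vector of length 16. $Y_{4 \times 1}$ is the output vector, and the final output is obtained by reshaping $Y_{4 \times 1}$ into a 2 × 2 matrix [27].
The transposed convolution of $Y_{4 \times 1}$ yields:

$$X_{16 \times 1} = C^{T}\, Y_{4 \times 1}$$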

Fig. 5 Schematic diagram of convolution operation and transposed convolution operation
Reshaping $X_{16 \times 1}$ into a 4 × 4 matrix gives the final output of the transposed convolution.
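The matrix view of convolution and transposed convolution can be verified numerically. The sketch below builds the 4 × 16 coefficient matrix for a 3 × 3 kernel with step size 1 on a 4 × 4 input, applies it, and then applies its transpose. The kernel and input values and the helper name `conv_matrix` are arbitrary choices for illustration. Note that the transposed convolution restores the 4 × 4 shape of the input, not its original values.

```python
import numpy as np

def conv_matrix(kernel, in_size=4):
    """Build C so that C @ X.flatten() equals the stride-1, no-padding convolution."""
    k = kernel.shape[0]
    out_size = in_size - k + 1
    C = np.zeros((out_size * out_size, in_size * in_size))
    for r in range(out_size):          # output row
        for c in range(out_size):      # output column
            for i in range(k):         # kernel row
                for j in range(k):     # kernel column
                    C[r * out_size + c, (r + i) * in_size + (c + j)] = kernel[i, j]
    return C

kernel = np.arange(1.0, 10.0).reshape(3, 3)   # stands in for the weights w_{i,j}
X = np.arange(16, dtype=float).reshape(4, 4)

C = conv_matrix(kernel)
Y = C @ X.flatten()                    # convolution: vector of length 4
print(Y.reshape(2, 2))                 # 2 x 2 output feature map

upsampled = (C.T @ Y).reshape(4, 4)    # transposed convolution: back to 4 x 4
print(upsampled.shape)
```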

Pooling layer
The pooling layer performs nonlinear downsampling of the input feature map. Its purpose is to reduce the dimension of the features, thereby indirectly reducing the number of parameters that the subsequent layers of the network need to learn and reducing the risk of overfitting [28]. Like the convolutional layer, the pooling layer also has the concept of a kernel. Given the kernel size and the step size, different receptive fields are selected on the input feature map. A nonlinear function is then used to downsample the pixels in each receptive field. The most common pooling operations are max pooling and average pooling.
In the expression for the pooling operation, $x_j^{l}$ represents the jth feature map of the lth layer and $x_j^{l+1}$ represents the jth feature map of the (l + 1)th layer.
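A minimal sketch of max pooling and average pooling on a single feature map follows. The 2 × 2 kernel with step size 2 is a common choice assumed here for illustration, and the function name `pool2d` is hypothetical.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2-D feature map with max or average pooling."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    reduce_fn = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = feature_map[r * stride:r * stride + size,
                                 c * stride:c * stride + size]
            out[r, c] = reduce_fn(window)
    return out

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(feature_map, mode="max"))
print(pool2d(feature_map, mode="avg"))
```

Pooling has no learnable parameters; it only halves each spatial dimension in this configuration, so the layers that follow receive a quarter as many values.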

Fully connected layer
After a series of convolution and pooling operations, fully connected layers are usually used to integrate the previous feature maps and obtain higher-level features of the image [29,30]. Each feature value of the fully connected layer is connected to all output feature values of the previous layer, as in a traditional artificial neural network [31,32].
The operation of the fully connected layer is as follows [33,34]:

$$y = wx$$

Here, $x$ is the input, $y$ is the output and $w$ is the weight matrix of the fully connected layer.
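A fully connected layer can be sketched as a matrix-vector product applied to the flattened feature maps. The optional bias `b` and the example shapes are assumptions added for illustration; the text defines only the input, the output and the weight matrix.

```python
import numpy as np

def fully_connected(x, w, b=None):
    """Compute y = w @ x, optionally adding a bias vector b (assumption)."""
    y = w @ x
    return y if b is None else y + b

# Two hypothetical 2 x 2 feature maps flattened into one input vector of length 8.
feature_maps = np.arange(8, dtype=float).reshape(2, 2, 2)
x = feature_maps.flatten()

rng = np.random.default_rng(0)
w = rng.standard_normal((3, x.size))   # 3 output features, each connected to all 8 inputs
print(fully_connected(x, w).shape)     # (3,)
```

Because every output is connected to every input, the weight matrix has as many entries as inputs times outputs, which is where most parameters of a classical CNN usually sit.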

Loss layer
The loss layer is used during the training phase of the network [35]. It calculates the penalty value of the network based on the difference between the predicted output of the network and the actual result. Subsequent network parameter adjustments are based on this penalty value [36,37]. The training purpose of the convolutional neural network is to find the most suitable combination of network parameters through a certain optimization strategy, so that the overall loss of the network reaches a minimum. Different tasks use different loss functions in the loss layer [38,39]. L2 loss is suitable for regression tasks, such as the regression of surface coordinates in pictures. Taking L2 loss as an example, the formula is as follows [40]:

$$L = \lVert \hat{y} - y \rVert_2^2$$

Here, $\hat{y}$ is the predicted value of the network and $y$ is the ground-truth label.
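How the penalty value drives parameter adjustment can be sketched with an L2 loss and plain gradient descent on a one-parameter model. The loss is taken here as the plain sum of squared differences (scaling factors such as 1/2 or 1/N vary by convention), and the toy data, learning rate and linear model are assumptions for illustration rather than the training setup of the paper.

```python
import numpy as np

def l2_loss(y_pred, y_true):
    """Sum of squared differences between prediction and ground truth."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sum(diff ** 2))

# Toy regression: fit y = w * x, where the true w is 2.
x = np.array([1.0, 2.0, 3.0])
y_true = np.array([2.0, 4.0, 6.0])

w = 0.0
learning_rate = 0.01
losses = []
for _ in range(50):
    y_pred = w * x
    losses.append(l2_loss(y_pred, y_true))
    grad = float(np.sum(2.0 * (y_pred - y_true) * x))  # dL/dw
    w -= learning_rate * grad                          # step against the gradient

print(w, losses[0], losses[-1])
```

Each step moves the parameter in the direction that lowers the loss, so the recorded losses shrink as `w` approaches the value that generated the data.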

Image selection of national costumes
Guizhou Province in China is a multiethnic province. There are a total of 17 ethnic groups in the province, including the Miao, Buyi, Dong, Tujia, Yi, Gelao, Shui, Hui, Bai and Yao ethnic groups. According to data from the fifth census, the Miao nationality is the most populous ethnic minority in Guizhou Province. The Miao costumes are rich in resources and are the most gorgeous among all ethnic minority costumes. They are often regarded as "walking history books" worn on the body and are strongly representative of the nation. Therefore, this article selects Miao costumes as the research object; the same approach applies to research on other national costumes.
The patterns of Miao costumes can be roughly divided into three categories: geometric patterns, animal patterns and plant patterns. Geometric patterns mainly include cross patterns, tic-tac-toe patterns, diamond patterns, quadrilateral patterns and geometric objects such as sun patterns and moon patterns. These geometric patterns are passed down from generation to generation and embody the Miao people's pursuit of and longing for a peaceful and happy life. The most common plant pattern is the peony pattern, which expresses the Miao people's praise of nature. Animal patterns are the most used patterns in Miao costumes, mainly including animal shapes such as the Miao dragon, butterfly, bird and fish.

Comparison of intangible cultural heritage image recognition
In this experiment, 1000 images containing Miao ethnic costumes were selected for the retrieval test. Each test image is retrieved using the algorithm of this article, and the retrieval accuracy is averaged over the 1000 test images to obtain the final average retrieval accuracy.
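The averaging step can be sketched as follows. The paper does not define per-query retrieval accuracy, so this example assumes it is the fraction of returned images whose label matches the query; the pattern labels and function names are hypothetical.

```python
def retrieval_precision(query_label, retrieved_labels):
    """Fraction of retrieved images that share the query's label (assumed definition)."""
    if not retrieved_labels:
        return 0.0
    hits = sum(1 for label in retrieved_labels if label == query_label)
    return hits / len(retrieved_labels)

def average_retrieval_accuracy(results):
    """Mean per-query precision; results is a list of (query_label, retrieved_labels)."""
    return sum(retrieval_precision(q, r) for q, r in results) / len(results)

results = [
    ("geometric", ["geometric", "animal", "geometric", "geometric"]),  # 3 of 4 correct
    ("plant", ["plant", "plant"]),                                     # 2 of 2 correct
]
print(average_retrieval_accuracy(results))  # (0.75 + 1.0) / 2 = 0.875
```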
In order to verify the degree of influence of the color histogram feature and the edge direction feature on the retrieval performance, we used the two features for the image recognition test. The comparison of the recognition effect of intangible cultural heritage images is shown in Figs. 6 and 7. We can see from Figs. 6 and 7 that the retrieval accuracy of most algorithms first rises and then decreases, which means that the effective fusion of multiple feature extraction algorithms does improve the retrieval performance.
From the experimental results, compared with the other algorithms, the convolutional neural network algorithm performs very well, with an accuracy rate of 94.8%. The support vector machine, particle swarm optimization and BP neural network algorithms do not perform well enough. Among them, the accuracy of the BP neural network is the lowest, only 68%. The main reason is that the BP neural network algorithm uses only low-order moments: the experiments in this paper use at most third-order moments, so it is difficult to describe details. The algorithm is therefore suited to simpler images with clearer shapes, and its low accuracy rate is its most prominent shortcoming. The accuracy of the support vector machine algorithm is 72%, which is also not ideal, mainly because the colors of some national costume images are relatively similar. The accuracy of the particle swarm algorithm is 79%, which is slightly higher than that of the other two low-level features, but it still performs poorly. This is due to the small difference in the texture density of ethnic costumes, which gives the texture-based algorithm low accuracy. The experimental results therefore show that the retrieval accuracy of the convolutional neural network is significantly better than that of the low-level visual features and of the feature extraction methods based on support vector machine, particle swarm and BP neural network, and that it can accurately extract the features of national costumes.

Conclusion
At present, the use of computerized digital methods to protect and inherit intangible cultural heritage such as national costume culture is a research hot spot. This paper therefore proposes a fast and high-accuracy method for retrieving national costume image education resources, which provides a way toward the digital protection of national culture and is meaningful for its inheritance. The paper studies and analyzes the image retrieval of national costumes based on a convolutional neural network. Clothing images from ethnic settlements and ethnic museums are obtained through wireless network transmission, thereby constructing an intangible cultural heritage image resource library. The CNN algorithm is used to train and optimize on the national costume image samples, extract the high-level semantic features of the national costume images and finally realize efficient retrieval of national costume image educational resources. Although this paper has achieved some results on the image recognition of intangible cultural heritage such as national costumes, many problems still need to be addressed. In the future, we will carry out further research in this area with a view to contributing to the inheritance and protection of intangible cultural heritage.