Spatial-spectral hyperspectral image classification based on information measurement and CNN

In order to construct a virtual land environment for virtual tests, we propose a construction method for the virtual land environment using multi-satellite remote sensing data, the key step of which is the accurate recognition of ground objects. In this paper, a method of ground object recognition based on hyperspectral images (HSI) is proposed, i.e., a HSI classification method based on information measure and convolutional neural networks (CNN) combined with spatial-spectral information. Firstly, the three most important spectra of the hyperspectral image were selected based on information measure. Specifically, the entropy and the color-matching functions were applied to determine the candidate spectra sets from all the spectra of the hyperspectral image. Then the three spectra with the largest amount of information were selected through the minimum mutual information. Through the above two steps, the dimensionality reduction of hyperspectral images was effectively achieved. Based on the three selected spectra, the CNN network input combined with the spatial-spectral information was designed. Two input strategies were designed: (1) The patch surrounding the pixel to be classified was directly intercepted from the grayscale images of the three selected spectra. (2) In order to highlight the effect of the spectrum of the pixel to be classified, all the spectral components of this pixel were superimposed on the patch obtained by the previous strategy. As a result, a new patch with more prominent spectral components of the pixel to be classified was obtained. Using two public hyperspectral datasets, Salinas and Pavia Center, experiments on both parameter selection and classification performance were performed to verify that the proposed methods had better classification performance.

feature extraction and band selection. The feature extraction methods mainly include principal component analysis (PCA) [9], linear discriminant analysis (LDA) [10], multidimensional scaling [11], etc. The band selection methods mainly consist of the examination of correlation [12], the calculation of mutual information [13], etc. In recent years, information-based band selection has become a very popular research topic, which usually uses Shannon entropy or its variants, such as mutual information, as the basis for the measurement of image information. Martínez-Usó et al. proposed a clustering method based on mutual information for the automatic band selection in multispectral images [14]. Wang et al. proposed a supervised classification method based on spatial entropy for the band selection [15]. Moan et al. proposed a new spectral image visualization method to achieve the band selection through third-order information measurement [16]. Salem et al. proposed a band selection method based on the hierarchical clustering of the spectral layers. A c-means clustering algorithm that combined spatial-spectral information was proposed [17]. A unified framework was proposed to evaluate four methods from five perspectives, including feature entropy and classification accuracy [18]. Hossain et al. proposed a dimensionality reduction method (PCA-nMI) that combined PCA with normalized mutual information (nMI) under two constraints [19]. Therefore, band selection methods based on information measurement can effectively reduce the dimensionality of the hyperspectral image, maximize the selected band information, minimize the redundancy, and achieve excellent classification performance for spectral images at a reduced computational complexity.
The same ground objects may have different spectra and different ground objects may have similar spectral features, so classification by spectral information alone may sometimes cause errors [20, 21]. In order to address this problem, hyperspectral image classification methods with spatial-spectral information have appeared in recent years. Using a stacked auto-encoder (SAE) to combine the spatial and spectral features, Chen et al. proposed a spatial principal component information classification method [22]. Slavkovikj et al. proposed a CNN framework for hyperspectral image classification, in which spectral features were extracted from a small neighborhood [23]. Makantasis et al. proposed an R-PCA CNN classification method [24], in which PCA was first used to extract spatial information, CNN was then used to encode the spectral and spatial information of the hyperspectral data, and multilayer perceptrons were used for the classification of the hyperspectral image. A regularized feature extraction model based on CNN was introduced to extract effective spatial-spectral features for classification [25]. Ghamisi et al. proposed a SICNN model, in which CNN was combined with fractional-order Darwin particle swarm optimization (FODPSO) to select the band that had the largest amount of spatial information and fit the input of the CNN model [26]. Mei et al. proposed the SS-CNN classification method and discussed a CNN framework for merging spatial and spectral features [27]. Zhao et al. proposed a classification framework based on the spatial-spectral feature (SSFC). In this framework, dimensionality reduction and deep learning algorithms were used to extract spectral and spatial features, respectively. In addition, CNN was used to automatically find deep spatial correlation features.
The joint features were extracted by stacking spectral and spatial features, and finally the hyperspectral image was classified with the trained multi-feature classifier [28]. Therefore, classification methods based on spatial-spectral information can fully extract the joint spatial-spectral features of the hyperspectral image and achieve higher classification accuracy than methods relying on the extraction of only the spectral information or only the spatial information.
As an advanced machine learning technology, deep learning uses deep neural networks to learn hierarchical features of the raw input data from low level to high level. Deep learning technology has been widely used in image classification [29], agronomy [30], mobile positioning [31], and hyperspectral image classification [32]. Models based on convolutional neural networks (CNN) can detect local features of the input hyperspectral data, achieving highly accurate and stable classification results [20]. CNN has proven to be superior to SVM in classifying the spectral features extracted from hyperspectral images [33]. Since the feature reliability of each pixel determines the accuracy of classification, it is important to design a feature extraction algorithm dedicated to hyperspectral image classification. Ma et al. proposed a contextual deep learning algorithm for feature learning, which can characterize information better than extraction algorithms with predefined features [34]. Pan et al. proposed a novel simplified deep learning model based on regularized greedy forest (RGF) and vertex component analysis network (R-VCANet), which achieved higher accuracy when the number of training samples was insufficient [35]. Mou et al. proposed a new recurrent neural network (RNN) model that can effectively analyze hyperspectral pixels and use them as sequence data to derive information categories through network derivation [36]. A semi-supervised classification method based on multi-decision labeling and deep feature learning was proposed to achieve classification tasks using as much information as possible [37]. Zhang et al. exploited deep feature representations to obtain valuable features and improve classification accuracy [38]. Niu et al. proposed a learning-to-rank model using Siamese convolutional neural networks for 2D and 3D image quality assessment [39].
Therefore, in hyperspectral image classification, the application of deep learning has become more and more extensive and achieved remarkable results. Especially, CNN has been proven to have great advantages in extracting image feature information and realizing image high accuracy classification.
In this paper, a HSI classification method based on information measure and CNN was proposed. Firstly, the uncorrelated spectra were excluded through the calculation of the entropy information, and the preliminary selection of the candidate spectra was completed by combining the color-matching functions. Then, the redundant spectra were identified with the mutual information. In addition, the spectra containing the most useful information were further selected by calculating the minimum normalized mutual information. Moreover, using the selected spectra based on the information measure, the pseudo-color images were generated to achieve the texture information of the ground. Furthermore, a CNN classification method based on the information measure and the enhancement of spectral information was proposed. The integrated spatial-spectral information was put into CNN to achieve high classification accuracy of hyperspectral image.

Related work
Virtual test technology is one of the research fields of our team. Virtual tests rely on a virtual environment, and the synthetic natural environment is an important part of the virtual environment, which includes the atmosphere, land, ocean, and space environments [40]. The virtual land environment for virtual tests can not only display the virtual test scene for the test personnel, but also provide the sensing basis for virtual sensors and interact with other virtual environments [41]. This requires the virtual land environment to provide accurate ground information in addition to the traditional texture and elevation information. For this reason, we propose a virtual land environment construction scheme as shown in Fig. 1. It is based on multi-source satellite ground observation data and realized by four steps, including multi-source data fusion, ground object recognition, and so on.
Based on hyperspectral image (HSI), multispectral image (MSI), panchromatic image (PAN), and other optical Earth observation data, as well as radar Earth observation data such as InSAR, the construction of the virtual land environment is completed through four steps: (A) temporal-spatial-spectral fusion, (B) ground truth recognition, (C) digital elevation model (DEM) extraction, and (D) virtual land environment synthesis. In step A, the HSI is mainly used to obtain the high "temporal-spatial-spectral" resolution HSI which can meet the requirements of building the virtual land environment by data-level fusion with PAN, MSI, and other homogeneous remote sensing images with complementary advantages, which is essentially a joint application of multi-sensor data [42, 43].
Step B obtains accurate ground truth information based on the HSI generated in step A. In step C, InSAR images are used to acquire the DEM needed for constructing the 3D terrain. InSAR is a ground observation technology which has developed rapidly in recent years and has the characteristics of large scale, high precision, and all-weather operation. Finally, step D synthesizes the virtual land environment for the virtual experiment by using the ground truth, elevation, and texture. Among these steps, step B, ground truth recognition, is a key step; it is responsible for providing accurate ground truth information and texture images for the virtual land environment. The HSI classification method proposed in this article belongs to this step: the ground truth information is obtained through the classification of the HSI, and the texture image is obtained by pseudo-color image synthesis based on the spectra selected by information measure.

Determination of candidate spectra sets
The amount of entropy was measured by the degree of uncertainty, i.e., the probability of the occurrence of discrete random events [44]. When the amount of information was larger, the redundancy was smaller. Information measurement based on Shannon's communication theory has proven to be very effective in identifying the redundancy of high-dimensional datasets. When these measurements were applied to hyperspectral images, each channel was equivalent to a random variable X, and all the pixels were considered as events x_i of X. The channels with less information were excluded by Shannon entropy to determine the candidate spectra sets as follows. First, the entropy of each spectrum of the hyperspectral image, H(B_i), was calculated as

H(B_i) = -Σ_x p_{B_i}(x) log_b p_{B_i}(x),

where the random variable B_i is the ith spectrum (i = 1, 2, ..., n), x ranges over the pixel values of the ith spectrum, p_{B_i}(x) is the probability density function of the spectrum B_i, and b is the logarithmic base.
Second, the local average of the entropy of each spectrum was defined as

H_m(B_i) = (1 / (2m + 1)) Σ_{j = i-m}^{i+m} H(B_j),

where m is the window size, indicating the size of the neighborhood. Finally, the spectrum B_i that met the following condition was retained:

(1 - σ) H_m(B_i) ≤ H(B_i) ≤ (1 + σ) H_m(B_i),

where σ is the threshold factor. The spectrum was considered redundant if its entropy floated above or below the local average H_m(B_i) by more than the threshold factor σ.
As shown in Fig. 2, the horizontal axis represents the number of spectra, i.e., the spectral dimension, the vertical axis represents the entropy of each spectrum, and the blue curve is the entropy curve. The smoothness of the entropy curve determined the values of the window size m and the threshold factor σ. If the curve was smooth, the change of the adjacent spectra information of the hyperspectral image was small, the uncertainty of the spectrum information was small, the number of spectra falling outside the relevant range was small, the probability of having an uncorrelated spectrum was small, and the number of redundant spectra was small. In this case, smaller values of σ and m were chosen to improve the ability to exclude redundant spectra. On the contrary, if the curve fluctuated greatly, the information of the adjacent spectra of the hyperspectral image changed dramatically, the uncertainty of the spectrum information was large, the number of spectra falling outside the correlation range was large, the probability of having an uncorrelated spectrum was large, and there were many redundant spectra. In this case, larger values of σ and m were chosen to prevent the elimination of spectra with valid information.
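The entropy-based screening described above can be sketched in a few lines of NumPy. This is an illustrative implementation under our own naming (`band_entropy` and `select_by_entropy` are not from the paper), with the local average taken over a neighborhood of ±m bands and the retention condition taken as entropy within ±σ of that average:

```python
import numpy as np

def band_entropy(band, bins=256, base=2):
    """Shannon entropy H(B_i) of one spectral band from its gray-level histogram."""
    hist, _ = np.histogram(band, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                      # skip empty bins (0 * log 0 = 0)
    return -np.sum(p * np.log(p) / np.log(base))

def select_by_entropy(cube, m=20, sigma=0.1):
    """Keep bands whose entropy stays within +/- sigma of the local average H_m(B_i).

    cube: ndarray of shape (rows, cols, n_bands)."""
    n = cube.shape[-1]
    H = np.array([band_entropy(cube[..., i]) for i in range(n)])
    kept = []
    for i in range(n):
        lo, hi = max(0, i - m), min(n, i + m + 1)
        Hm = H[lo:hi].mean()          # local average over the size-m neighborhood
        if (1 - sigma) * Hm <= H[i] <= (1 + sigma) * Hm:
            kept.append(i)
    return kept, H
```

The choice of 256 histogram bins and the clipping of the window at the first and last bands are additional assumptions made for the sketch.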
Next, based on the calculated entropy, the CIE 1931 standard observer color-matching functions (CMF) [45], which describe the visual color characteristics of human eyes, were used to complete the determination of the candidate spectra. The CMF determine the amounts of the three primary lights (red, green, and blue) needed to achieve the same visual effect as the monochromatic light of the corresponding wavelength. By applying CIE color matching to hyperspectral images in the visible range, the hyperspectral images can be visualized as images with correct color matching [46]. By setting a threshold value t for the CMF coefficients of the three primary colors, the candidate spectra sets based on the three primary color channels, Set_R^t, Set_B^t, and Set_G^t, were obtained.
In Fig. 3, two example thresholds (t = 0.1, t = 0.5) are shown on the trend curve of the red-light CMF coefficient. When the CMF coefficient was above the threshold, the corresponding spectra were preserved. However, it was very challenging to set the parameter t without a specific application. In this paper, an automatic threshold method was used, in which the optimal threshold t was defined as the value that maximized the amount of discarded information. S_t^discard was defined as the set of channels that were discarded by thresholding the CMF, and S_t^selected was the complementary set of S_t^discard. The optimal threshold was defined as

t* = arg max_t H(S_t^discard),

where H(S_t^discard) is the total entropy of the discarded spectra and H(S_t^selected) is the total entropy of the selected spectra. Based on these results, the spectra were initially selected.

Spectra selection
Mutual information is a measure of useful information. It is defined as the amount of information about one random variable that is contained in another random variable. The mutual information between two random variables X and Y is defined as

I(X; Y) = H(X) + H(Y) - H(X, Y),

where H(X) = -Σ_i P_X(x_i) log P_X(x_i), P_X(x_i) is the probability density function of X, and H(X, Y) is the joint entropy of the two random variables X and Y. Further, Bell proposed that the mutual information of three random variables X, Y, and Z can be defined as [47]

I(X; Y; Z) = H(X) + H(Y) + H(Z) - H(X, Y) - H(X, Z) - H(Y, Z) + H(X, Y, Z),

where H(X, Y, Z) is the third-order joint entropy of the three random variables X, Y, and Z.
The above principle is also applicable to hyperspectral images. The information of one channel can increase the mutual information between the other two channels. In this case, when the overlapped information between the two channels is less, the interdependence degree between the two random variables is lower, and more information is contained. Two criteria need to be considered to reduce the dimensionality of hyperspectral images, i.e., maximum information and minimum redundancy.
Pla proposed to standardize the mutual information [48]. In this paper, the kth-order normalized information (NI) of the spectra S = {B_1, ..., B_k} was used as the standardized mutual information. NI was defined as

NI_k(S) = I(S) / Σ_{i=1}^{k} H(B_i),

where I(S) is the mutual information among the spectra B_1 to B_k and H(B_i) is the entropy of the spectrum B_i.
In the previous section, the threshold t was set for the CMF coefficients of the three primary colors to obtain the three candidate spectra sets, i.e., Set_R^t, Set_B^t, and Set_G^t. When the value of the mutual information is smaller, the amount of information contained in the selected spectra is larger, and the dimensionality reduction effect on the hyperspectral image is better. Based on the mutual information, the three spectra x_R ∈ Set_R^t, y_B ∈ Set_B^t, and z_G ∈ Set_G^t were therefore obtained by minimizing NI_3(x_R, y_B, z_G). These three spectra contained the most important information of the hyperspectral images.
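The minimum-NI_3 search over the three candidate sets can be sketched as below. The joint entropies are estimated from joint histograms; the bin count, the exhaustive search, and all function names are illustrative assumptions on top of the definitions above:

```python
import numpy as np
from itertools import product

def joint_entropy(*bands, bins=32):
    """k-th order joint entropy H(B_1, ..., B_k) from a joint histogram."""
    digitized = [np.digitize(b.ravel(), np.linspace(b.min(), b.max(), bins))
                 for b in bands]
    _, counts = np.unique(np.stack(digitized, axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def ni3(b1, b2, b3):
    """Normalized information NI_3 = I(B1; B2; B3) / (H(B1) + H(B2) + H(B3))."""
    H1, H2, H3 = (joint_entropy(b) for b in (b1, b2, b3))
    I3 = (H1 + H2 + H3
          - joint_entropy(b1, b2) - joint_entropy(b1, b3) - joint_entropy(b2, b3)
          + joint_entropy(b1, b2, b3))
    return I3 / (H1 + H2 + H3)

def pick_triplet(cube, set_r, set_b, set_g):
    """Exhaustively search (x_R, y_B, z_G) minimizing NI_3 over the candidate sets."""
    return min(product(set_r, set_b, set_g),
               key=lambda i: ni3(cube[..., i[0]], cube[..., i[1]], cube[..., i[2]]))
```

In practice the candidate sets left after the entropy and CMF screening are small, so the exhaustive search stays cheap.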

Pseudo-color image synthesis
To obtain the texture information in the construction of the virtual land environment, we adopt the pseudo-color image synthesis technology, which is based on the three spectra selected in the previous section. Besides, the pseudo-color images are also convenient for human eyes to observe. As an image enhancement technology, pseudo-color image synthesis can synthesize multispectral monochrome images into pseudo-color images by adding or subtracting colors.
Using the three grayscale images of x_R, y_B, and z_G, the pseudo-color image synthesis can be performed based on the information measurement according to Section 3.1.1.
As shown in Fig. 4, the three grayscale images of the hyperspectral three-dimensional color image were saved by the three-color synthesis method [49], and the three color-changing functions corresponding to the three colors red, green, and blue were set to R(x, y), G(x, y), and B(x, y):
R(x, y) = Red(Gray_1(x, y)), G(x, y) = Green(Gray_2(x, y)), B(x, y) = Blue(Gray_3(x, y)), where Gray_i(x, y) (i = 1, 2, 3) represents the grayscale data of the three grayscale images, and Red(Gray_1(x, y)), Green(Gray_2(x, y)), and Blue(Gray_3(x, y)) indicate the red, green, and blue color conversions corresponding to the three grayscale images, respectively. Finally, the three color-converted images were combined into a pseudo-color image. The fidelity of the generated pseudo-color image can also be improved through color correction technology [50].
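In the simplest case, this synthesis amounts to mapping the three selected grayscale bands to the R, G, and B channels of one image. A minimal sketch follows; the per-channel linear contrast stretch is our assumption, since the paper does not specify the color-conversion functions in detail:

```python
import numpy as np

def to_pseudo_color(gray_r, gray_g, gray_b):
    """Stack three selected grayscale bands into a pseudo-color RGB image (uint8)."""
    def stretch(g):
        # Linear contrast stretch of one channel to the 0-255 range.
        g = g.astype(np.float64)
        span = g.max() - g.min()
        if span == 0:
            return np.zeros(g.shape, dtype=np.uint8)   # flat channel -> all black
        return np.round((g - g.min()) / span * 255).astype(np.uint8)
    return np.dstack([stretch(gray_r), stretch(gray_g), stretch(gray_b)])
```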

HSI classification based on information measure
As shown in Fig. 5, the classification process of hyperspectral images based on information measure (IM for short) can be divided into two main steps. Firstly, the candidate spectra sets were determined based on entropy and the color-matching functions, and the three most important spectra were selected based on the minimum mutual information; the grayscale images of these three spectra were synthesized into the pseudo-color images. These three spectra contained the most important spectral information. The neighborhood of the pixel to be classified was then extracted to generate a patch with spatial-spectral information. Secondly, the patches were input into the CNN for training and testing. As shown in Fig. 6, assume that the size of the hyperspectral data is I_1 × I_2 × I_3. From the three spectra x_R, y_B, z_G, an m × m × 3 patch with spatial-spectral information was extracted and put into the CNN. From the perspective of the two spatial dimensions I_1 × I_2, the patch input to the CNN contained three layers of spatial information with the size of m × m. From the perspective of the spectral dimension I_3, the patch contained all the spectral information of the three spectra x_R, y_B, z_G. Therefore, the method uses all the spatial-spectral information of the three spectra, and the number of input channels to the CNN was in_channels = 3. The CNN consists of two convolution layers (Conv1 and Conv2), two pooling layers (Max-pooling1 and Max-pooling2), and two fully connected layers (Full_connected1 and Full_connected2). Finally, the classification of hyperspectral images is realized by a Softmax classifier. The training stage is as follows:
Step 1. Randomly initialize the network parameters, including those of Conv1, Conv2, Max-pooling1, Max-pooling2, and so on.
Step 2. The input patch is propagated forward through the convolution layers, the pooling layers, and the full connection layers to get the output value. Step 3. Calculate the error between the output value and the target value of the network.
Step 4. When the error is greater than the expected value, the error is propagated back to the network, and the errors of the full connection layers, the pooling layers, and the convolution layers are obtained in turn; when the error is equal to or less than the expected value, the training is ended.
Step 5. Update the network parameters according to the obtained errors and go to Step 2.
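The m × m × 3 patch extraction that feeds the network can be sketched as follows. Handling the image border by edge padding is our assumption, since the paper does not specify it, and the function name is illustrative:

```python
import numpy as np

def extract_patch(cube, row, col, m, band_idx):
    """Cut the m x m x 3 spatial-spectral patch centered on pixel (row, col)
    from the three selected bands; borders are handled by edge padding.

    cube: ndarray (I1, I2, I3); band_idx: the three selected band indices."""
    r = m // 2
    selected = cube[..., band_idx]                      # keep only x_R, y_B, z_G
    padded = np.pad(selected, ((r, r), (r, r), (0, 0)), mode="edge")
    return padded[row:row + m, col:col + m, :]          # m x m x 3 window
```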

HSI classification based on information measure and enhanced spectral information
On the basis of the previous method, we propose HSI classification based on information measure and enhanced spectral information (IM_ESI for short), as shown in Fig. 7. Compared with the IM method, the main difference of IM_ESI is that it extracts all the spectral information of the pixel to be classified (the patch central pixel) and attaches it to the previous m × m × 3 patch by combining, intercepting, and deforming the spectral information, finally generating a new m × m × 6 patch as the CNN's input. As shown in Fig. 8, firstly, the one-dimensional spectral information of the patch center with the size of 1 × 1 × I_3 was repeatedly superimposed n times to obtain one-dimensional spectral information with the size of 1 × (n × I_3). A one-dimensional spectral vector with the same size as the two-dimensional spatial information m × m was then intercepted. The one-dimensional spectrum with the size of 1 × (m × m) was deformed into a two-dimensional spectral matrix of m × m and superimposed in three layers. The obtained result was combined with the three layers of m × m two-dimensional spatial information of the three spectra x_R, y_B, z_G to obtain a spatial-spectral patch with the size of m × m × 6. The obtained spatial-spectral patch was put into the CNN, and the number of input channels in the CNN model was in_channels = 6. The structure of the CNN is the same as that of the previous network.
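The IM_ESI patch construction can be sketched directly from the description above (`build_esi_patch` is an illustrative name; choosing the smallest n with n × I_3 ≥ m × m is our reading of the "repeated n times" step):

```python
import numpy as np

def build_esi_patch(spatial_patch, spectrum):
    """Build the m x m x 6 IM_ESI input: tile the center pixel's spectrum until it
    covers m*m values, reshape to m x m, stack 3 copies, and concatenate with
    the m x m x 3 spatial patch."""
    m = spatial_patch.shape[0]
    n_rep = int(np.ceil(m * m / spectrum.size))   # smallest n with n * I3 >= m * m
    tiled = np.tile(spectrum, n_rep)[: m * m]     # intercept the first m*m values
    spec_block = np.repeat(tiled.reshape(m, m)[..., None], 3, axis=-1)
    return np.concatenate([spatial_patch, spec_block], axis=-1)
```

For the Salinas setting later in the paper (I_3 = 204, m = 27) this gives n = 4, i.e., a 1 × 816 vector cut to 1 × 729 and reshaped to 27 × 27.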

Datasets
The Salinas dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in the Salinas Valley, California, USA. The uncorrected hyperspectral image data contained 224 spectra, and the corresponding two-dimensional floor space of each spectrum contained 512 × 217 pixels with a spatial resolution of 3.7 m per pixel. The corrected spectral dimension was reduced to 204 by removing the 20 spectra (108-112, 154-167, and 224) which covered the water absorption region. The ground objects of the hyperspectral image were classified into 16 types. The Pavia Center dataset was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over Pavia, Italy; its hyperspectral image data had 102 spectra after the noise spectra were deleted. In addition, the two-dimensional ground space corresponding to each spectrum contained 1096 × 1096 pixels. The area contained 9 types of ground objects. The datasets were obtained from the website (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes). There were both similarities and differences between the two datasets. Both the Salinas and Pavia Center datasets had large sizes, but the Pavia Center dataset was extremely large: its 2D image size was about 11 times that of the Salinas dataset. The Salinas dataset mainly reflected vegetation information (such as Fallow and Celery) and included a rich variety of feature classes with a regular distribution. The Pavia Center dataset mainly reflected urban landscape information and included a small number of feature classes with a strongly irregular distribution. The Salinas dataset contained rich spectral information while the Pavia Center dataset contained less spectral information. The spatial resolution of both datasets was high, and the spatial resolution of the Pavia Center dataset was higher than that of the Salinas dataset.
In addition, through principal component analysis, it was found that the first principal component of both datasets contained far more information than the other bands, indicating that the image information had a concentrated distribution. According to the characteristics of the two datasets, 5% of all labeled pixels in the Salinas dataset were randomly selected as training samples, and the remaining 95% of the pixels were used as test samples. The distribution of sample sizes in the Salinas dataset is shown in Table 1. Nine percent of the labeled pixels of the Pavia Center dataset were randomly selected as training samples, and the remaining 91% of the pixels were used as test samples. The distribution of sample sizes in the Pavia Center dataset is shown in Table 2.
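The random selection of training pixels described above can be sketched as follows (the function name and the fixed seed are illustrative assumptions; labels > 0 mark labeled pixels, as is conventional for these datasets):

```python
import numpy as np

def split_labeled_pixels(labels, train_frac, seed=0):
    """Randomly split the labeled pixels (label > 0) of a ground-truth map into
    training and test index sets, e.g. train_frac=0.05 for Salinas, 0.09 for Pavia."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(labels.ravel() > 0)   # flat indices of labeled pixels
    rng.shuffle(idx)
    n_train = int(round(train_frac * idx.size))
    return idx[:n_train], idx[n_train:]
```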

Experiments and result analysis

Parameter selection experiments
In the spectra selection algorithm based on information measure, two key parameters needed to be manually adjusted, i.e., the window size m and the threshold σ. Both parameters determined the excluded spectra and had a significant impact on the CNN classification results of hyperspectral images.

Experiment on parameter σ

As shown in Table 3, for the Salinas dataset, the window size was set to m = 11. Under different σ values, the classification accuracy of the IM method (using OA as an example) was different.
From the results in Table 3, the classification performance was the best when σ = 0.05 and σ = 0.15. However, OA fluctuated considerably for σ values beyond these two peaks and then decreased. In addition, when σ > 0.2, the OA results were independent of the change of σ. Thus, when σ was large enough, different values of σ had almost the same effect on the classification results. Through repeated experiments, the threshold σ was set to a medium value, i.e., σ = 0.1.

Experiment on parameter m
The window size m also had an important influence on the proposed method. The threshold was set to σ = 0.1. The effect of different window sizes m on OA is shown in Table 4. By comparing the OA values in Table 4, when m > 14, the overall classification accuracy was maintained at a high level. Since the entropy curve of the Salinas dataset changed abruptly, a larger window size (m = 20) and a medium σ (σ = 0.1) were more suitable. Because the distributions of the image information in the Salinas dataset and the Pavia Center dataset were both relatively concentrated, the window size was set to m = 20 and the threshold was set to σ = 0.1 for both datasets.

Pseudo-color image synthesis experiments
For the Salinas dataset, the three spectra x_R, y_B, z_G selected by the above parameters based on the information measure corresponded to the 32nd, 61st, and 66th spectra, and the mutual information NI_3(x_R, y_B, z_G) was minimized to NI_3(32, 61, 66) = 0.545. For the Pavia Center dataset, the three selected spectra were the 6th, 68th, and 28th spectra, and the minimum mutual information was NI_3(6, 68, 28) = 0.5358. A pseudo-color image was generated from the grayscale images of the three selected spectra on each of the two datasets, as shown in Fig. 9.

Classification performance comparison experiments
In the IM method, the patch sizes for the Salinas dataset and the Pavia Center dataset were 27 × 27 × 3 and 21 × 21 × 3, respectively. In the IM_ESI method, the patch size of the spatial information for the Salinas dataset was 27 × 27, and the spectral information of each pixel was represented by a one-dimensional vector with the size of 1 × 204. Thus, the spectral information was repeated four times and accumulated to obtain a 1 × 816 vector. Then a one-dimensional spectral vector with the size of 1 × 729 (27 × 27 = 729) was intercepted and transformed into a 27 × 27 two-dimensional spectral matrix. This matrix was copied three times to get a 27 × 27 × 3 three-dimensional spectral patch. Combining it with the 27 × 27 × 3 patch obtained by the IM method then gave a new 27 × 27 × 6 patch for the CNN's input. Similarly, the patch size for the Pavia Center dataset in the IM_ESI method was 21 × 21 × 6.
As shown in Table 5, two CNN networks were created for the two hyperspectral image datasets. In both networks, the ReLU activation function and the maximum pooling mode were selected. Random inactivation (dropout) prevented or mitigated overfitting. The value of keep_prob (keep_prob = 0.5) indicated the probability that a neuron was retained; with keep_prob = 0.5, 50% of the data was discarded. Both networks were randomly initialized by the normal distribution with known mean and standard deviation. After the initialization was completed, the training samples were input into the networks and the weights of the networks were updated. The convolutional layers (Conv1, Conv2), the pooling layers (Maxpool1, Maxpool2), and the fully connected layers (Fc1_units, Fc2_units) of the CNN model were set according to Table 5. In the table, 128@5 × 5 indicates that there were 128 convolution kernels with the size of 5 × 5 in the layer, and strides = 2 means that the step size was 2. The learning rate in the training of the CNN networks for both datasets was set to 0.005; the training epochs were set to 260 for the Salinas dataset and 100 for the Pavia Center dataset. After the network was trained, the objects to be classified could be input into the corresponding CNN classification model to predict the category of each object.
In order to fully verify the validity of the method, the proposed IM and IM_ESI methods were compared with the following similar methods: (1) the CNN classification method based on the spectral information of the hyperspectral image (referred to as SPE), which was developed according to the concept of "classification method based on spectral information gray image" in [51]; (2) the CNN classification method based on hyperspectral image spatial information (referred to as PCA1), which was developed according to [25] using a 2D-CNN to extract spatial features; (3) two types of CNN classification methods for hyperspectral images with fused spatial-spectral information, i.e., the CNN classification method with the fusion of first principal component spatial information and spectral information (referred to as PCA1_SPE) and the CNN classification method based on the spatial-spectral information of the first three PCA principal components (referred to as PCA3), which were obtained according to the combination idea of spatial-spectral information proposed in [52]. In the experiment, three evaluation indicators (overall classification accuracy (OA), average classification accuracy (AA), and the Kappa coefficient) were used to evaluate the classification performance [53]. OA is the ratio of the number of correctly classified pixels to the total number of pixels. AA is the mean, over all classes, of the ratio of the number of correctly classified pixels in each class to the total number of pixels in that class. The Kappa coefficient is an index based on the confusion matrix to measure the classification accuracy; it can be computed as

kappa = (N Σ_i X_ii - Σ_i (X_i+ · X_+i)) / (N² - Σ_i (X_i+ · X_+i)),

where N is the total number of pixels, X_ii is the number of correctly classified pixels of class i, X_i+ and X_+i are the sums of the ith row and the ith column of the confusion matrix, respectively, and Σ_i (X_i+ · X_+i) is the sum of the products of X_i+ and X_+i.

Experiment on Salinas dataset

The comparison of the classification performance on the Salinas dataset is shown in Table 6. From Table 6, the IM and IM_ESI methods had the best classification accuracy on the Salinas dataset.
Especially, compared to the SPE, PCA1, and PCA3 methods, the OA of the IM_ESI method was improved by 12.07%, 4.69%, and 1.01%, respectively. The excellent performance on AA and the kappa coefficient fully demonstrated the stability and accuracy of the IM and IM_ESI methods. Figure 10 a-f show the classification results of the SPE, PCA1, PCA1_SPE, PCA3, IM, and IM_ESI methods on the Salinas dataset, and g shows the ground truth of the Salinas dataset. From Fig. 10, it can be seen more intuitively that the overall classification effect of the IM and IM_ESI methods on the Salinas dataset is significantly better than that of the SPE, PCA1, PCA1_SPE, and PCA3 methods. Especially for highly similar feature classes, such as classes 1 and 2 (Brocoli_green_weeds_1 and Brocoli_green_weeds_2) and classes 8 and 15 (Grapes_untrained and Vinyard_untrained), the classification effect was significantly improved, and misclassification in the scenario of highly similar features was significantly reduced. The results indicated that the IM and IM_ESI methods had outstanding advantages in classifying highly similar feature information.
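The three evaluation indicators used above (OA, AA, and the kappa coefficient) can all be derived from the confusion matrix. A minimal sketch, assuming rows index the true classes and columns the predicted classes:

```python
import numpy as np

def classification_metrics(cm):
    """OA, AA, and kappa from a confusion matrix (rows: true, cols: predicted)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()                                 # total number of pixels
    oa = np.trace(cm) / n                        # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # mean of per-class accuracies
    # chance agreement: sum of products of row sums (x_i+) and column sums (x_+i)
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

For example, a two-class matrix [[9, 1], [2, 8]] gives OA = AA = 0.85 and kappa = 0.7, since both classes contain 10 pixels and the chance agreement is 0.5.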

Experiment on Pavia Center dataset
The comparison of the classification performance of the SPE, PCA1, PCA1_SPE, PCA3, IM, and IM_ESI methods on the Pavia Center dataset is shown in Table 7. From Table 7, the IM and IM_ESI methods had the best classification accuracy, both above 99%. Especially, the classification accuracy of the IM_ESI method was 8%, 7.36%, and 0.91% higher than that of the SPE, PCA1, and PCA1_SPE methods, respectively. The average accuracy and kappa coefficient of IM and IM_ESI were also the highest, indicating that the predicted classification results for the various feature classes were more consistent with the real object information. Therefore, the results on the Pavia Center dataset further demonstrated the stability and accuracy of the IM and IM_ESI classification methods on large datasets. Figure 11 a-f show the classification results of the SPE, PCA1, PCA1_SPE, PCA3, IM, and IM_ESI methods on the Pavia Center dataset, and g shows the ground truth of the Pavia Center dataset.
From Fig. 11, the IM and IM_ESI methods had outstanding classification performance given the sufficient training samples in the Pavia Center dataset. With both methods, almost all the samples were correctly classified.
Based on the performance comparison between the Salinas and Pavia Center datasets, the hyperspectral image classification methods based on information measurement (IM and IM_ESI) achieved the highest and most stable classification results with both a normal and a very sufficient sample size. With a normal sample size, the classification performance was good, the classification accuracy for highly similar feature information was high, and misclassification was very rare or almost absent. As the sample size grew richer, the classification performance became even more superior. The features extracted by traditional PCA are linear combinations of the original features; PCA requires that the data to be dimensionally reduced be linearly correlated and cannot handle data without linear correlation. Therefore, the nonlinear features embedded in HSI data cannot be preserved by the linear PCA model. In this paper, from the perspective of information measurement, the three spectra with the largest amount of information and the most complementary information are selected, and the experiments show that the information measurement methods used in this paper outperformed the PCA-based ones. Research shows that feature extraction can help machine learning improve generalization performance. Besides PCA, there are other common dimensionality reduction methods, such as kernel PCA (KPCA) [54] and locally linear embedding (LLE) [55]. KPCA generally performs better than PCA because it implicitly exploits higher-order information of the original inputs and can extract more principal components, which eventually results in better generalization performance. LLE is much better than PCA at so-called manifold dimensionality reduction.
LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE can learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text. Therefore, in future research, we can try to use KPCA, ICA, and LLE in place of the methods based on information measurement, or compare our method with classification methods based on these dimensionality reduction techniques for further experimental testing.
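The linearity limitation of PCA discussed above is visible directly in its computation: every extracted feature is a fixed linear combination of the original spectral bands. A minimal sketch of linear PCA via SVD (illustrative only; the compared PCA1/PCA3 baselines additionally feed the projected components into a CNN):

```python
import numpy as np

def pca_project(X, k=3):
    # Linear PCA: project the samples onto the k directions of largest variance.
    # Each new feature is a linear combination of the original bands, which is
    # why nonlinear structure in the spectra cannot be preserved.
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # scores on the top-k components
```

With k = 3 this mirrors the PCA3 baseline's reduction of all spectral bands to three components, whereas the IM method instead keeps three original spectra chosen by entropy and minimum mutual information.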

Discussion
The purpose of the proposed HSI classification method, which performs better than the SPE, PCA1, PCA1_SPE, and PCA3 methods, is to provide accurate ground object information and texture images for the virtual land environment. Our method relies on the spatial-spectral information of the three spectra that contain the most important spectral information; if the spectral information is so scattered that three spectra cannot capture it, our method may not work well. The method therefore has certain limitations in real-world scenes and may not perform as well as it does on the experimental images. In view of this limitation, we will further improve the method to enhance its applicability and generality. Although the method has these limitations, it can provide a reference direction for our future study of hyperspectral image classification methods.

Conclusions
In this paper, a CNN classification method based on information measure (IM) was proposed, and its principle and complete implementation process were introduced. Firstly, the candidate spectral sets were determined based on entropy and the color-matching function. The minimum mutual information was then calculated to achieve the accurate selection of three spectra, and the grayscale images of the selected spectra were synthesized into a pseudo-color image to provide the texture information of the virtual land environment. Secondly, a hyperspectral image classification method (IM_ESI) was proposed by combining the spatial-spectral information of the IM method with enhanced spectral information: the spatial-spectral information of the spectra selected based on information measure was combined, after deformation and superposition, with all the spectral information of the pixel to be classified to obtain a new spatial-spectral patch, which was input into the CNN network. Through the parameter selection experiments, the influence of two key parameters (window size m and threshold σ) on the performance of the proposed method was analyzed, and the CNN network parameters of IM and IM_ESI were reasonably selected. Finally, comparison experiments were performed on the Salinas and Pavia Center datasets. The results showed that the proposed methods (IM and IM_ESI) had good classification performance regardless of sample size and distribution; the classification accuracy was high for information with similar features, and the classification performance on data with sufficient samples was even more superior.