A new deep sparse autoencoder for community detection in complex networks

Feature dimension reduction for community detection is an important research topic in complex networks and has attracted many research efforts in recent years. However, most existing algorithms developed for this purpose rely on classical mechanisms, which can require lengthy experimentation, be time-consuming, and prove ineffective for complex networks. To this end, a novel deep sparse autoencoder for community detection, named DSACD, is proposed in this paper. In DSACD, a similarity matrix is constructed to reveal the indirect connections between nodes, and a deep sparse autoencoder based on unsupervised learning is designed to reduce the dimension and extract the feature structure of complex networks. During back propagation, L-BFGS avoids the explicit calculation of the Hessian matrix, which increases the calculation speed. The performance of DSACD is validated on synthetic and real-world networks. Experimental results demonstrate the effectiveness of DSACD, and systematic comparisons with four algorithms confirm a significant improvement in terms of three indices: F_same, NMI, and modularity Q. Finally, the collected received signal strength indication (RSSI) data set can be aggregated into 64 correct communities, which further confirms its usability in indoor positioning systems.

Based on the deep sparse autoencoder and the quasi-Newton method, we construct a community detection architecture that reduces the information loss of high-dimensional reduction. The proposed architecture, DSACD, first reduces the dimension of the high-dimensional matrix and then realizes community discovery via the deep sparse autoencoder. DSACD also improves the accuracy of the K-means algorithm. In the application, real community data sets for location-based services (LBS) are trained by the deep sparse autoencoder and optimized by the quasi-Newton method. The time cost of community discovery is reduced to improve the efficiency of the algorithm while the accuracy is preserved.
The rest of this paper is organized as follows: Section 2 explains the terminology and algorithms used. In Section 3, the experimental process is introduced: DSACD is validated through multiple experimental sets, including several parameter experiments to optimize the algorithm, and the results are assessed against several evaluation criteria. Section 4 discusses the results and presents an application. In Section 5, we outline directions for future work based on these results.
• We construct a deep sparse autoencoder with the L-BFGS method for community detection to accelerate the clustering of community structures in large data sets.
• In DSACD, a similarity matrix is constructed to reveal the indirect connections between nodes, and a deep sparse autoencoder with the L-BFGS algorithm, based on unsupervised learning, is constructed to reduce the dimension and extract the feature structure of the network. Comparison experiments with four algorithms show a significant improvement of DSACD in terms of three indices: F_same, NMI, and modularity Q.
• We design a novel community detection experiment on a location-based service network. The results show that DSACD can divide the network into 64 correct communities, which demonstrates its practicability.

Preliminaries
The definitions used by DSACD in this paper are given as follows.

Matrix preprocessing
Let G = (V, E) be a graph, where V = {v_1, v_2, ..., v_n} represents the set of nodes (vertices) and E represents the set of edges. Let N(u) be the set of neighbor nodes of node u. Let A = [a_ij]_{n×n} be the adjacency matrix of graph G, whose elements indicate whether there is an edge between two nodes in G: a_ij = 1 indicates that the edge e_ij exists, and a_ij = 0 indicates that it does not.
For a small graph, a clustering algorithm such as K-means can compute the community structure directly from the adjacency matrix A with fairly accurate results (see Section 3). However, the adjacency matrix records only the relationships between adjacent nodes; it does not express the relationship between a node and its non-adjacent, more distant neighbors. Any two nodes may belong to the same community even if they are not directly connected. Therefore, if the adjacency matrix is used directly as the similarity matrix for community partitioning, the complete community relationships cannot be reflected, and information is lost when the adjacency matrix is clustered directly. In this paper, a similarity matrix that also expresses non-adjacent information is calculated by transforming the adjacency matrix. Based on this, the definitions are given as follows.
Definition 1 Let a network graph be G = (V, E), and consider any v ∈ V. If the length of the shortest path from node v_i to another node v_j is s, then node v_i can reach node v_j through s hops. That is, the hop count is the minimum number of traversed edges from node v_i to v_j.
As shown in the network of Fig. 1, node v_1 reaches v_2, v_3, or v_6 after one hop and arrives at v_4 or v_5 after two hops. For instance, from v_1 → v_2, the minimum number of traversed edges is 1, so the hop count is 1; from v_1 → v_5, the minimum number of traversed edges is 2, and the hop count is 2.
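To make Definition 1 concrete, the following sketch computes hop counts with a breadth-first search. The adjacency list mirrors the description of Fig. 1 above, but the exact edge set is an assumption for illustration.

```python
from collections import deque

def hop_counts(adj, source):
    """Return the hop count (minimum number of traversed edges) from
    `source` to every reachable node, given an adjacency list
    {node: set_of_neighbors}."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in hops:          # first visit = fewest edges
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops

# A toy graph consistent with the text: v2, v3, v6 are one hop from v1,
# and v4, v5 are two hops away (the exact edges are assumptions).
adj = {1: {2, 3, 6}, 2: {1, 4}, 3: {1, 5}, 4: {2}, 5: {3}, 6: {1}}
print(hop_counts(adj, 1))  # v2, v3, v6 -> 1 hop; v4, v5 -> 2 hops
```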

Definition 2
In G = (V, E), the similarity between two nodes v_i and v_j is defined by formula 1, where h is the hop count from v_i to v_j and σ is the attenuation factor. The node similarity decreases as the hop count grows toward the hop threshold s; σ controls the attenuation rate of the similarity, and the similarity decays faster as σ increases.

Definition 3
In G = (V, E), the similarity matrix S = [s_ij]_{n×n} is calculated from the node similarity between every pair of nodes in G. The similarity matrix obtained by processing the adjacency matrix with the hop count and the attenuation factor better reflects the relationships between distant nodes in the high-dimensional matrix, and the community discovery results improve accordingly. Obviously, the selection of the hop count threshold and the attenuation factor has an important impact on the similarity matrix. The hop count is selected through a parameter learning process, which is explained in Section 3.6. Section 3 of this paper sets up experiments on these two parameters to explore their impact on the results.
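The following sketch illustrates how a similarity matrix of this kind can be built. Since formula 1 is not reproduced above, the geometric decay `sigma ** h` truncated at the hop threshold `s` is an assumption consistent with the monotonicity stated in Definition 2; substitute the paper's exact expression where it differs.

```python
import numpy as np
from collections import deque

def similarity_matrix(adj_list, n, s=2, sigma=0.5):
    """Build S = [s_ij] from hop counts (Definitions 1-3). The decay form
    sigma**h, truncated at hop threshold s, and the unit diagonal are
    assumptions for illustration."""
    S = np.zeros((n, n))
    for src in range(n):
        hops = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if hops[u] >= s:           # do not search past the threshold
                continue
            for v in adj_list[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        for v, h in hops.items():
            S[src, v] = 1.0 if h == 0 else sigma ** h
    return S
```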

Deep sparse autoencoder
Based on the sparse autoencoder, the structure of the deep sparse autoencoder is shown in Fig. 2. The output of each layer, that is, the code h after dimension reduction, serves as the input of the next layer, so the dimensions are reduced layer by layer.
An autoencoder (AE) [25] is an unsupervised artificial neural network that learns an efficient encoding of data to express its features; the typical usage of the AE is dimensionality reduction. As shown in Figs. 3 and 4, given an unlabeled data set {x^(i)}, i = 1, ..., m, the autoencoder learns a nonlinear code through a two-layer neural network (the input layer is not counted) to express the original data. The training process [25] uses the back-propagation algorithm, and training ends when the difference between the reconstruction produced from the learned nonlinear code and the original data is minimized.

Definition 4
The autoencoder is composed of two parts: the encoder and the decoder. The encoding process runs from the input layer to the hidden layer: the input data are reduced in dimension to form a code, which is the output of the encoder. The code is then used as the input of the decoder; the decoded result has the same dimension as the input data and is the output of the decoder. The output is compared with the input, the reconstruction error is calculated, and the back-propagation algorithm is used to adjust the weight matrices of the autoencoder. The reconstruction error is recalculated and the process iterates until the iteration limit is reached or the reconstruction error falls below a specified threshold, at which point the output is equal or close to the input. Training a neural network with the back-propagation algorithm in this way is also referred to as minimizing the reconstruction error. Finally, the output of the encoder, i.e., the code, is taken as the output of the autoencoder.
The specific steps are as follows. Let X ∈ R^(n×n) be the similarity matrix of the network graph G, used as the input matrix, where x_i ∈ (0, 1) and x_i ∈ R^(n×1) represents the ith column vector of X. W_1 ∈ R^(d×n) is the weight matrix of the input layer [26], and W_2 ∈ R^(n×d) is the weight matrix of the hidden layer [27]. b ∈ R^(d×1) is the bias column vector of the hidden layer [27], and c ∈ R^(n×1) is the bias column vector of the input layer [27]. The output h of the coding layer is obtained by formula 2:

h_i = f(W_1 x_i + b)   (2)

where h_i ∈ R^(d×1) is the encoded ith column vector and f is the activation function. The sigmoid function [28] is chosen as the activation function, as shown by formula 3:

f(z) = 1 / (1 + e^(−z))   (3)

The matrix h obtained at this point is the matrix after dimensionality reduction. The output z of the decoding layer is obtained by formula 4:

z_i = f(W_2 h_i + c)   (4)

where z_i ∈ R^(n×1) is the decoded ith column vector. The resulting matrix z has the same dimension as the input matrix X.
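A minimal NumPy sketch of the encode/decode pass of formulas 2-4; the sizes n and d and the random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass for one column vector x (n x 1):
# encode h = f(W1 @ x + b), decode z = f(W2 @ h + c).
# Shapes follow the text: W1 in R^(d x n), W2 in R^(n x d).
n, d = 8, 3
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(d, n))
W2 = rng.normal(scale=0.1, size=(n, d))
b = np.zeros((d, 1))
c = np.zeros((n, 1))

x = rng.uniform(size=(n, 1))       # a column of the similarity matrix
h = sigmoid(W1 @ x + b)            # formula 2: the d-dimensional code
z = sigmoid(W2 @ h + c)            # formula 4: reconstruction, same shape as x
print(h.shape, z.shape)            # (3, 1) (8, 1)
```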
Combining formula 2 with formula 4, the reconstruction error is obtained as formula 5:

J(W, b, c) = (1/2n) Σ_{i=1}^{n} ||z_i − x_i||^2   (5)

When the activation function is sigmoid, the mapping range of the neurons is (0, 1). When the output is close to 1, the neuron is called active, and when the output is close to 0, it is called inactive [29]. In a sparse autoencoder, a sparseness restriction is added to the hidden layer: neurons are suppressed most of the time, that is, their output is close to 0. Sparse representations have been successfully applied in many areas, such as target recognition [30, 31], speech recognition [32], and behavior recognition [33]. The sparsity is calculated as follows. First, the average output of the coding layer, ρ̂_j, is computed, where h_j(x) denotes the output of the jth hidden neuron h_j when the input is x [34]. The average output of the jth hidden neuron is given by formula 6:

ρ̂_j = (1/n) Σ_{i=1}^{n} h_j(x_i)   (6)

To achieve sparsity, a sparsity constraint is added, as formula 7:

ρ̂_j = ρ   (7)

where ρ is the sparsity parameter, generally ρ ≪ 1, e.g., ρ = 0.05. When formula 7 is satisfied, the activation values of the hidden layer neurons are mostly close to 0.
A sparsity limit is added to the reconstruction error, that is, a penalty term is added that punishes ρ̂_j for deviating from ρ. The penalty function is formula 8:

Σ_{j=1}^{d} [ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j))]   (8)

where d represents the number of hidden layer neurons. This formula is based on the Kullback-Leibler divergence (KL) [35], so it can also be written as formula 9:

KL(ρ || ρ̂_j) = ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j))   (9)

In summary, formulas 8 and 9 are combined to obtain formula 10:

Σ_{j=1}^{d} KL(ρ || ρ̂_j)   (10)

When ρ̂_j = ρ, the penalty KL(ρ || ρ̂_j) is 0; as ρ̂_j moves away from ρ, the function increases monotonically and tends to infinity, as shown in Fig. 5.

Fig. 5 KL divergence function
Therefore, minimizing the sparsity penalty term, i.e., formula 10, drives ρ̂_j close to ρ. The reconstruction error is accordingly updated to formula 11:

J_sparse(W, b, c) = J(W, b, c) + β Σ_{j=1}^{d} KL(ρ || ρ̂_j)   (11)

where β is the weight of the sparsity penalty factor. Training the sparse autoencoder minimizes this reconstruction error, formula 11, by the back-propagation algorithm.
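The sparse objective of formula 11 can be sketched as follows. The 1/m scaling of the reconstruction term and the value beta = 3.0 are assumptions, since the paper's exact constants are not reproduced above.

```python
import numpy as np

def kl(rho, rho_hat):
    """Formula 9: KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparse_loss(X, Z, H, rho=0.05, beta=3.0):
    """Formula 11 (a sketch): mean squared reconstruction error plus the
    beta-weighted KL sparsity penalty. X and Z are n x m input and
    reconstruction matrices, H is the d x m matrix of hidden activations."""
    m = X.shape[1]
    recon = 0.5 * np.sum((Z - X) ** 2) / m       # formula 5 (assumed scaling)
    rho_hat = H.mean(axis=1)                     # formula 6: mean activation
    penalty = np.sum(kl(rho, np.clip(rho_hat, 1e-8, 1 - 1e-8)))
    return recon + beta * penalty                # formula 11
```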

The deep sparse autoencoder for community detection
Based on the deep sparse autoencoder shown in Fig. 2, the data are preprocessed first, and the similarity matrix S_0 ∈ R^(n×n) is obtained by formula 1. The similarity matrix is used as the input of the deep sparse autoencoder. Then, the number of layers T and the number of nodes per layer are set, and S_0 is fed into the first sparse autoencoder, whose hidden layer has d_1 nodes. After the first layer of training, the reduced matrix S_1 ∈ R^(n×d_1) is obtained; S_1 is then input into the second layer of the deep sparse autoencoder to obtain S_2 ∈ R^(n×d_2), and so on, until the last layer. The low-dimensional feature matrix S_T ∈ R^(n×d_T) is obtained, and finally, the communities are obtained by K-means clustering. See Algorithms 1 and 2 for the detailed process.
In Algorithm 1, the hop count threshold S, the attenuation factor σ, and formula 1 are used to compute the similarity matrix of A ∈ R^(n×n): for each node x, Algorithm 1 computes the similarity of x with every other node in V to obtain the similarity matrix.
Algorithm 2 uses the deep sparse autoencoder with L-BFGS, in which the number of layers is T, to reduce the dimension of the similarity matrix and extract its features, obtaining the low-dimensional feature matrix S_T ∈ R^(n×d_T).
Finally, the K-means algorithm is applied to S_T to obtain the clustering result Coms = {C_1, C_2, ..., C_k}, which is returned.
In the proposed algorithm, the inputs include the adjacency matrix A ∈ R^(n×n) of G = (V, E); k, the number of communities; S, the hop count threshold; σ, the attenuation factor; and T, the number of layers of the deep sparse autoencoder together with the number of nodes in every layer.
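Putting the pieces together, the following sketch mirrors the pipeline of Algorithms 1 and 2: layer-wise sparse-autoencoder reduction trained with L-BFGS, followed by K-means. It is illustrative only; scipy's finite-difference gradients stand in for the analytic backpropagation the paper uses, and all hyperparameter values are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.cluster import KMeans

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_layer(S, d, rho=0.05, beta=3.0, maxiter=50):
    """Train one sparse autoencoder layer on S (features x samples) and
    return the d-dimensional codes."""
    f, m = S.shape
    def unpack(theta):
        i = 0
        W1 = theta[i:i + d * f].reshape(d, f); i += d * f
        W2 = theta[i:i + f * d].reshape(f, d); i += f * d
        b = theta[i:i + d].reshape(d, 1); i += d
        c = theta[i:i + f].reshape(f, 1)
        return W1, W2, b, c
    def loss(theta):
        W1, W2, b, c = unpack(theta)
        H = sigmoid(W1 @ S + b)                              # formula 2
        Z = sigmoid(W2 @ H + c)                              # formula 4
        rho_hat = np.clip(H.mean(axis=1), 1e-6, 1 - 1e-6)    # formula 6
        kl = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        return 0.5 * np.sum((Z - S) ** 2) / m + beta * kl    # formula 11
    theta0 = np.random.default_rng(0).normal(
        scale=0.1, size=2 * d * f + d + f)
    res = minimize(loss, theta0, method="L-BFGS-B",
                   options={"maxiter": maxiter})
    W1, _, b, _ = unpack(res.x)
    return sigmoid(W1 @ S + b)

def dsacd(S0, layer_dims, k):
    """Algorithm 2 in outline: reduce S0 (n x n, columns are node vectors)
    layer by layer, then cluster the final codes with K-means."""
    H = S0
    for d in layer_dims:        # e.g. layer_dims = [64, 32, 16]
        H = train_layer(H, d)   # after each layer: d x n
    return KMeans(n_clusters=k, n_init=10).fit_predict(H.T)
```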

Experimental design
Since this experiment tests a community detection algorithm, data sets with ground-truth communities are selected for verification, so that the accuracy of the algorithm can be analyzed and verified precisely.
This experiment used four real data sets: Strike [36], Football [37], LiveJournal [38], and Orkut [39]. Strike records the relationships among 24 strikers in a wood-processing project; an edge is added between two people if they frequently discussed strike topics (specific evaluation criteria were applied during the investigation and are not detailed here). Football is the schedule of the 2006 American college football (FBS) season organized by the National Collegiate Athletic Association (NCAA); in the NCAA relationship network, an edge is established between two teams if they played a game.
LiveJournal is a free online blogging community where users can add friends and create groups; when collecting community information, groups are classified according to cultural background, entertainment preferences, sports, games, lifestyle, technology, etc. Orkut is a social networking service launched by Google, from which friend relationships and friend groups can be constructed.
From these social networks, nearly 4 million nodes and 30 million edges were extracted, and the 8 communities with the largest numbers of nodes were selected as experimental data sets. Detailed information on each experimental set is shown in Tables 1 and 2.

Evaluation index
To determine whether the clustering result is accurate, the clustering result Coms = {C_1, C_2, ..., C_k} must be evaluated. We select F_same [3] and NMI as evaluation methods; both evaluate against the real communities GroundTruth = {C_1, C_2, ..., C_l}, where l is the true number of communities. Moreover, modularity Q [39, 40] is used to evaluate the quality of the communities.
Evaluation standard F_same: The community evaluation standard F_same is obtained by calculating the intersection of each real community with each clustered community and averaging these values. The formula is as follows, where n is the number of nodes in the graph G.

Evaluation standard NMI:
The NMI is the normalized mutual information. In its standard form,

NMI = −2 Σ_{i=1}^{l} Σ_{j=1}^{k} N_ij log(N_ij n / (N_i· N_·j)) / [Σ_{i=1}^{l} N_i· log(N_i· / n) + Σ_{j=1}^{k} N_·j log(N_·j / n)]

where N is the confusion matrix whose rows represent the real communities and whose columns represent the communities found. N_ij represents the number of nodes shared by the real community C_i and the discovered community C_j, N_·j represents the sum of all the elements in column j, N_i· represents the sum of all the elements in row i, and n is the total number of nodes. If the discovered communities agree completely with the real communities [41], the NMI value is 1. If they are completely different, the NMI value is 0.
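A sketch of the NMI computation from the confusion matrix, using the standard normalized-mutual-information form for community comparison, which we assume matches the paper's elided formula.

```python
import numpy as np

def nmi(confusion):
    """NMI from a confusion matrix whose rows are ground-truth communities
    and columns are detected communities."""
    N = confusion.sum()                 # total number of nodes
    Ni = confusion.sum(axis=1)          # row sums N_i.
    Nj = confusion.sum(axis=0)          # column sums N_.j
    num = 0.0
    for i in range(confusion.shape[0]):
        for j in range(confusion.shape[1]):
            if confusion[i, j] > 0:
                num += confusion[i, j] * np.log(
                    confusion[i, j] * N / (Ni[i] * Nj[j]))
    den = (Ni * np.log(Ni / N)).sum() + (Nj * np.log(Nj / N)).sum()
    return -2 * num / den

# Identical partitions give NMI = 1:
print(nmi(np.array([[5, 0], [0, 7]])))   # 1.0
```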

Evaluation standard Q:
Modularity Q is a measure of how well the communities are found:

Q = Σ_{i=1}^{k} [E_i^in / m − ((2 E_i^in + E_i^out) / (2m))^2]

where E_i^in is the number of inner edges of the community C_i, E_i^out is the number of outer edges of the community C_i, and m is the total number of edges in the graph G.
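A sketch of Q from the per-community edge counts defined above; this quadratic form is the standard per-community rewriting of modularity, assumed to match the elided formula.

```python
def modularity(communities_edges, m):
    """Q from per-community (E_in_i, E_out_i) pairs and total edge count m."""
    Q = 0.0
    for e_in, e_out in communities_edges:
        Q += e_in / m - ((2 * e_in + e_out) / (2 * m)) ** 2
    return Q

# Two well-separated communities of 10 internal edges each, joined by 1 edge:
print(modularity([(10, 1), (10, 1)], m=21))  # ~0.452
```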

Analysis experiments
The experiment consists of four parts: the volatility exploration experiment for DSACD, the comparison experiment with other algorithms, the parameter experiment, and the visualization experiment. The volatility exploration experiment shows the fluctuation of our algorithm on different data sets, which explains the stability of our algorithm on large data sets. We compare DSACD with CoDDA and K-means based on three performance evaluation standards: F_same, NMI, and modularity Q. To explain the parameter selection, the parameter experiment is given. Finally, we use the visualization experiment to show the clustering results.

Volatility exploration analysis
According to Algorithm 1, community discovery was performed on the four data sets, and the results were evaluated using F_same, NMI, and Q. However, since the selection of the center points of the K-means algorithm is random, and the weight matrices of the hidden and output layers of the deep sparse autoencoder are also initialized with random numbers, the proposed algorithm has inherent randomness. To assess the smoothness of the results, this paper investigated the fluctuation of the data; the results are displayed in Fig. 6.
The variance of each data set is shown in Table 3. Over 100 runs on the Strike and Football data sets, Fig. 6 and Table 3 show that the clustering results fluctuate. Taking the NMI value as an example, the small data set [36] has a variance of 26.46 and the larger data set [37] has a variance of 0.96, showing that multiple runs are needed on small data sets to reduce the impact of fluctuations. Moreover, the variance decreases as the data set grows, which indicates that the algorithm is more stable on large data sets, and the number of repetitions for different experimental sets can be changed flexibly. Table 4 describes the comparison of the parametric experimental cluster results; it shows that the proposed deep sparse autoencoder for community detection can significantly improve the clustering results and quality.

Algorithm comparison
In this experiment, the K-means algorithm, direct clustering of the similarity matrix, the CoDDA algorithm, and DSACD were compared, and the NMI value was used for evaluation. The hop count threshold and attenuation factor of the CoDDA algorithm and the settings of the deep sparse autoencoder use the optimal values in Table 5. Table 6 shows the experimental results.
Note: the number of iterations of the Football and Strike datasets is 100, and the number of iterations of the LiveJournal dataset is 5.
As Table 6 shows, DSACD achieves higher cluster accuracy and quality; with a suitable back-propagation algorithm, both the CoDDA algorithm and the DSACD algorithm achieve high precision, which is consistent with the results of paper [25]. To compare the differences, Table 7 lists the error values of the last iteration for the two back-propagation algorithms. The results show that the CoDDA algorithm reduced the error to 17 in the process of minimizing the reconstruction error, while DSACD finally decreases it to 7.9. Both algorithms provide good performance.
Finally, DSACD is compared with the LPA [3] in Table 8; DSACD is significantly better than the LPA.
However, due to the characteristics of deep learning, DSACD requires more time during training, as shown in Table 9; on large data sets, the computing time grows exponentially. The weakness of CoDDA is also shown: its calculation time is much longer than that of DSACD, and CoDDA can require a week or more to process a larger matrix [13]. In addition, although the calculation time of DSACD is small, it requires a large amount of memory, at least 128 GB for a network of tens of thousands of nodes, whereas CoDDA can run normally on a computer with an ordinary configuration.

Parameter analysis
The deep sparse autoencoder for community detection (DSACD) contains three important parameters: the hop threshold (S) in the similarity matrix, the attenuation factor (σ), and the number of layers in the deep sparse autoencoder (T). These three parameters directly affect the clustering results, so this section sets up experiments to find the optimal values. The experimental procedure, sketched below, is as follows. First, a value is preselected for each parameter. The experiments then vary the hop count threshold, the attenuation factor, and the number of layers in order, each experiment repeated 5 or 100 times, and the best results are used as the parameters in the next experiments. After the first round, the three optimal values obtained are reused as the experimental inputs to refine the parameters; after the second round, the optimal parameters are output. Each parameter is initialized as s=1, σ=0.1, and T=1; the initial values are chosen at random, and the minimum value is selected as the starting parameter.
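A compact sketch of this two-round, one-parameter-at-a-time search. The `evaluate` callback (which would run DSACD and return, e.g., an NMI score) and the candidate grids are assumptions for illustration.

```python
def coordinate_search(evaluate, grids, init, rounds=2):
    """Optimize one parameter at a time, fixing the others at their
    current best values; repeat for the given number of rounds."""
    best = dict(init)                  # e.g. {"s": 1, "sigma": 0.1, "T": 1}
    for _ in range(rounds):
        for name, candidates in grids.items():
            scores = {}
            for value in candidates:
                trial = dict(best, **{name: value})
                scores[value] = evaluate(**trial)   # run DSACD, return NMI
            best[name] = max(scores, key=scores.get)
    return best

# Hypothetical usage:
# best = coordinate_search(run_dsacd_nmi,
#                          grids={"s": [1, 2, 3], "sigma": [0.1, 0.3, 0.5],
#                                 "T": [1, 2, 3]},
#                          init={"s": 1, "sigma": 0.1, "T": 1})
```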
The first round of results from the above table is shown in Fig. 7, which shows that the proposed deep sparse autoencoder for community detection has a better clustering effect than the K-means clustering algorithm, although the benefit of deep learning on the Strike and Football data sets does not seem obvious: the clustering results of the proposed method and those of the similarity matrix are similar, and the gap is not obvious in the parameter experiments on the attenuation threshold or the layer number. However, after a round of experiments on big data sets, the advantages are already evident. The second-round results are shown in Fig. 8.
After the second-round parameter experiment, we find that the similarity matrix clustering result is the best on the Strike data set. The proposed deep sparse autoencoder does not improve the clustering result during deep learning but rather degrades it, as shown in Fig. 8c. Therefore, on small data sets, it suffices to use the similarity matrix to process the adjacency matrix of the graph.
On the Football data set, the proposed DSACD slightly improves the clustering results during deep learning, and the highest value is obtained by the proposed deep sparse autoencoder. The second round of results is significantly better than the first round in clustering quality.
Meanwhile, on the LiveJournal data set, the accuracy of the proposed DSACD is significantly improved: after deep learning, the NMI value increases from 0.7111 to 0.8171, an improvement of approximately 13%. Moreover, the top NMI value gradually increases during the parameter experiment, reflecting the necessity and superiority of deep learning.
Table 5 lists the values of S, σ, and T before and after the experiments on the four data sets. For Strike and Football, the average value of every parameter is determined over 100 repetitions, whereas for LiveJournal the average value is determined over only 5 repetitions. The repetition counts are chosen to match the parameter standard used for CoDDA.

Visualization results
This experiment visually compares the real communities (Ground-Truth) with the K-means algorithm, the hop-based clustering method (hop), CoDDA, and DSACD, so that the clustering of different communities can be observed and evaluated intuitively. Figures 9 and 10 show the results; the same color represents the same community, and different colors represent different communities. As seen in Fig. 9, the clustering results of the K-means algorithm are not accurate: the green communities are largely merged into the yellow communities, which obviously does not conform to the real situation. Both the hop-based algorithm and the deep sparse autoencoder-based algorithm (DSACD) produce accurate clusters that agree well with the real-world results [38], indicating the accuracy of the algorithm. Furthermore, because the results of the hop-based algorithm (hop) and the deep sparse autoencoder algorithm (DSACD) are consistent, deep learning contributes little on small data sets; improperly selected parameters may even lose information. On such data sets, the hop count threshold and the attenuation factor should be the focus of improving the clustering effect.
As Fig. 10 shows, the K-means algorithm cannot cluster the 180-node data set accurately. Note that the K-means algorithm can be used on small graphs but fails on slightly more complex networks, because the adjacency matrix still carries so much information that the community boundaries remain unclear. After the similarity matrix processing is added, nearby nodes become connected, so the clustering quality improves significantly. For this data set, because the community structure is very obvious, the similarity matrix obtains good clustering results; the community structure is roughly recovered but can still be confused for communities that are close together. After deep learning, communities with an obvious structure are almost all clustered successfully, and only a few nodes without a clear community structure still fail to cluster.
According to Fig. 11, there are 8 communities in the LiveJournal data set, and K-means clustering produces one monster community. The hop-based processing [13] can successfully cluster nodes with an obvious community structure, but two communities with high similarity, that is, with numerous edges between them, cannot be handled correctly, and the links between two closely related communities are easily clustered into a third community. As shown in Fig. 12, the yellow community section should be green. In addition, for nodes with fewer neighbors, the green node group on the left side of the figure should be pink, but it is not clustered successfully. After the network features are learned, the left green node group and the pink node group are successfully merged in the DSACD diagram; the clustering accuracy is further improved for the green node community, and finally, the communities with an obvious structure are all clustered. The experiments in Section 3.5 compared the time and results of DSACD and CoDDA. To visualize the results of the two algorithms, the LiveJournal data set is extracted and the results are compared visually. For the deep sparse autoencoder for community detection, except for the four highly coupled communities in the upper right and upper left, the other four communities are clustered accurately. The same situation appears in CoDDA: four communities are obviously clustered successfully, while some mistakes emerge during clustering, such as one community split into two or two communities merged into one. Overall, the effects of the two algorithms are close.
The clustering based on the K-means algorithm is random, especially on small data sets, because the boundary between two communities is not easy to judge. The clustering results change with the selection of the initial points; in particular, whether influential points are selected directly affects the results. For small data sets, the results must be averaged over many runs, and the final results converge to a certain value. As the data sets grow, the sizes of the community groups also expand and the proportion of nodes in the border area relative to the entire graph decreases, so the clustering results tend to be stable.
In the community detection algorithm based on the deep sparse autoencoder and L-BFGS, the parameters need to be selected. For a small-scale network, the hop threshold is smaller: because the communities are small, the influence of distant node relationships is limited, an appropriate hop threshold is 2 to 3, and the calculation takes less time. For a large-scale network, the hop threshold increases correspondingly: the community structure is obvious, the scale is large, and the relationships between nodes are complex, so the corresponding influence grows, and the hop threshold can be taken from 6 to 8. At the same time, large-scale data sets need several rounds of dimension reduction to improve the accuracy of community feature extraction. However, no parameter should be too large or too small; otherwise, it leads to data redundancy or missing data. This also demonstrates the necessity of the parameter experiments. Compared with other algorithms or with itself, DSACD has higher accuracy. The similarity matrix plays the dominant role on small data sets; on large data sets, a further dimension reduction based on the deep sparse autoencoder is needed for feature extraction. In the training process, either the CoDDA scheme or the DSACD scheme can be used for back-propagation. CoDDA does not need to calculate the Hessian matrix and saves memory, but it takes a long time; DSACD computes more accurately but requires more memory, and on large data sets the program may crash due to insufficient memory, so whether the hardware configuration meets the requirements should be determined in advance. The accuracy of DSACD is slightly higher than that of the CoDDA algorithm.
In the visualization software [42], the results of K-means clustering can hardly be separated into communities, resulting in the emergence of a giant, monster community. After the similarity matrix is calculated, the community structure appears. Finally, dimension reduction by the deep sparse autoencoder separates similar communities more accurately and further improves the clustering accuracy.

Application on the indoor positioning system
In this section, an application test with benchmarking data from our indoor positioning system is designed. Figure 13 depicts a public place in our library, with a total area of 100 m² and a measured area of 64 m². Four APs are installed in the four corners, and 64 points are set in the place. There are 9379 RSSI records collected with a MI Note 2 smartphone. Our DSACD is used to gather the records into 64 communities, and then the distance between every point and each AP is obtained by the log-distance path loss model [43]. Formula 15 is the logarithmic distance ranging model for indoor wireless signal transmission:

RSSI(d) = RSSI(d_0) − 10β log_10(d/d_0) + X_σ   (15)
In formula 15, RSSI(d_0) represents the signal intensity at the reference distance d_0 between the AP and the signal source, X_σ is a random variable that obeys a normal distribution N(0, σ_dB²), and β represents the path loss factor, which is usually set to 3 or 4 for indoor environments.
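A sketch of distance recovery under formula 15 (ignoring the noise term) and of the ΔdB-to-distance-ratio conversion derived below as formula 19. The reference values RSSI(d_0) = -40 dBm at d_0 = 1 m and beta = 3 are assumptions for illustration.

```python
# Formula 15 (noise-free): RSSI(d) = RSSI(d0) - 10 * beta * log10(d / d0).
def estimate_distance(rssi, rssi_d0=-40.0, d0=1.0, beta=3.0):
    """Invert the log-distance model to recover distance from a reading."""
    return d0 * 10 ** ((rssi_d0 - rssi) / (10 * beta))

# Formula 19: the distance ratio implied by a signal difference of
# delta_dB between two APs in the same localization area.
def distance_ratio(delta_dB, beta=3.0):
    return 10 ** (delta_dB / (10 * beta))

print(estimate_distance(-55.0))   # ~3.16 m with the assumed parameters
print(distance_ratio(6.0))        # d2/d1 ~ 1.58
```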
Suppose the distances between the two APs and the signal source are d_1 and d_2, and the signal intensity difference between them is ΔdB. Writing formula 15 for d_1 and d_2 gives formulas 16 and 17.
Since the two APs are in the same localization area, let X_σ1 = X_σ2. Subtracting formula 16 from formula 17 gives formula 18:

ΔdB = RSSI(d_1) − RSSI(d_2) = 10β log_10(d_2/d_1)   (18)

Fig. 13 The test scenario in our library

Converting formula 18 yields formula 19:

d_2/d_1 = 10^(ΔdB/(10β))   (19)

As shown in formula 19, ΔdB determines the distance relationship between the two APs. In formula 20, FingerPrint_i represents the signal intensity between the ith AP and the m signal sources, and MinFingerPrint_i represents the weakest signal intensity between the ith AP and the jth signal source.
The geometric meaning of a fingerprint is the distance between the reference point and the AP. The fingerprint localization model is divided into two stages: offline and online. During the offline stage, the localization area is divided into clusters by DSACD using the technique proposed in this article, and a binary classification of the APs by the K-means algorithm in each subarea selects the available APs. During the online stage, the subarea containing the target point is selected using the nearest-neighbor (NN) algorithm, and then the coordinates of the target point are calculated.
DSACD is used in the offline stage. The 64 test points are selected as shown in Fig. 13, and the fingerprint database is divided into 64 subareas with the 64 reference points as centroids. The operation proceeds as follows: the signals collected in the whole region are transformed into fingerprints, and the fingerprints are used as inputs of DSACD to realize the regional classification. The effective fingerprint components are extracted from all the fingerprints in each subregion, the fingerprints are transformed into distance fingerprints according to the fingerprint transformation model, and finally, the fingerprint database of the subregion is formed. Figure 14 indicates the average errors of the distance between every point and each AP. The average distance errors between the 64 points and the 4 APs follow a normal distribution, in accordance with the laws of nature; meanwhile, a cycle occurs every eight points, in accordance with the collection layout, since Fig. 13 shows that every column has 8 points. In addition, the 64 average errors show that nodes with a distance error less than 0.5 from AP1 account for 26.6% of the total number of nodes, nodes with a distance error between 1.5 and 2 from AP1 account for 15.63%, nodes with a distance error less than 0.5 from AP2 account for 21.88%, nodes with a distance error between 1.5 and 2 from AP2 account for 10.94%, nodes with a distance error less than 0.5 from AP3 account for 21.88%, nodes with a distance error between 1.5 and 2 from AP3 account for 15.63%, nodes with a distance error less than 0.5 from AP4 account for 10%, and nodes with a distance error between 1.5 and 2 from AP4 account for 21.88%. Among the 4 APs, the highest share of low-error nodes, 21.88%, comes from AP2 and AP3, while the highest share of high-error nodes, 21.88%, comes from AP4. During the measurements, there was voltage signal interference near AP4, and the calculation results confirm the plausibility of this explanation.
In the collection environment, the factors that strongly affect the measured data are temperature, angle, humidity, and crowd density. Sixty-four communities are gathered by DSACD, and the log-distance path loss model is then used in every community to obtain the distance between every point and each AP. The achieved average errors satisfy the needs of localization, which is a useful reference for future research on real-time intelligent navigation and positioning.

Conclusion
This paper proposed a novel deep sparse autoencoder-based community detection method (DSACD) and compared it with the K-means, hop, CoDDA, and LPA algorithms. Experiments show that for complex network graphs, dimensionality reduction by the similarity matrix and the deep sparse autoencoder can significantly improve clustering results. Several issues persist and require further research. The cost of the similarity matrix calculation increases with the matrix size, which leads to large memory consumption and high requirements for experimental equipment; the many temporary variables in the back-propagation algorithm also consume memory. A decomposition strategy for large matrices in the similarity calculation is expected in future studies.