Artificial immune network clustering based on a cultural algorithm

Data mining technology has been applied in many fields. Prototype-based cluster analysis is an important data mining method, but its ability to discover knowledge is limited because of the need to know the number of target data categories and cluster prototypes in advance. Artificial immune evolutionary network clustering is a clustering method based on network structure. Compared with prototype-based cluster analysis, it has the advantage of realizing unsupervised learning and clustering without any prior knowledge of data. However, artificial immune evolutionary network clustering also has problems such as a lack of guidance in the clustering process, fuzzy boundary sensitivity, and difficulty in determining parameters. To solve these problems, an artificial immune network clustering algorithm based on a cultural algorithm is proposed. First, three kinds of knowledge are constructed: normative knowledge is used to regulate the spatial range of population initialization to avoid blindness; state knowledge is used to distinguish the type of antigen, and immune defense measures are taken to prevent the network structure caused by noise and boundaries from being unclear; and topology knowledge is used to guide the antigen for optimal antibody search. Second, topology knowledge in the cultural algorithm is used to characterize the distribution of antigens and antibodies in space, and elite learning is used to improve the traditional clone mutation operator. Based on the shadow set theory, a method for adaptively determining the compression threshold is proposed. Finally, the results of simulation experiments show that the proposed algorithm can effectively overcome the above problems, and the clustering performances on a synthetic dataset and an actual dataset are satisfactory.


Introduction
Data mining is the process of discovering specific information hidden in massive databases through various algorithms [1]. With the accumulation of massive data brought by the development of information technology, data mining technology has been used in various fields to transform these data into useful information [2,3]. In the field of cloud services, work [4] proposed a distributed cloud service method based on distributed sensitive hashing in multisource data. Work [5] proposed a big data-driven mashup building method that supports economic software development. The contributions of this paper are as follows.
1. We propose using the cultural algorithm to guide the aiNet clustering algorithm and use the topology knowledge in the cultural algorithm to characterize the distribution of antigens and antibodies in the space, which greatly reduces the search complexity of the algorithm.
2. Based on the concept and theory of the shadow set, we propose a method for adaptively determining the compression threshold, which helps the algorithm reach a solution quickly.
3. Drawing on the immune defense suppression measures adopted in medicine to avoid excessive defense, we propose a new immune defense mode for the algorithm, which improves the flexibility of its application.
The organizational structure of the paper is as follows. First, we discuss related work in Section 2. Subsequently, we introduce the structure of the CaiNet algorithm in Section 3: we define three kinds of knowledge in Section 3.1 and design and improve the operators in Section 3.2, where we simplify the optimal antibody search for antigens through topology knowledge, formulate the mutation rules of the algorithm, propose a method for adaptively determining the compression threshold based on the shadow set, and formulate the immune defense criteria of the CaiNet algorithm. The steps of the CaiNet algorithm are summarized in Section 4. Finally, we select three types of datasets in Section 5 for simulation experiments and evaluate the stability and convergence performance of the CaiNet algorithm.

Related work
Artificial immune evolutionary network clustering is a clustering method based on network structure [28]. Compared with the prototype clustering method, it can achieve real unsupervised learning and clustering. Based on the basic aiNet algorithm, Li Jie et al. introduced the concept of taboo cloning in immunology to the artificial immune network clustering algorithm, which solved the problem that aiNet cannot handle the fuzzy boundary of the sample subset [29]. Considering the problem of memory network dynamics and irregular changes caused by the lack of objective function guidance in the aiNet algorithm, Guo Jianhua et al. established the overall objectives and constraints of the memory network by defining quality evaluation standards, thus realizing guidance of the algorithm, and discussed the value of the compression threshold [30]. To overcome the problem that the monoclonal algorithm easily falls into a local optimum, Zhou Yang et al. proposed an evolutionary immune network clustering algorithm based on polyclonal algorithms [31]. Ma Li et al. applied a variety of artificial immune system operators to the clustering process. Based on the basic principles of biological immunity and cloning, they proposed an adaptive multiclone clustering algorithm that automatically adjusts clustering categories by setting affinity functions, increasing the diversity of the antibody population to expand the search range of the solution and avoid premature convergence of the algorithm [32]. Experiments have shown that, for unbalanced datasets, clusters with few samples easily go undetected when a large taboo threshold is used. However, for the improved aiNet algorithms there is no unified understanding of how to set the death threshold, compression threshold, and taboo threshold that must be supplied as input. In many cases, these need to be determined according to the characteristics of the data, which makes the algorithm more difficult to apply.
This paper defines three kinds of knowledge in the CaiNet algorithm: normative knowledge, topology knowledge, and state knowledge. Normative knowledge provides a code of conduct for evolution, topology knowledge guides the expansion of the network in different spaces, and state knowledge is used to control the strength with which antigens in different states activate the network. In this paper, topological units are used to form the topology knowledge in the cultural algorithm, which simplifies the optimal antibody search step and supports new mutation rules, overcoming limitations of the traditional algorithm. Determining the compression threshold is a difficulty for most algorithms. Based on the concept and theory of the shadow set, this paper proposes an adaptive determination method for the compression threshold, which helps the algorithm reach a solution quickly. In the traditional algorithm, boundary data tend to make the structure of the immune network unclear; however, if such data do not activate the immune network at all, the network may be unable to accurately express the distribution of antigens. This paper therefore draws on the immune defense suppression measures taken in medicine to avoid excessive defense: noise, boundary antigens, and antigens inside a cluster are treated with three different methods.
To test the effect of the new algorithm, we select three UCI datasets as the experimental objects and compare the average accuracy and variance of the algorithms. The experimental results show that the stability, convergence, and overall performance of the new algorithm improve significantly.

Method
Cultural algorithms use a trust space and a population space for double-layer evolution. The population space forms different types of knowledge through trial and error; this knowledge is processed in the trust space and then guides the evolution of the population space. The designed algorithm structure is shown in Fig. 1.

Background knowledge
In this algorithm, three kinds of knowledge are defined. Normative knowledge defines the interval range of the antigens and of each generation of antibodies and provides a behavioral rule for evolution. Topology knowledge expresses the distribution of antigens and antibodies in the search space and provides suggestions for immune recognition that help guide the expansion of the network in different spaces. State knowledge records the different states that the antigens may be in and is used to control the strength with which antigens in different states activate the network.

Spatial relationship
Definition 1 Antibody-antigen affinity. Antibody-antigen affinity measures the affinity between an antibody and an antigen and is described in detail in formula (1).
In the formula, $\|\cdot\|$ represents the Euclidean distance, $G$ represents the antigen collection, and $g_i$ represents a single data sample. $B_k$ represents the $k$-th immune network, that is, the antibody collection, and $b_{k,j}$ represents the $j$-th antibody in the $k$-th network.
Definition 2 Antibody-antibody affinity. Antibody-antibody affinity is expressed by the Euclidean distance $d_{i,j}$ between antibodies; these distances form the affinity matrix of the network, and $N_k$ represents the number of antibody neurons in the $k$-th network.
Definition 3 Clone operation. The clone operation selects a part of the antibodies with a high affinity to copy. For antibody $b_i$, the clone operation is expressed in terms of $N_c$, the total antibody size after cloning; $A_i$, the affinity of the $i$-th antibody; and $N$, the number of antibodies participating in the clone.
Definition 4 Normative knowledge. Normative knowledge records the spatial range of antibody production. One range is the value interval of each dimension of the antigens, and the other is the value interval of each dimension of the memory network antibody neurons; they are represented by $N_0$ and $N_t$, respectively, and formally defined as $N_0 = \{[l_i, u_i]\}$ and $N_t = \{[l_i^t, u_i^t]\}$, where $l_i$ represents the lower bound, $u_i$ represents the upper bound, and $i$ represents the $i$-th dimension. $N_0$ is static knowledge and does not change throughout the clustering process; $N_t$ is dynamic knowledge, which changes with each network change. The superscript $t$ of $l_i^t$ and $u_i^t$ in $N_t$ represents the number of iterations.
As internal images of the antigens, antibody neurons are generally distributed in the space determined by all antigens; therefore, the antibody population should lie within the space determined by $N_0$ during initialization. As the network evolves, it should gradually converge, because such a network is more refined and its clustering is more obvious. To achieve this goal, $N_t$ is used to guide the initialization of the antibody population. When the population is initialized, most of the antibodies (80%) are generated in the space specified by $N_t$. To avoid a suboptimal solution, the remaining antibodies (20%) are generated in the residual set of $N_t$ relative to $N_0$, that is, the part of $N_0$ outside $N_t$, forming a disturbance that prevents the network from falling into a local optimum.
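The 80%/20% initialization rule can be illustrated with a minimal sketch. It assumes that both ranges are axis-aligned boxes given as lists of `(lower, upper)` pairs, that sampling is uniform, and that the residual set is sampled by rejection; the function names, the `max_tries` guard, and the fallback for an empty residual set are hypothetical and not taken from the paper.

```python
import random

def sample_in_box(box, rng):
    """Draw one point uniformly from a box given as [(lower, upper), ...]."""
    return [rng.uniform(lo, hi) for lo, hi in box]

def inside(point, box):
    """True if every coordinate of point lies within the box bounds."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, box))

def init_population(n0, nt, size, inner_ratio=0.8, seed=0, max_tries=10_000):
    """Generate `size` antibodies: inner_ratio of them in N_t, the rest in N_0 outside N_t."""
    rng = random.Random(seed)
    n_inner = round(size * inner_ratio)
    population = [sample_in_box(nt, rng) for _ in range(n_inner)]
    tries = 0
    # Rejection sampling (an assumption): keep only candidates in N_0 but not in N_t.
    while len(population) < size and tries < max_tries:
        tries += 1
        candidate = sample_in_box(n0, rng)
        if not inside(candidate, nt):
            population.append(candidate)
    # Fallback (an assumption): if the residual set is empty or tiny, fill from N_0.
    while len(population) < size:
        population.append(sample_in_box(n0, rng))
    return population

n0 = [(0.0, 10.0), (0.0, 10.0)]   # static range of the antigens
nt = [(4.0, 6.0), (4.0, 6.0)]     # current range of the memory network
population = init_population(n0, nt, size=50)
```

With these illustrative ranges, the first 40 antibodies fall inside `nt` and the last 10 fall in the part of `n0` outside `nt`.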
Definition 5 Topology knowledge. A topological unit is the hyperrectangular region with an antibody as its center and $l_j$ as its edge length in the $j$-th dimension.
The knowledge about antibody and antigen features contained in all topological units is called topology knowledge. The topological unit represented by antibody $b_i$ can be expressed as $T_i = \{x \mid b_{i,j} - l_j/2 \le x_j \le b_{i,j} + l_j/2 \text{ for every dimension } j\}$. In the formula, $b_{i,j} - l_j/2$ and $b_{i,j} + l_j/2$ represent the lower and upper bounds, respectively, of the topological unit represented by the antibody in the $j$-th dimension. By comparing the coordinates of an antigen with those of an antibody, we can determine whether the antigen belongs to the antibody's topological unit. If an antigen $g_j$ belongs to a topological unit $T_i$, this is recorded as $g_j \in T_i$. Topological units may intersect. When an antigen belongs to more than one topological unit, the distance between the antigen and the center of each of these units is calculated, and the unit with the smallest distance is taken as the topological unit of the antigen. Because of the distribution of antibodies, some antigens may not lie in any topological unit. In this case, the distance between the antigen and the centers of all topological units is calculated, and the unit with the smallest distance is taken as its topological unit.
Once the topological unit is determined, the antigen can be mapped into it. Antibody $b_i$ is the center of topological unit $T_i$, and according to the principle of immune network clustering, the antibody is the inner image of the network. Therefore, we call antibody $b_i$ the representative point of the antigens contained in topological unit $T_i$. In particular, when a topological unit does not contain any antigens, it is deleted.
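The assignment rule for antigens can be sketched as follows. This is a minimal illustration that assumes Euclidean distance to the unit center and a single tuple of edge lengths shared by all units; the function names are hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def in_unit(point, center, edge):
    """True if point lies in the box centered at `center` with edge length edge[j] in dimension j."""
    return all(abs(x - c) <= l / 2 for x, c, l in zip(point, center, edge))

def assign_units(antigens, antibodies, edge):
    """Return, for each antigen, the index of the antibody whose topological unit it is mapped to."""
    assignment = []
    for g in antigens:
        containing = [i for i, b in enumerate(antibodies) if in_unit(g, b, edge)]
        # If no unit contains the antigen, fall back to all units, as the text describes.
        candidates = containing if containing else range(len(antibodies))
        assignment.append(min(candidates, key=lambda i: euclid(g, antibodies[i])))
    return assignment

antibodies = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
antigens = [(0.2, 0.0), (0.9, 0.1), (3.0, 3.0)]
units = assign_units(antigens, antibodies, edge=(2.0, 2.0))
```

In this toy example the first two antigens lie in two overlapping units and are mapped to the nearer center, while the third lies in no unit and is mapped to the nearest center overall.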
Definition 6 State knowledge. Without loss of generality, the data in the dataset are divided into noise, boundary, and cluster internal points, and state knowledge is used to record the different antigen states.
Topological units can be regarded as grids with knowledge characteristics. According to existing grid-based clustering methods, noise and boundary points (including fuzzy boundaries) differ significantly from the data within a cluster. The noise and boundary points of the data have, among others, the following features: the area where noise and boundary points are located is generally sparse; interior points of a class, unlike boundary points, often have close neighbors in multiple directions, that is, their uniformity is relatively good; and the density of the area where boundary points are located generally exhibits a jump. The difference between the point sets therefore lies mainly in density and uniformity: the density of noise is small, the density at the boundary is small and uneven, and the density inside a cluster is large and evenly distributed. The density of a topological unit is expressed by the number of antigens in the grid. The joint entropy method is used to measure the uniformity of the data distribution in the topological unit. For each antigen $b_{j,i}$ in topological unit $T_j$, the number of antigens in its $\varepsilon$-neighborhood is counted and recorded as $\rho_{j,i}$, with $\varepsilon = l_j/4$, where $l_j$ is the side length of $T_j$. An entropy is then defined for each $b_{j,i}$, and from these the joint entropy of all antigens in $T_j$ is obtained.
According to prior knowledge, the data can be divided into three categories, so labeling the topological units is a two-dimensional clustering problem (over density and uniformity) with a known number of categories, which can be solved well using methods such as fuzzy C-means. After the clustering is completed, the antigens in the corresponding topological units can be labeled as noise, boundary points, and cluster internal data.
When a data point is marked incorrectly, the algorithm may be guided in the wrong direction. The distribution of antibodies is random, and a clustering algorithm is not always effective, so some misclassification inevitably occurs. To limit the impact of this situation, the idea of evidence accumulation is introduced. Evidence accumulation means that 1 is added to the evidence value of an antigen if it is labeled with the same state in adjacent iterations, and 1 is subtracted from the evidence value if it is labeled with different states in adjacent iterations. Because of the randomness of the antibodies, this can greatly reduce the impact of misclassification. With this method, state knowledge records a pair $(S_i, D_i)$ for each antigen, where $S_i$ represents the state of the $i$-th antigen, $D_i$ represents the evidence for that state, and $D_i$ is equal to 1 at the initialization phase.
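The evidence accumulation rule can be sketched for a single antigen as follows. The text specifies only the +1/-1 update; the rule used here for switching the stored state once the evidence is exhausted is an assumption made for illustration, and the function name is hypothetical.

```python
def update_state(state, evidence, new_label):
    """Update one antigen's (S_i, D_i) pair with the label from the latest iteration."""
    if new_label == state:
        return state, evidence + 1      # same state as before: add 1
    evidence -= 1                       # different state: subtract 1
    if evidence <= 0:
        # Assumption (not stated in the paper): once the evidence is used up,
        # adopt the new label and restart with the initial evidence value 1.
        return new_label, 1
    return state, evidence

# One antigen, initialized as interior with evidence 1, relabeled over three iterations.
history = []
state, evidence = "interior", 1
for label in ["interior", "boundary", "boundary"]:
    state, evidence = update_state(state, evidence, label)
    history.append((state, evidence))
```

A single disagreeing label only lowers the evidence, so an occasional misclassification does not immediately change the recorded state.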

Optimal antibody search
We use topological units (hyperrectangles) to form the topology knowledge in the cultural algorithm. Topology knowledge includes two parts: antigens and antibodies. We therefore use topology knowledge to simplify the optimal antibody search.
According to topology knowledge, antibodies can be regarded as representative points of the antigens in their topological units, and the distance between the antibodies that have a high affinity with an antigen and the antigen's representative point should be small. Therefore, for an antigen $g_i$ in topological unit $T_j$, we can first use the representative antibody $b_j$ to find the $k' > k$ antibodies with the smallest distance to it, then calculate the affinity between these $k'$ antibodies and the antigen, and take the $k$ antibodies with the highest affinity as the optimal $k$ antibodies. For the other antigens belonging to $T_j$, only the second step is needed to find the optimal $k$ antibodies, which can greatly reduce the complexity.
The value of $k'$ should be greater than $k$ because there is a certain distance deviation between $b_j$ and $g_i$. In practice, the greater the difference between $k'$ and $k$, the more accurate the results, at the cost of a larger search range. Considering the uniformity of the antibody distribution in the network, $k'$ is generally taken as $3k$.
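The two-step search can be sketched as follows. Because formula (1) is not reproduced here, the sketch assumes that affinity decreases monotonically with Euclidean distance, so "highest affinity" is implemented as "smallest distance"; the function names are hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def candidates_for_unit(rep_index, antibodies, k, factor=3):
    """Step 1: indices of the k' = factor * k antibodies nearest to the representative antibody."""
    k_prime = min(factor * k, len(antibodies))
    rep = antibodies[rep_index]
    order = sorted(range(len(antibodies)), key=lambda i: euclid(antibodies[i], rep))
    return order[:k_prime]

def best_k_antibodies(antigen, candidates, antibodies, k):
    """Step 2: among the candidates, the k antibodies closest to the antigen.

    Assumption: affinity is a decreasing function of distance.
    """
    return sorted(candidates, key=lambda i: euclid(antibodies[i], antigen))[:k]

antibodies = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
# Step 1 is done once per topological unit (here the unit represented by antibody 1).
shared = candidates_for_unit(rep_index=1, antibodies=antibodies, k=1)
# Step 2 is repeated for every antigen in that unit, reusing the same candidates.
best_a = best_k_antibodies((1.8,), shared, antibodies, k=1)
best_b = best_k_antibodies((0.3,), shared, antibodies, k=1)
```

The saving comes from computing the candidate list once per unit: each additional antigen in the unit is compared with only $k'$ antibodies instead of the whole network.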

Elite learning variation
In traditional aiNet clustering, antibody improvement is achieved by clone variation, expressed by formula (11), where $\alpha$ represents the variability (mutation rate), whose value decreases as the affinity between $b_j$ and $g_i$ increases.
Formula (11) improves the antibody by reducing the distance between antibody $b_j$ and antigen $g_i$, but this method has limitations: $b_j$ only moves closer to the antigen and does not learn from other antibodies. To let the target antibody also obtain the advantageous information of outstanding antibodies, a new variation rule is formulated in which $b_0$ represents the antibody with the highest affinity with $g_i$ in the current network, and $r_1$ and $r_2$ are weighting factors satisfying $r_1 + r_2 = 1$; if $b_0 \equiv b_j$, then $r_1 = 1$. When $r_1 = 1$, the rule degenerates into the mutation strategy of the traditional algorithm.
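A minimal sketch of one possible form of the elite learning rule is given below. The exact formulas are not reproduced in this text, so the sketch assumes the common aiNet update $b_j \leftarrow b_j - \alpha (b_j - g_i)$ for the traditional rule and a weighted combination of the directions toward $g_i$ and $b_0$ for the elite rule. This form is an assumption chosen to match the stated properties ($r_1 + r_2 = 1$, and $r_1 = 1$ recovering the traditional rule), not the paper's formula.

```python
def elite_mutation(b_j, g_i, b_0, alpha, r1):
    """Move antibody b_j toward antigen g_i (weight r1) and elite antibody b_0 (weight r2 = 1 - r1).

    Assumed form: b_j' = b_j - alpha * (r1 * (b_j - g_i) + r2 * (b_j - b_0)).
    """
    if list(b_0) == list(b_j):
        r1 = 1.0                 # the elite is the antibody itself: traditional rule
    r2 = 1.0 - r1
    return [b - alpha * (r1 * (b - g) + r2 * (b - e))
            for b, g, e in zip(b_j, g_i, b_0)]

mixed = elite_mutation([0.0, 0.0], [2.0, 0.0], [0.0, 2.0], alpha=0.5, r1=0.5)
traditional = elite_mutation([0.0, 0.0], [2.0, 0.0], [0.0, 2.0], alpha=0.5, r1=1.0)
self_elite = elite_mutation([0.0, 0.0], [2.0, 0.0], [0.0, 0.0], alpha=0.5, r1=0.3)
```

Under this assumed form, `mixed` moves halfway between the two directions, whereas `traditional` and `self_elite` both move only toward the antigen.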

Compression threshold determination
There is no unified understanding of how to set the compression threshold. The general guidance is to take a very small compression threshold first, for example, $10^{-3}$, and gradually increase it as the network changes. There is little discussion of this in the existing literature. Based on the concept and theory of shadow sets, we propose an adaptive method to determine the compression threshold. Shadow sets is a theory proposed by Pedrycz to address fuzzy problems, in which the set levels 1, 0, and [0,1] are used to describe and simplify fuzzy relationships. Sample points at level 1 belong to a set completely, sample points at level 0 do not belong to the set at all, and level [0,1] indicates that it is uncertain whether the sample point belongs to the set. These three levels correspond, respectively, to the lower approximation, the complement of the upper approximation, and the complement of the lower approximation relative to the upper approximation.
The purpose of network compression is to improve the affinity between antibodies, that is, to increase the distance between antibodies and prevent the network redundancy caused by small distances. The smaller the distance, the more likely an antibody is to be compressed; the greater the distance, the less likely. Without loss of generality, we use the normalized distance to express the membership degree of whether an antibody should be compressed: the membership degree $\mu_{i,j}$ is defined as the mapping of the distance between antibodies to the closed interval [0,1]. The objective function $F(\alpha)$ is then defined, where $\alpha$ is in the range (0, 0.5]. When the value of $\alpha$ is determined, the antibodies satisfying $\mu_{i,j} \le \alpha$ need to be compressed. Under this threshold determination method, a certain number of antibodies are compressed each time, which is consistent with the actual behavior of network compression in the algorithm. In addition, $F(\alpha)$ is a simple step-like unimodal function that can be solved quickly by methods such as dichotomy (bisection).
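The threshold selection can be illustrated with a minimal sketch. Since the formulas are not reproduced in this text, the sketch makes two assumptions: the membership degree is a min-max normalization of the distances, and $F(\alpha)$ takes the form commonly used for Pedrycz's shadowed sets, that is, the absolute difference between the membership mass moved to levels 0 and 1 and the number of elements left in the shadow. It also replaces the dichotomy search with a plain grid scan for simplicity. All function names are hypothetical.

```python
def memberships(distances):
    """Assumed mapping: min-max normalization of distances to the closed interval [0, 1]."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]

def shadow_objective(mu, alpha):
    """Assumed F(alpha): |mass reduced to 0 + mass elevated to 1 - size of the shadow|."""
    reduced = sum(m for m in mu if m <= alpha)
    elevated = sum(1.0 - m for m in mu if m >= 1.0 - alpha)
    shadow = sum(1 for m in mu if alpha < m < 1.0 - alpha)
    return abs(reduced + elevated - shadow)

def best_alpha(mu, steps=500):
    """Scan alpha over a uniform grid in (0, 0.5] and return the minimizer of F."""
    grid = [0.5 * (s + 1) / steps for s in range(steps)]
    return min(grid, key=lambda a: shadow_objective(mu, a))

# Illustrative distances only, not data from the paper.
mu = memberships([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0])
alpha = best_alpha(mu)
to_compress = [i for i, m in enumerate(mu) if m <= alpha]
```

Antibody pairs whose membership degree does not exceed the selected `alpha` are the ones marked for compression.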

Immune defense
According to the traditional aiNet clustering method, regardless of the nature of the antigen, the antibody generates an immune response and then activates the antibody network. This is the main reason why "abnormal" data such as noise and fuzzy boundaries make the structure of the immune network unclear.
The immune defense mechanism means that the immune system can attack, destroy, and clear "alien components" such as bacteria, viruses, and foreign bodies, which is a very important protection mechanism for the human body. We simulate this process in the algorithm.
To defend against "alien components," we must first identify them. With the state knowledge constructed in the cultural algorithm, this is convenient: the "alien components" are the parts marked as noise and boundary in the state knowledge.
In the clustering problem, boundary data easily make the immune network structure unclear; however, if boundary data do not activate the immune network at all, the network may be unable to accurately express the distribution of the antigens.
To avoid this problem, and following the immune defense inhibition measures taken in medicine to avoid overdefense, three different methods are adopted for antigens inside clusters, at boundaries, and in noise, namely:
$g_i \in S_0$: cloning, dominant antibody selection, and mutation
$g_i \in S_1$: dominant antibody selection and mutation
$g_i \in S_2$: no operation
where $S_0$, $S_1$, and $S_2$ represent the interior, boundary, and noise antigen sets of the cluster, respectively. Noise and boundary points are thus defended against differently by the immune defense mechanism guided by state knowledge. Noise is no longer involved in the immune process and is eliminated directly. Boundary points do not participate in cloning, which avoids the generation of a large number of cloned antibodies at the boundary and prevents the network structure from blurring there. Boundary points still participate in selection and mutation to avoid excessive movement of antibodies toward the cluster centers, which would leave the antibody network with too little affinity for the boundary and lead to misclassification of boundary points.
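The three treatments can be written as a small dispatch table. This is only an illustrative sketch: the state labels and operation names are hypothetical strings standing in for the operators described in the text.

```python
INTERIOR, BOUNDARY, NOISE = "interior", "boundary", "noise"

def defense_operations(state):
    """Return the operations an antigen in the given state is allowed to trigger."""
    if state == INTERIOR:                 # g_i in S_0
        return ("clone", "select", "mutate")
    if state == BOUNDARY:                 # g_i in S_1: no cloning at the boundary
        return ("select", "mutate")
    if state == NOISE:                    # g_i in S_2: no operation
        return ()
    raise ValueError(f"unknown antigen state: {state!r}")

def active_antigens(antigens, states):
    """Drop noise antigens, which no longer take part in the immune process."""
    return [g for g, s in zip(antigens, states) if s != NOISE]

kept = active_antigens([(0.1, 0.2), (5.0, 5.0), (0.9, 0.8)],
                       [INTERIOR, NOISE, BOUNDARY])
```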

Specific steps
For the final immune network, the minimum spanning tree is generated according to its connected graph. The connection between the representative antibodies of two different clusters has a larger weight. According to the set pruning threshold, the m connections with the largest weights are removed so that m+1 clusters are obtained. The steps of the CaiNet algorithm are shown as: After the data points in these units are eliminated, the dataset is recorded as $X = \{x_1, x_2, \cdots, x_i, \cdots, x_n\}$, and the clustering is recorded as $C_1, C_2, \cdots, C_j, \cdots, C_m$. Next, the type of each data point is determined based on the distance between the data point and the antibodies.

Experiments
The running configurations include hardware settings (2.70 GHz CPU, 8.0 GB RAM) and software settings (Windows 10 and Python 3.6). Each test is executed 50 times to record the average performance.

Experimental results on a synthetic dataset
High-dimensional data are not easy to display intuitively, so we use a two-dimensional synthetic dataset to verify the proposed clustering algorithm.There are three clusters in the dataset, two of which have more samples, and the other contains fewer samples.
There are fuzzy boundaries between the three clusters, and they contain many instances of sample noise.
As shown in Fig. 2, the minimum spanning tree obtained by CaiNet can be divided into three distinct categories.The nodes of the tree can better reflect the data distribution of the dataset, and the nodes are relatively uniform.According to the algorithm, the last operation before obtaining the final minimum spanning tree is network compression.Therefore, the uniform distribution of the nodes is related to the selection of the compression threshold, which also shows that selecting the threshold using the shadow sets method is effective and can avoid the blindness of choosing a fixed compression threshold.The taboo clone method is not used in the algorithm, but the algorithm is also effective for datasets with fuzzy boundaries, indicating that the immune defense principle can achieve the same effect as the taboo clone.
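How a minimum spanning tree over the network nodes is cut into clusters can be sketched as follows. The sketch assumes a complete graph over the antibodies with Euclidean edge weights and uses Prim's algorithm with a union-find pass; the paper does not state which construction it uses, and the function names and sample points are hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mst_edges(points):
    """Prim's algorithm on the complete graph; returns a list of (weight, u, v) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    if n:
        best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                w = euclid(points[u], points[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges

def cut_into_clusters(points, m):
    """Remove the m heaviest tree edges and label the resulting m + 1 components."""
    edges = sorted(mst_edges(points))
    kept = edges[:max(len(edges) - m, 0)]
    root = list(range(len(points)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]   # path compression
            i = root[i]
        return i

    for _, u, v in kept:
        root[find(u)] = find(v)
    return [find(i) for i in range(len(points))]

nodes = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
labels = cut_into_clusters(nodes, m=2)
```

In this toy example, removing the two longest tree edges separates the six nodes into three groups.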
The algorithm clusters the noise, boundary, and normal data and explicitly eliminates the noise.From the results, we can see that most of the noise in the data can be identified by the algorithm.Since taboo cloning is not used, the new algorithm does not need to set the taboo threshold in advance, which is very convenient and effective in practice.

Experimental results on a real-world dataset
To test the clustering effect of the algorithm on actual high-dimensional data, we choose the iris, wine, and seeds UCI datasets as experimental objects.
The average correct rate represents the proportion of the data that the algorithm assigns to the correct cluster. To test the stability of the algorithm, we run each algorithm 50 times on the datasets and compute the variance of the accuracy over the 50 runs. The smaller the variance, the higher the stability of the algorithm. For the three datasets, the comparison between the CaiNet algorithm and the aiNet clustering algorithm is shown in Table 1. As seen from the table, the variance of the CaiNet algorithm is smaller than that of the aiNet clustering algorithm, and the average correct rate of the CaiNet algorithm is higher than that of the aiNet clustering algorithm, which shows that the CaiNet algorithm is more stable. On the seeds dataset, the variance of the CaiNet algorithm decreases the most, by 3.71% relative to the aiNet clustering algorithm, and its average accuracy is 5.8% higher than that of the aiNet clustering algorithm. On the wine dataset, the variance of the CaiNet algorithm is the smallest, at only 0.66%, and its average correct rate reaches 98.78%. The variance and average correct rate of the CaiNet algorithm therefore depend on the selected dataset, as does the degree to which its stability improves.
In addition, we test the convergence performance of the algorithm. For the simulation experiment, we perform 100 simulation runs. The results are shown in Figs. 3, 4, 5, and 6. As seen in the figures, the CaiNet algorithm has the best balance and the highest accuracy. The CaiNet algorithm also has a higher recall and hit rate than the other methods. Therefore, the CaiNet algorithm has better convergence performance.
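The two reported statistics can be computed as in the following sketch. The paper does not say how clusters are matched to classes or which variance estimator is used, so the sketch assumes a majority-vote matching and the population variance; the labels and run values are made-up illustrations, not results from the paper.

```python
from collections import Counter
from statistics import mean, pvariance

def correct_rate(true_labels, cluster_ids):
    """Share of samples whose class is the majority class of their cluster (assumed matching)."""
    by_cluster = {}
    for t, c in zip(true_labels, cluster_ids):
        by_cluster.setdefault(c, Counter())[t] += 1
    hits = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return hits / len(true_labels)

# Illustrative single run: six samples, two classes, two clusters.
rate = correct_rate(["a", "a", "a", "b", "b", "b"], [0, 0, 1, 1, 1, 1])

# Illustrative rates from repeated runs (hypothetical numbers).
run_rates = [1.0, 0.5]
average_rate = mean(run_rates)
rate_variance = pvariance(run_rates)   # population variance (an assumption)
```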

Discussion
We tested and evaluated our proposed CaiNet method against a baseline method, aiNet, to demonstrate the advantages of our method. However, several additional points should be noted and analyzed further, as specified below.
1. The three datasets compared in the experiments in Subsection 4.2, i.e., iris, wine, and seeds, are all small (their sample sizes are 150, 178, and 210, respectively). Therefore, in future work, we need to investigate larger and more appropriate datasets to validate the feasibility of our model and method, especially in the big data environment.
2. Although our CaiNet method performs better than the baseline aiNet method, the accuracy of the CaiNet method is still not very high (92.16%, 98.78%, 88.24%). Therefore, we need to seek more efficient improvements to refine the work in this paper.
3. Clustering is often a time-consuming task with high time complexity, which makes it poorly suited to the big data environment. Therefore, lightweight clustering methods are often required. We will further optimize our method to accommodate big data volumes.

Conclusion
In this paper, cultural knowledge is used to guide the clustering of aiNet, and the topology knowledge of the cultural algorithm is used to represent the distribution of antigens and antibodies in the space. When finding the highest-affinity antibody, an antigen only needs to search among the antibodies associated with its topological unit, which greatly reduces the complexity. Through immune defense, the flexibility of algorithm application is improved. According to the theory of shadow sets, an adaptive method to determine the compression threshold is proposed, and the traditional clonal mutation operator is improved by elite learning, which speeds up the convergence of the network. The simulation experiments show that the accuracy and stability of the improved algorithm are better than those of the baseline, which supports its effectiveness.
In the future, we will continue to refine our work by considering more complex scenarios, such as multidimensional clustering problems.In addition, how to adapt our method to big data application requirements is another open question that requires intensive study.

Fig. 1 AiNet clustering principle based on a cultural algorithm

and $\xi_3 =$ ..., where card($A$) represents the cardinality of set $A$

Fig. 2 Clustering effect on a synthetic dataset

Fig. 3 The comparison of algorithm balance
Fig. 4

Fig. 5 The comparison of algorithm recall
Fig. 6

Table 1 Comparison of clustering performance between the two algorithms for three datasets
Deng et al. EURASIP Journal on Wireless Communications and Networking (2020) 2020:168