Cultural algorithms use a belief space and a population space for double-layer evolution. Different types of knowledge are formed in the belief space from the trial-and-error experience of the population space and are then used to guide the evolution of the population space. The designed algorithm structure is shown in Fig. 1.
Background knowledge
In this algorithm, three kinds of knowledge are defined. Normative knowledge defines the interval range of the antigens and of each generation of antibodies and provides a behavioral rule for evolution. Topological knowledge expresses the distribution of antigens and antibodies in the search space and provides suggestions for immune recognition that help guide the expansion of the network in different regions of the space. State knowledge records the different states that the antigens may be in and is used to control the strength with which antigens in different states activate the network.
Definition 1 Antibody-antigen affinity. Antibody-antigen affinity is the measurement of affinity between the antibody and antigen and is described in detail in formula (1).
$$ f\left(\mathbf{g},\mathbf{b}\right)=\frac{1}{1+\left\Vert \mathbf{g}-\mathbf{b}\right\Vert } $$
(1)
In the formula, ‖⋅‖ represents the Euclidean distance, G represents the antigen collection, and gi represents a single data sample in G. Bk represents the kth immune network, that is, the antibody collection, and bk,j represents the jth antibody in the kth network.
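Formula (1) can be checked numerically with a minimal Python sketch. The function name `affinity` and the tuple representation of antigens and antibodies are assumptions made for illustration only.

```python
import math

def affinity(g, b):
    """Antibody-antigen affinity of formula (1): 1 / (1 + ||g - b||)."""
    return 1.0 / (1.0 + math.dist(g, b))

# Distance 5 between the two points, so the affinity is 1/6.
print(affinity((0.0, 0.0), (3.0, 4.0)))
# Identical points give the maximum affinity of 1.
print(affinity((1.0, 2.0), (1.0, 2.0)))
```

The affinity lies in (0, 1] and decreases monotonically as the Euclidean distance grows.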
Definition 2 Antibody-antibody affinity. Antibody-antibody affinity is expressed by the Euclidean distance di,j between the antibodies, which can form the affinity matrix \( {D}_k={\left({d}_{i,j}\right)}_{N_k\times {N}_k} \) of the network, and Nk represents the number of antibody neurons in the k-th network.
Definition 3 Clone operation. The clone operation selects a part of the antibody with a high affinity to copy. For antibody bi, the clone operation can be expressed as:
$$ C\left({b}_i\right)=\left[{b}_{i,0},{b}_{i,1},\ldots ,{b}_{i,n-1}\right],\kern0.5em n=\mathrm{Int}\left({N}_c\times \frac{A_i}{\sum \limits_{j=1}^N{A}_j}\right) $$
(2)
where Nc represents the total antibody size after cloning, Ai represents the affinity of the ith antibody, and N represents the number of antibodies participating in the clone operation.
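As an illustration of formula (2), the following sketch computes the number of clones per antibody. It assumes that Int denotes truncation toward zero; the function name `clone_counts` is hypothetical.

```python
def clone_counts(affinities, n_c):
    """Clones per antibody, n = Int(N_c * A_i / sum_j A_j) as in formula (2).

    Assumption: Int() truncates, so the counts may sum to less than n_c.
    """
    total = sum(affinities)
    return [int(n_c * a / total) for a in affinities]

# Three antibodies whose affinities are in the ratio 2:1:1, with N_c = 8.
print(clone_counts([0.5, 0.25, 0.25], 8))  # [4, 2, 2]
```

Antibodies with higher affinity receive proportionally more clones, which is the selection pressure the definition describes.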
Definition 4 Normative knowledge. Normative knowledge records the spatial range of antibody production. One range is the value interval of each dimension of the antigens, and the other is the value interval of each dimension of the antibody neurons of the memory network. They are represented by N0 and Nt, respectively, and are formally defined as:
$$ {N}_0=\left\{\;{l}_1,{u}_1;{l}_2,{u}_2;\ldots ;{l}_n,{u}_n\;\right\} $$
(3)
$$ {N}_t=\left\{\;{l}_1^t,{u}_1^t;{l}_2^t,{u}_2^t;\ldots ;{l}_m^t,{u}_m^t\;\right\} $$
(4)
where li represents the lower bound, ui represents the upper bound, and i represents the ith dimension. N0 is static knowledge and does not change throughout the clustering process; Nt is dynamic knowledge, which changes with each network change. The superscript t of \( {l}_i^t \) and \( {u}_i^t \) in Nt represents the number of iterations.
In the internal image of antigens, antibody neurons are generally distributed in the space determined by all antigens; therefore, the antibody population should lie within the space determined by N0 during initialization. As the network evolves, it should gradually converge, because such a network is more refined and its clustering is more obvious. To achieve this goal, Nt is used to guide the initialization of the antibody population. When the population is initialized, most of the antibodies (80%) are generated in the space specified by Nt. To avoid a suboptimal solution, the remaining antibodies (20%) are generated in the complement of Nt relative to N0 (the part of N0 outside Nt), forming a disturbance that prevents the network from falling into a local optimum.
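The 80%/20% initialization can be sketched as follows. This is only an illustration under stated assumptions: N0 and Nt are given as lists of (lower, upper) pairs over the same dimensions, sampling is uniform, and the region of N0 outside Nt is sampled by rejection. All function names are hypothetical.

```python
import random

def sample_box(box, rng):
    """Draw one point uniformly from a box given as [(lower, upper), ...]."""
    return [rng.uniform(lo, hi) for lo, hi in box]

def inside(point, box):
    """True if the point lies within the box in every dimension."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, box))

def init_population(n0, nt, size, rng, inner_fraction=0.8, max_tries=1000):
    """Generate inner_fraction of the antibodies in N_t, the rest in N_0 \\ N_t."""
    n_inner = round(inner_fraction * size)
    population = [sample_box(nt, rng) for _ in range(n_inner)]
    while len(population) < size:
        # Rejection sampling; if N_t covers almost all of N_0, the last
        # candidate is kept after max_tries attempts (assumption).
        for _ in range(max_tries):
            candidate = sample_box(n0, rng)
            if not inside(candidate, nt):
                break
        population.append(candidate)
    return population

rng = random.Random(0)
n0 = [(0.0, 10.0), (0.0, 10.0)]   # static antigen range
nt = [(4.0, 6.0), (4.0, 6.0)]     # current memory-network range
population = init_population(n0, nt, 10, rng)
print(len(population))  # 10
```

With these example ranges, the first eight antibodies fall inside Nt and the last two act as the disturbance outside it.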
Definition 5 Topology knowledge. The topological unit refers to the hyperrectangular region with an antibody as the center and lj as the edge length of the jth dimension. The knowledge about antibody and antigen features contained in all topological units is called topology knowledge. The topological unit represented by antibody bi can be expressed as:
$$ {T}_i=\left\{{b}_{i,1}-\frac{l_1}{2},{b}_{i,1}+\frac{l_1}{2};{b}_{i,2}-\frac{l_2}{2},{b}_{i,2}+\frac{l_2}{2};\ldots ;{b}_{i,m}-\frac{l_m}{2},{b}_{i,m}+\frac{l_m}{2}\right\} $$
(5)
In the formula, \( {b}_{i,j}-\frac{l_j}{2} \) and \( {b}_{i,j}+\frac{l_j}{2} \) represent the lower and upper bounds, respectively, of the topological unit represented by antibody bi in the jth dimension. By comparing the coordinates of an antigen with these bounds, we can determine whether it belongs to a topological unit. If an antigen gj belongs to a topological unit Ti, it is recorded as gj ∈ Ti. Topological units may intersect. When an antigen belongs to more than one topological unit, the distance between the antigen and the center of each of these units is calculated, and the unit with the smallest distance is taken as the topological unit of the antigen. Due to the distribution of antibodies, some antigens may not lie in any topological unit. In this case, the distance between the antigen and the centers of all topological units is calculated, and the unit with the smallest distance is taken as its topological unit.
When the topological unit is determined, the antigen can be mapped into it. Antibody bi is the center of topological unit Ti, and, according to the principle of immune network clustering, an antibody is an internal image of the antigens. Therefore, we call antibody bi the representative point of the antigens contained in topological unit Ti. In particular, when a topological unit does not contain any antigens, it is deleted.
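The assignment rule described above (containment first, nearest center as tie-breaker and as fallback, empty units deleted) can be sketched in a few lines. The names `in_unit` and `assign_units` and the dictionary output are assumptions for illustration.

```python
import math

def in_unit(point, center, edges):
    """True if the point lies in the unit centered at `center` (formula (5))."""
    return all(abs(x - c) <= l / 2 for x, c, l in zip(point, center, edges))

def assign_units(antigens, antibodies, edges):
    """Map each antigen to one topological unit.

    Returns {antibody index: [antigen indices]}. Units that receive no
    antigen never appear in the result, which corresponds to deleting them.
    """
    units = {}
    for gi, g in enumerate(antigens):
        containing = [bi for bi, b in enumerate(antibodies) if in_unit(g, b, edges)]
        # Several containing units, or none at all: pick the nearest center.
        candidates = containing or range(len(antibodies))
        nearest = min(candidates, key=lambda bi: math.dist(g, antibodies[bi]))
        units.setdefault(nearest, []).append(gi)
    return units

antibodies = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
antigens = [(0.4, 0.0), (0.9, 0.2), (4.0, 4.0)]
print(assign_units(antigens, antibodies, edges=(2.0, 2.0)))  # {0: [0], 1: [1, 2]}
```

In this example the first two antigens lie in two overlapping units and are resolved by distance, the third lies in no unit and falls back to the nearest center, and the unit of the third antibody is dropped because it stays empty.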
Definition 6 State knowledge. Without loss of generality, the data in the dataset are divided into noise, boundary, and cluster internal points, and state knowledge is used to record the different antigen states.
Topological units can be regarded as grids with knowledge characteristics. According to existing grid-based clustering methods, noise and boundary points (including fuzzy boundaries) differ significantly from the data within a cluster. The noise and boundary points of the data have been found to exhibit features that include, but are not limited to, the following: the area where noise and boundary points are located is generally sparse; cluster-interior points, unlike boundary points, often have close neighbors in multiple directions, that is, their uniformity is relatively good; and the density of the area where boundary points are located generally shows a jump. The different point sets therefore differ mainly in density and uniformity: the density of noise is small, the density at the boundary is small and uneven, and the density inside a cluster is large and evenly distributed. The density is expressed by the number of antigens in the grid, i.e.,
$$ {\rho}_j=\sum \limits_{b_i\in {T}_j}1 $$
(6)
The joint entropy method is used to measure the uniformity of the data distribution in the topological unit. For each antigen bj,i in the topological unit Tj, the number of antigens in its ε-neighborhood is counted and recorded as ρj,i, where \( \varepsilon ={l}_j/4 \) and lj is the side length of Tj. The ratio
$$ {p}_{j,i}=\frac{\rho_{j,i}}{\rho_j} $$
(7)
is then recorded, and the entropy of bj,i can be expressed as:
$$ {H}_{j,i}=-{p}_{j,i}\log {p}_{j,i} $$
(8)
Furthermore, the joint entropy of all antigens in Tj is obtained as:
$$ {H}_j=\sum \limits_{b_{j,i}\in {T}_j}{H}_{j,i}=-\sum \limits_{b_{j,i}\in {T}_j}{p}_{j,i}\log {p}_{j,i} $$
(9)
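Formulas (6) to (9) can be combined into one small routine. The sketch below assumes a single side length for the unit, a Euclidean ε-neighborhood, and that each antigen counts itself as a neighbor (so that p is never zero); these choices and the function name are assumptions, not details given in the text.

```python
import math

def unit_density_entropy(points, edge_length):
    """Density (6) and joint entropy (9) of the antigens in one topological unit."""
    rho = len(points)                      # formula (6)
    eps = edge_length / 4                  # neighborhood radius l_j / 4
    entropy = 0.0
    for p in points:
        # Neighbors within eps, counting p itself (assumption).
        rho_i = sum(1 for q in points if math.dist(p, q) <= eps)
        p_i = rho_i / rho                  # formula (7)
        entropy -= p_i * math.log(p_i)     # formulas (8) and (9)
    return rho, entropy

# Four antigens spread over the corners of a square, unit side length 4.
spread = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(unit_density_entropy(spread, 4.0))
# Three coincident antigens: every p is 1, so the entropy is 0.
packed = [(1.0, 1.0)] * 3
print(unit_density_entropy(packed, 4.0))
```

The resulting pair (density, entropy) is the two-feature description of a unit that the subsequent labeling step works on.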
According to prior knowledge, the data fall into three categories, so labeling the topological units is a two-dimensional clustering problem (over the density ρj and the entropy Hj) with a known number of categories, which can be solved well using methods such as fuzzy C-means. After the clustering is completed, the antigens in the corresponding topological units can be labeled as noise, boundary points, or cluster internal data.
When a data point is labeled incorrectly, the algorithm may be guided in the wrong direction. The distribution of antibodies is random, and a clustering algorithm is not always effective; therefore, misclassification can always occur. To limit its impact, the idea of evidence accumulation is introduced. Evidence accumulation means that 1 is added to the evidence value of an antigen if it receives the same state label in two adjacent iterations, and 1 is subtracted from the evidence value if it receives different state labels in two adjacent iterations. Because of the randomness of the antibodies, this can greatly reduce the impact of misclassification. According to the above methods, state knowledge can be expressed as:
$$ S=\left\{\kern0.5em {S}_1,{D}_1;{S}_2,{D}_2;\ldots ;{S}_i,{D}_i;\ldots ;{S}_n,{D}_n\kern0.5em \right\} $$
(10)
where Si represents the state of the ith antigen, Di represents the evidence of the state, and Di is equal to 1 at the initialization phase.
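The evidence update can be sketched as below. The text specifies only the +1/−1 rule; the decision to switch the stored state to the new label once the evidence is used up (and to reset the evidence to 1) is an assumption made here to complete the example, as are the function name and the string labels.

```python
def update_state_knowledge(states, evidence, new_labels):
    """One evidence-accumulation step over all antigens (formula (10))."""
    for i, label in enumerate(new_labels):
        if label == states[i]:
            evidence[i] += 1           # same state as before: strengthen
        else:
            evidence[i] -= 1           # different state: weaken
            if evidence[i] <= 0:
                # Assumption: adopt the new label once evidence runs out.
                states[i] = label
                evidence[i] = 1
    return states, evidence

states = ["interior", "noise"]
evidence = [1, 1]                      # D_i = 1 at initialization
update_state_knowledge(states, evidence, ["interior", "boundary"])
print(states, evidence)                # ['interior', 'boundary'] [2, 1]
update_state_knowledge(states, evidence, ["boundary", "boundary"])
print(states, evidence)                # ['interior', 'boundary'] [1, 2]
```

In the second step the first antigen keeps its state despite one conflicting label, which is the damping effect that evidence accumulation is meant to provide.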
AiNet clustering based on a cultural algorithm
Optimal antibody search
We use topological units (hyperrectangles) to form the topology knowledge in the cultural algorithm. Topology knowledge covers both antigens and antibodies, so we use it to simplify the search for the optimal antibodies.
According to topological knowledge, an antibody can be regarded as the representative point of the antigens in its topological unit, and the distance between the antibodies with a high affinity for such an antigen and its representative point should be small. Therefore, we can first use the representative antibody to find the k' > k antibodies with the smallest distance to it, then calculate the affinity between these k' antibodies and the antigen, and take the k antibodies with the highest affinity as the optimal k antibodies. Its pseudocode is:
For other antigens belonging to Tj, only step 2 is needed to find the optimal k antibodies, which can greatly reduce the complexity.
The value of k' should be greater than k because there is a certain distance deviation between bj and gi. In practice, the greater the difference between k' and k, the more accurate the results, at the cost of a larger search range. Considering the uniformity of the antibody distribution in the network, k' is generally taken as \( 3k/2 \).
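The two-stage search described in this subsection can be sketched as follows. This is one reading of the prose, not a reproduction of the pseudocode: stage 1 is done once per topological unit, stage 2 once per antigen. The function names are hypothetical, and rounding 3k/2 up to an integer is an assumption.

```python
import math

def affinity(g, b):
    """Formula (1)."""
    return 1.0 / (1.0 + math.dist(g, b))

def candidate_antibodies(rep, antibodies, k):
    """Stage 1: indices of the k' = ceil(3k/2) antibodies nearest to the
    representative antibody `rep` of a topological unit."""
    k_prime = min(len(antibodies), math.ceil(3 * k / 2))
    order = sorted(range(len(antibodies)),
                   key=lambda j: math.dist(rep, antibodies[j]))
    return order[:k_prime]

def best_k(antigen, antibodies, candidates, k):
    """Stage 2: the k candidates with the highest affinity for the antigen."""
    ranked = sorted(candidates,
                    key=lambda j: affinity(antigen, antibodies[j]),
                    reverse=True)
    return ranked[:k]

antibodies = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
rep = antibodies[1]                       # representative point of the unit
cands = candidate_antibodies(rep, antibodies, k=2)
print(cands)                              # [1, 0, 2]
print(best_k((1.4,), antibodies, cands, k=2))  # [1, 2]
```

Because the candidate list depends only on the representative antibody, it can be reused for every antigen in the same unit, so only the k' affinities of stage 2 are computed per antigen instead of one per antibody in the network.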
Elite learning variation
In traditional aiNet clustering, antibody improvement is achieved by clone variation, expressed as
$$ {b}_j={b}_j-\alpha\;\left({b}_j-{g}_i\right) $$
(11)
where α represents the mutation rate, whose value decreases as the affinity between bj and gi increases. Formula (11) improves the antibody by reducing the distance between antibody bj and antigen gi, but this method has limitations: bj only moves closer to the antigen and does not learn from other antibodies. To let the target antibody also acquire the advantageous information of outstanding antibodies, the following mutation rule is formulated:
$$ {b}_j={b}_j-\alpha \left[\;{r}_1\left({b}_j-{g}_i\right)+{r}_2\left({b}_j-{b}_0\right)\;\right] $$
(12)
where b0 represents the antibody with the highest affinity for gi in the current network, and r1 and r2 are weighting factors satisfying r1 + r2 = 1; if b0 ≡ bj, then r1 = 1. When r1 = 1, formula (12) degenerates into the mutation strategy of the traditional algorithm, formula (11).
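A minimal sketch of formula (12) is given below, with antibodies and antigens as coordinate lists. The function name and argument order are assumptions; α and r1 are passed in directly because the text does not give their schedules.

```python
def elite_mutation(b_j, g_i, b_0, alpha, r1):
    """Elite learning mutation of formula (12).

    b_j: antibody to mutate, g_i: antigen, b_0: best antibody for g_i.
    r2 is derived from r1 so that r1 + r2 = 1.
    """
    if list(b_0) == list(b_j):
        r1 = 1.0                       # b_j is already the elite antibody
    r2 = 1.0 - r1
    return [bj - alpha * (r1 * (bj - g) + r2 * (bj - b0))
            for bj, g, b0 in zip(b_j, g_i, b_0)]

b_j, g_i, b_0 = [2.0, 2.0], [0.0, 0.0], [1.0, 1.0]
print(elite_mutation(b_j, g_i, b_0, alpha=0.5, r1=0.5))  # [1.25, 1.25]
print(elite_mutation(b_j, g_i, b_0, alpha=0.5, r1=1.0))  # [1.0, 1.0], same as formula (11)
```

With r1 < 1, the mutated antibody is pulled partly toward the antigen and partly toward the elite antibody b0.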
Compression threshold determination
There is no unified understanding of how to set the compression threshold. The general guidance is to take a very small compression threshold first, for example, \( {10}^{-3} \), and gradually increase it as the network changes. There is little discussion of this in the existing literature. According to the concept and theory of shadow sets, we propose an adaptive method to determine the compression threshold.
Shadow sets (also known as shadowed sets) are a theory proposed by Pedrycz to address fuzzy problems, in which the three levels 1, 0, and [0,1] are used to describe and simplify fuzzy relationships. Sample points at level 1 belong to the set completely, sample points at level 0 do not belong to the set at all, and level [0,1] indicates that it is uncertain whether the sample point belongs to the set. These three levels correspond, respectively, to the lower approximation, the complement of the upper approximation, and the complement of the lower approximation relative to the upper approximation.
The purpose of network compression is to improve the affinity between antibodies, that is, to increase the distance between antibodies and prevent the network redundancy caused by small distances. The smaller the distance, the more likely an antibody is to be compressed; the greater the distance, the less likely. Without loss of generality, we use the normalized distance to express the possibility membership degree of whether an antibody should be compressed. This membership is defined as the mapping of the distance between antibodies onto the closed interval [0,1], expressed by the formula:
$$ {u}_{i,j}=\frac{d_{i,j}-{d}_{\mathrm{min}}}{d_{\mathrm{max}}-{d}_{\mathrm{min}}},\kern0.5em i,j=1,2,\ldots ,n $$
(13)
where dmin and dmax denote the minimum and maximum distances between antibodies. The objective function is defined as:
$$ \underset{\alpha }{\arg \min }F\left(\alpha \right)=\left|\;{\xi}_1+{\xi}_2-{\xi}_3\;\right|,\kern1em \alpha \in \left(0,0.5\right) $$
(14)
where α is restricted to the range given in formula (14), \( {\xi}_1=\sum \limits_{u_{i,j}\le \alpha }{u}_{i,j} \), \( {\xi}_2=\sum \limits_{u_{i,j}\ge 1-\alpha}\left(1-{u}_{i,j}\right) \), and ξ3 = card(I) is the cardinality of the set I = {(i, j) | α < ui,j < 1 − α}. Once the value of α is determined, the antibodies satisfying ui,j ≤ α need to be compressed.
Obviously, according to this threshold determination method, a certain number of antibodies are compressed each time, which is consistent with the actual situation of network compression in the algorithm. In addition, F(α) is a simple step-like unimodal function whose minimum can be found quickly by methods such as dichotomy (bisection).
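The threshold selection of formulas (13) and (14) can be sketched as follows. For simplicity the sketch replaces the dichotomy search with a plain grid search over (0, 0.5), which is a swapped-in technique; the function names, the flat list of pairwise distances, and the grid resolution are assumptions, and the distances are assumed not to be all equal.

```python
def memberships(distances):
    """Formula (13): normalize pairwise antibody distances to [0, 1]."""
    d_min, d_max = min(distances), max(distances)
    return [(d - d_min) / (d_max - d_min) for d in distances]

def objective(u, alpha):
    """F(alpha) = |xi1 + xi2 - xi3| from formula (14)."""
    xi1 = sum(x for x in u if x <= alpha)
    xi2 = sum(1 - x for x in u if x >= 1 - alpha)
    xi3 = sum(1 for x in u if alpha < x < 1 - alpha)
    return abs(xi1 + xi2 - xi3)

def compression_threshold(u, steps=100):
    """Grid search for the alpha in (0, 0.5) that minimizes F(alpha)."""
    grid = [0.5 * s / steps for s in range(1, steps)]
    return min(grid, key=lambda a: objective(u, a))

u = memberships([1.0, 2.0, 3.0, 4.0, 5.0])   # -> [0.0, 0.25, 0.5, 0.75, 1.0]
alpha = compression_threshold(u)
compress = [x for x in u if x <= alpha]      # memberships marked for compression
print(alpha, compress)
```

Since the smallest membership is always 0, at least one antibody pair falls below any positive α, which matches the remark that some antibodies are compressed at every step.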
Immune defense
According to the traditional aiNet clustering method, regardless of the nature of the antigen, the antibody can generate an immune response and then activate the antibody network. This is the main reason why “abnormal” data such as noise and fuzzy boundaries blur the structure of the immune network.
The immune defense mechanism means that the immune system can attack, destroy, and clear “alien components” such as bacteria, viruses, and foreign bodies, which is a very important protection mechanism for the human body. We simulate this process in the algorithm.
To defend against “alien components,” we must first identify them. With the state knowledge constructed in the cultural algorithm, this is straightforward: the alien components are the antigens marked as noise or boundary in the state knowledge.
In the clustering problem, boundary data easily blur the structure of the immune network. However, if boundary data are simply prevented from activating the immune network, the network may be unable to accurately express the distribution of the antigens.
To avoid this problem, and by analogy with the inhibition measures used in medicine to avoid excessive immune defense, antigens in the noise, boundary, and cluster-interior states are treated in three different ways, namely,
$$ \left\{\begin{array}{ll}{g}_i\in {S}^0,& \text{cloning, dominant antibody selection, and mutation}\\ {}{g}_i\in {S}^1,& \text{dominant antibody selection and mutation}\\ {}{g}_i\in {S}^2,& \text{no operation}\end{array}\right. $$
(15)
where S0, S1, and S2 represent the cluster-interior, boundary, and noise antigen sets, respectively. Noise and boundary points are defended against differently by the immune defense mechanism guided by state knowledge. Noise no longer takes part in the immune process and is eliminated directly. Boundary points do not take part in cloning, which avoids generating a large number of cloned antibodies at the boundary and prevents the network structure from blurring there. Boundary points still take part in selection and mutation so that antibodies do not move excessively toward the cluster centers; otherwise, the affinity between the boundary and the antibody network would be too low, leading to misclassification of boundary points.
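The case distinction in formula (15) amounts to a small dispatch on the antigen state. The sketch below only returns the names of the operations to apply; the state constants and operation names are hypothetical placeholders, not identifiers from the algorithm.

```python
INTERIOR, BOUNDARY, NOISE = 0, 1, 2   # states S^0, S^1, S^2

def immune_operations(state):
    """Operations applied to an antigen according to formula (15)."""
    if state == INTERIOR:
        return ("clone", "select", "mutate")
    if state == BOUNDARY:
        return ("select", "mutate")    # no cloning at the boundary
    return ()                          # noise: no operation

for state in (INTERIOR, BOUNDARY, NOISE):
    print(state, immune_operations(state))
```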
Specific steps
For the final immune network, the minimum spanning tree is generated from its connected graph. The connections between the representative antibodies of two different clusters have larger weights. According to the set pruning threshold, the m connections with the largest weights are removed so that m+1 clusters are obtained. The steps of the CaiNet algorithm are shown as:
After the data points in these units are eliminated, the dataset is recorded as X = {x1, x2, ⋯, xi, ⋯, xn}, and the clusters are recorded as C1, C2, ⋯, Cj, ⋯, Cm. Next, the cluster of each data point is determined from the distance between the data point and the antibodies, that is,
$$ {d}^2\left({x}_i,{b}_l\right)=\min \left\{{d}^2\left({x}_i,{b}_k\right),k=1,2,\ldots ,m\;\right\},{b}_l\in {C}_j\Rightarrow {x}_i\in {C}_j $$
(16)
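The minimum-spanning-tree pruning and the nearest-antibody assignment of formula (16) can be sketched together. The sketch assumes a complete Euclidean graph over the antibodies, takes the number m of removed edges as given (with 0 ≤ m ≤ number of antibodies − 1) rather than deriving it from a pruning threshold, and uses hypothetical function names.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph: (weight, i, j) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min((math.dist(points[a], points[b]), a, b)
                      for a in in_tree for b in range(n) if b not in in_tree)
        edges.append((w, i, j))
        in_tree.add(j)
    return edges

def cluster_antibodies(antibodies, m):
    """Remove the m heaviest MST edges and label the m + 1 components."""
    kept = sorted(mst_edges(antibodies))[:len(antibodies) - 1 - m]
    label = list(range(len(antibodies)))
    for _, i, j in kept:               # merge the two components of each kept edge
        old, new = label[j], label[i]
        label = [new if x == old else x for x in label]
    return label

def assign(points, antibodies, labels):
    """Formula (16): each point takes the cluster of its nearest antibody."""
    return [labels[min(range(len(antibodies)),
                       key=lambda k: math.dist(x, antibodies[k]))]
            for x in points]

antibodies = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
labels = cluster_antibodies(antibodies, m=1)
print(labels)                                                  # [0, 0, 0, 3, 3]
print(assign([(0.5, 0.5), (10.5, 10.0)], antibodies, labels))  # [0, 3]
```

Removing the single long edge between the two antibody groups yields two clusters, and each data point inherits the cluster label of the antibody closest to it.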