Research on detection and integration classification based on concept drift of data stream
EURASIP Journal on Wireless Communications and Networking volume 2019, Article number: 86 (2019)
Abstract
As a new type of data, a data stream is massive, high-speed, ordered, and continuous, and data streams are widely distributed in sensor networks, mobile communication, financial transactions, network traffic analysis, and other fields. However, the inherent problem of concept drift poses a great challenge to data stream mining. Therefore, this paper proposes a dual detection mechanism to judge concept drift and, on this basis, carries out integration classification of the data stream. The system periodically inspects the data stream using the classification error rate as an index and uses the highly discriminative features of essential emerging patterns (eEPs) to help build the integrated classifiers that solve classification mining problems in the dynamic data stream environment. Experiments show that the proposed algorithm obtains better classification results while effectively coping with changes of concept.
1 Introduction
With the continuous advancement of information technology and the rapid development of computer networks, the real world generates large numbers of data streams, such as weather monitoring data, stock trading data, and network access logs. As time goes on, the amount of data constantly expands, making the data distribution unstable and prone to concept drift. Timely identification of data streams whose concepts change, together with accurate classification, has therefore become a research hotspot in data mining.
In recent years, the problem of concept drift has attracted more and more scholars' attention, and increasingly reasonable solutions have been proposed. In general, the mainstream algorithms for dealing with concept drift fall into two types: direct and indirect. Initially, the most popular algorithms used detection metrics to judge concept drift directly, such as the commonly used entropy value [1] and error rate; monitoring these metrics reveals concept changes and can even estimate the degree of drift.
Other scholars judge drift indirectly through the classification process itself. In 2001, Street proposed the SEA (Streaming Ensemble Algorithm) [2], which introduced ensemble learning to the classification of data streams with concept drift for the first time; the method responded rapidly to concept changes and was shown to adapt to data streams of any size. In 2003, the DWM (Dynamic Weighted Majority) algorithm was proposed in [3]; it dynamically adjusts the weight of each base classifier in the ensemble and effectively tracks abrupt concept drifts. Sun et al. [4, 5] proposed an online ensemble classification algorithm that updates the weight of each base classifier online and adds or deletes base classifiers by weight, thus solving the classification problem of dynamic data streams while adapting to concept drift.
Based on the research progress [6, 7] of related scholars, this paper first proposes a dual detection mechanism based on classification error to monitor concept drift, mainly through a multidimensional comprehensive judgment of the Mahalanobis distance and μ value of the data stream samples. Second, against the background of concept drift, a classification algorithm [8, 9] based on emerging patterns is proposed to improve the accuracy of the overall ensemble of classifiers. Finally, drift detection is achieved while the classifier adjusts its own performance. The remainder of this paper is organized as follows. Section 2 presents a mechanism for detecting concept drift. Section 3 introduces an integration classification algorithm based on emerging patterns, and Section 4 presents our main algorithm. Section 5 reports and analyzes the experimental results. Finally, Section 6 concludes the paper.
2 A dual concept drift detection mechanism based on error rate
2.1 Mahalanobis distance detection standard based on error rate
For high-dimensional datasets, the Mahalanobis distance has a significant computational advantage over the Euclidean distance: it accounts for the correlation between different attributes of the dataset and is independent of the measurement scale.
Suppose A = (a_{1}, a_{2}, … , a_{i}, … , a_{n}), a_{i} ≠ a_{j}; then the Mahalanobis distance between a_{i} and a_{j} is defined as \( {D}_M\left({a}_i,{a}_j\right)=\sqrt{{\left({a}_i-{a}_j\right)}^T{S}^{-1}\left({a}_i-{a}_j\right)} \).
The covariance matrix S is calculated as \( {S}_{ij}=E\left[\left({a}_i-{\mu}_i\right)\left({a}_j-{\mu}_j\right)\right] \), where μ_{i} = E(a_{i}) represents the expectation of each vector.
Therefore, for a dataset A = (A_{1}, A_{2}, … , A_{i}, … , A_{n}, …)^{T}, the data stream is processed sequentially in blocks for the convenience of the experiment, where A_{i} represents the ith data block. The classification error rate on this data block is error_{i}, defined as the average classification error rate over all data in the block. The Mahalanobis distance can then be expressed through the vector of mean values μ = (μ_{1}, μ_{2}, … , μ_{n})^{T} and the covariance matrix S, as shown in Eq. (3): \( {D}_M(A)=\sqrt{{\left(A-\mu \right)}^T{S}^{-1}\left(A-\mu \right)} \).
After this calculation, the degree of change in the error rate on each data block is obtained, which indirectly reflects the similarity of adjacent data blocks; comparing it with an experimental threshold determines whether drift has actually occurred. The further D_{M}(A) deviates from the threshold, the greater the possibility of concept drift, and the system enters the warning state.
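As an illustration of how Eq. (3) might be evaluated, the sketch below (Python with NumPy; the function name and the idea of lifting each block to a small feature vector of error statistics are our own assumptions, not part of the original system) computes the Mahalanobis distance of the newest block from the distribution of earlier blocks:

```python
import numpy as np

def mahalanobis_distance(x, history):
    """Mahalanobis distance of the newest observation x from the
    distribution of past observations (the rows of `history`).

    A scalar error rate can be lifted to a small feature vector
    (e.g. mean and variance of errors within the block) so that
    the covariance matrix S is meaningful."""
    history = np.asarray(history, dtype=float)
    mu = history.mean(axis=0)                 # per-feature mean vector
    S = np.atleast_2d(np.cov(history, rowvar=False))  # covariance matrix
    diff = np.asarray(x, dtype=float) - mu
    # pseudo-inverse guards against a singular covariance matrix
    return float(np.sqrt(diff @ np.linalg.pinv(S) @ diff))
```

The returned distance is then compared against the experimental threshold ε to decide whether to enter the warning state.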
2.2 μ detection standard based on error rate
The principle of the μ test in statistics: let X be an arbitrary sample set with first- and second-order moments EX = μ and DX = σ^2 (σ unknown). A one-sided hypothesis on X is stated as follows: the null hypothesis H_{0}: μ ≤ μ_{0} (μ_{0} a constant) and the alternative hypothesis H_{1}: μ > μ_{0}. The test level α is 0.05 or 0.01, and the value of \( \overline{X} \) is to be tested. When the number of samples n is large, the statistic \( U=\frac{\overline{X}-{\mu}_0}{S/\sqrt{n}} \), where \( \overline{X} \) is the sample mean and S is the sample standard deviation, obeys the standard normal distribution N(0, 1). For a given significance level α, there is a critical value μ_{α} satisfying P{U > μ_{α}} ≈ α.
Suppose X has n samples, of which m are misclassified; then the sample mean \( \overline{X}=m/n \) and the sample variance \( {S}^2=\overline{X}\left(1-\overline{X}\right) \). At this point, the statistic U takes the following form:
The μ test in the data stream environment is implemented on top of a given model. Owing to the particularity of data streams, the classification error rate on each data block is tested, and μ_{0} is initialized to the average classification error rate on the first i data blocks, while the data distribution is still stable. Therefore, the statistic is \( U=\frac{\mathrm{err}-{\mu}_0}{\sqrt{\mathrm{err}\left(1-\mathrm{err}\right)/n}} \). After each data block arrives, the change of the statistic U is monitored. When U ≥ μ_{α}, the classification error rate is considered to have risen significantly and concept drift has occurred; otherwise, the concepts in the current data stream remain stable.
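A minimal sketch of this test (plain Python; the function name `mu_test` and the default critical value 1.645 for α = 0.05 are illustrative choices):

```python
import math

def mu_test(err, mu0, n, mu_alpha=1.645):
    """One-sided mu (z) test on a block's classification error rate.

    err      - error rate observed on the current data block
    mu0      - baseline error rate (average over the first stable blocks)
    n        - number of samples in the block
    mu_alpha - critical value of N(0, 1); 1.645 corresponds to alpha = 0.05

    Returns (U, drifted): drift is signalled when U >= mu_alpha."""
    U = (err - mu0) / math.sqrt(err * (1.0 - err) / n)
    return U, U >= mu_alpha
```

For example, an error rate jumping from a baseline of 0.1 to 0.3 on a 100-sample block yields U ≈ 4.36, well above the α = 0.05 critical value.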
The dual detection mechanism proposed in this section classifies each data block with the current classifiers, measures the corresponding error rate, and feeds that error rate into two detectors of different character: the Mahalanobis distance and the μ test. Concept drift is concluded only when both detectors fire at the same time. The workflow of the dual concept drift detection mechanism is as follows.
Input: dataset A; data block length L; threshold ε; significance level α.
Output: classification error rate err_{i} on the ith data block; Mahalanobis distance D_{M}(A); μ test statistic U; the judgment of whether concept drift occurs.
Process:
1: Data preprocessing ← data blocks A_{1}, A_{2}, … A_{i}, A_{i + 1}, … A_{n}, …
2: Initialization: err_{i} ← 0, D_{M}(A) ← 0, U ← 0.
3: For each arriving data block A_{i}:
4: Enter the dual concept drift detection mechanism.
5: Apply the eEP-based classification algorithm to learn; return err_{i}.
6: Enter the Mahalanobis distance detection part.
7: Calculate the Mahalanobis distance by Eq. (3).
8: If D_{M}(A) > ε, raise a warning, marked as Re1.
9: Enter the μ hypothesis test module.
10: Obtain the statistic of the current data block by \( U=\frac{\mathrm{err}-{\mu}_0}{\sqrt{\mathrm{err}\left(1-\mathrm{err}\right)/n}} \).
11: If U ≥ μ_{α}, the null hypothesis of the μ test is rejected, denoted as Re2.
12: Take the intersection of the two detection results: Result = Re1 ∩ Re2.
13: If both hold, the system determines that the concepts have drifted.
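The intersection step above can be sketched as follows (Python; the function name is illustrative, and the two statistics are assumed to have already been computed as in Sections 2.1 and 2.2):

```python
def dual_drift_check(d_m, eps, u, mu_alpha=1.645):
    """Combine the two detectors of Sections 2.1 and 2.2.

    Re1: the Mahalanobis distance exceeds the threshold (warning state).
    Re2: the mu statistic shows a significant rise in the error rate.
    Concept drift is declared only when both fire: Result = Re1 ∩ Re2."""
    re1 = d_m > eps        # Mahalanobis distance detector
    re2 = u >= mu_alpha    # mu hypothesis test detector
    return re1 and re2
```

Requiring both detectors to agree trades a little detection latency for far fewer false alarms than either detector alone.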
3 Integration classification algorithm based on EP
3.1 Basic concepts
Suppose the training dataset DB consists of n samples, each with m-dimensional attributes, and that the n samples are divided into K categories C_{1}, C_{2}, … C_{K}. A pair of an attribute name and its corresponding value constitutes a data item. Let I = { i_{1}, i_{2}, …, i_{n} } denote the set of all data items; then any subset X ⊆ I is called an item set.
Definition 1: Suppose D is a subset of training set DB and records the support of item set X on D as Sup_{D}(X), which is defined as Sup_{D}(X) = Count_{D}(X)/ ∣ D∣, where Count_{D}(X) represents the number of samples containing X of D, and ∣D∣ represents the total number of samples of D.
Definition 2: For two datasets D and D^{′}, the change of the item set X from D^{′} to D is measured by the growth rate, marked as \( {\mathrm{GR}}_{D^{\prime}\to D}(X)={\mathrm{Sup}}_D(X)/{\mathrm{Sup}}_{D^{\prime }}(X) \) (taken to be 0 when both supports are 0, and ∞ when only Sup_{D′}(X) = 0).
Definition 3: Set the growth rate threshold ρ > 1. If the growth rate of the item set X from D^{′} to D satisfies \( {\mathrm{GR}}_{D^{\prime}\to D}(X)\ge \rho \), then X is called an emerging pattern (EP) from D^{′} to D, abbreviated as an EP of D.
Definition 4: If the item set X satisfies:
1) X is an EP of D;
2) the support of X in D is not less than the minimum support threshold ξ;
3) no proper subset of X satisfies conditions 1 and 2;
then X is called an essential emerging pattern (eEP), i.e., a basic EP.
3.2 Using eEP to establish base classifier
For large databases, especially high-dimensional datasets, eEPs have clear advantages over general EPs in time and space complexity. An eEP is the shortest EP, which greatly reduces the redundancy of EPs in classification.
Taking a sample S as an example, we judge its class with the theory of eEPs. Let D_{i} be the set of C_{i}-class training samples, \( {D}_i^{\prime } \) the set of non-C_{i}-class training samples, and X an eEP of the C_{i} class. If X does not appear in S, nothing can be concluded about whether S belongs to the C_{i} class. If X appears in S, then X indicates with probability \( \frac{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)}{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)+1} \) that S belongs to the C_{i} class and with probability \( \frac{1}{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)+1} \) that it does not. If \( \mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)=\infty \), these probabilities become 1 and 0, respectively.
At the same time, the eEPs of the non-C_{i} classes also contribute to determining whether S belongs to the C_{i} class. Let Y be an eEP of a non-C_{i} class that appears in S. If the growth rate of Y is large, its effect on the decision that S belongs to the C_{i} class is negligible. However, when the growth rate of Y is not too large (say \( \mathrm{GR}\left(Y,{D}_i,{D}_i^{\prime}\right)<5 \)), Y has a considerable influence on the decision. In general, we take the probability that S belongs to the C_{i} class contributed by Y as \( \frac{1}{\mathrm{GR}\left(Y,{D}_i,{D}_i^{\prime}\right)+1} \).
In order to classify the sample S, the effects of the eEPs of both the C_{i} class and the non-C_{i} classes must be considered. Therefore, the concept of membership is introduced: the possibility that S belongs to the C_{i} class is called the membership of S to C_{i}, denoted Bel(S).
For i = 1, 2, …, K, let PS(S, C_{i}) = {X ∣ X is an eEP of D_{i} and X appears in S} and NS(S, C_{i}) = {Y ∣ Y is an eEP of \( {D}_i^{\prime } \) and Y appears in S}. The membership of S to the C_{i} class is calculated by \( \mathrm{Bel}\left(S,{C}_i\right)=\sum \limits_{X\in \mathrm{PS}\left(S,{C}_i\right)}\frac{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)}{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)+1}+\sum \limits_{Y\in \mathrm{NS}\left(S,{C}_i\right)}\frac{1}{\mathrm{GR}\left(Y,{D}_i,{D}_i^{\prime}\right)+1} \).
The membership of S to each class is computed in this way, and S is then assigned to the class with the largest membership value. If that class is not unique, the tie is broken by a majority voting strategy.
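A hedged sketch of the membership computation and classification rule (Python; the data layout — itemsets as frozensets, eEPs as (itemset, growth-rate) pairs — and the function names are assumptions for illustration):

```python
def membership(sample, eeps_pos, eeps_neg):
    """Bel(S) for one class C_i: each matching eEP of C_i votes
    GR/(GR+1); each matching eEP of the non-C_i classes votes 1/(GR+1).

    sample              - frozenset of data items observed in S
    eeps_pos / eeps_neg - lists of (itemset, growth_rate) pairs"""
    bel = 0.0
    for items, gr in eeps_pos:
        if items <= sample:  # the eEP appears in S
            bel += 1.0 if gr == float("inf") else gr / (gr + 1.0)
    for items, gr in eeps_neg:
        if items <= sample:
            bel += 1.0 / (gr + 1.0)  # small vote from a non-C_i eEP
    return bel

def classify(sample, class_eeps):
    """Assign the class with the largest membership value.
    class_eeps maps label -> (eeps_of_class, eeps_of_complement)."""
    scores = {c: membership(sample, pos, neg)
              for c, (pos, neg) in class_eeps.items()}
    return max(scores, key=scores.get)
```

Note that `max` breaks ties arbitrarily here; the majority voting strategy from the text would replace it in a full implementation.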
3.3 Integrate base classifier based on eEP
Considering the temporal and streaming nature of a data stream, the research in this paper is carried out within a sliding window. Suppose SW is a fixed-size sliding window containing K basic windows; each basic window is labeled bw and has length ∣BW∣. The base classifier trained on basic window bw_{i} is denoted E_{i}.
In order to reflect the classification contribution of each base classifier in the ensemble on the test dataset, we assign a weight to each classifier and introduce a weighting method based on classification error. For a sample (x, c), where c is the true class label, the classification error of E_{i} is 1 − \( {f}_c^i(x) \), where \( {f}_c^i(x) \) is the probability, estimated by E_{i}, that x belongs to class c. The mean square error of E_{i} is therefore \( {\mathrm{MSE}}_i=\frac{1}{\left|D\right|}\sum \limits_{\left(x,c\right)\in D}{\left(1-{f}_c^i(x)\right)}^2 \).
The mean square error of a classifier making random predictions is \( {\mathrm{MSE}}_r={\sum}_cp(c){\left(1-p(c)\right)}^2 \).
From prior knowledge, MSE_{r} serves as the threshold for weighting a classifier: a base classifier is useful only if it predicts better than random. To simplify the calculation, the weight is taken as \( {w}_i={\mathrm{MSE}}_r-{\mathrm{MSE}}_i \).
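A sketch of this weighting scheme (Python; the exact weight formula is assumed here to be the common accuracy-weighted form w_i = max(0, MSE_r − MSE_i), under which classifiers no better than random receive zero weight; function names are illustrative):

```python
def mse_of_classifier(predict_proba, data):
    """MSE_i: mean of (1 - f_c^i(x))^2 over labeled samples (x, c),
    where predict_proba(x) returns a dict of class probabilities."""
    return sum((1.0 - predict_proba(x).get(c, 0.0)) ** 2
               for x, c in data) / len(data)

def random_baseline_mse(class_priors):
    """MSE_r = sum_c p(c) * (1 - p(c))^2 for a random predictor."""
    return sum(p * (1.0 - p) ** 2 for p in class_priors.values())

def weight(mse_i, mse_r):
    """Assumed form w_i = max(0, MSE_r - MSE_i): classifiers that are
    no better than random prediction receive zero weight."""
    return max(0.0, mse_r - mse_i)
```

For two balanced classes, MSE_r = 0.25, so a classifier with MSE_i = 0.04 gets weight 0.21 while one with MSE_i ≥ 0.25 is effectively discarded.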
The integration algorithm is as follows:
Input: Sup, GR; K, the total number of base classifiers; D, the data contained in the basic window bw_{k + 1}; E, the set of K base classifiers before adjusting weights.
Output: the top K base classifiers with the highest weight in E ∪ {E_{k + 1}}.
(1) Initialize K, Sup, GR;
(2) while (bw_{k + 1} arrives) {
(3) Train(D, Sup, GR); // train base classifier E_{k + 1}
(4) Calculate the error rate of E_{k + 1} on D (10-fold cross-validation);
(5) Calculate the weight w_{k + 1} corresponding to E_{k + 1} using Eqs. (7) and (8);
(6) for (E_{i} ∈ E) {
(7) E_{i} ← Train(E_{i}, D);
(8) Calculate the MSE_{i} of E_{i} on D; // Formula (1)
(9) Calculate the corresponding weight w_{i} of E_{i}; // Formula (2)
(10) }
(11) Output the top K classifiers by weight in E ∪ {E_{k + 1}};
(12) }
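The final top-K selection over E ∪ {E_{k+1}} might look like this (Python; representing each classifier as a (model, weight) pair is an assumption for illustration):

```python
def update_ensemble(ensemble, new_member, K):
    """Keep the K highest-weighted members of E ∪ {E_{k+1}}.

    ensemble   - list of (classifier, weight) pairs
    new_member - (classifier, weight) for the freshly trained E_{k+1}"""
    pool = ensemble + [new_member]
    pool.sort(key=lambda cw: cw[1], reverse=True)  # by weight, descending
    return pool[:K]
```

Keeping the pool size fixed at K bounds both memory and prediction cost as the stream evolves.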
4 Integration system under the environment of data stream with concept drift
In order to handle ensemble classification in the data stream environment, this paper proposes WUDCDD, a weighted classification and update algorithm for data streams based on concept drift detection, to better adapt to concept change. The specific process is as follows:

(1) Building an integration classifier. A base classifier is constructed on each basic window with eEPs as the classification factor, and K such base classifiers form the integrated classifier E. When the sliding window reaches the (K + 1)th basic window, the base classifier E_{k + 1} is trained and the classification error rate of every base classifier E_{i} is calculated. The classifiers are then weighted with the method of Section 3.3, and the K base classifiers with the highest weights are selected as the output.

(2) Concept drift detection. The data stream in each basic window is divided into data blocks, and the eEP-based classification algorithm is used to learn a model and obtain its classification error rate. When a new data block arrives, the current ensemble classifies it, and the Mahalanobis distance between the classification error rates of the previous and current data blocks is calculated. If the distance exceeds the threshold ε, a concept change is judged likely and the warning state is entered. On this basis, the hypothesis test is carried out; if the classification error rate on the new data block has risen significantly, the system concludes that the concept has drifted.

(3) Updating classifiers. The ensemble weights each base classifier by its classification error rate. If the drift detection module determines that concept drift has occurred, the data blocks in the current window are used as a training set and each base classifier is relearned. The weights of the relearned base classifiers are then compared, and old base classifiers are selectively eliminated or retained while the total number of base classifiers remains unchanged, so that the updated system better fits the current data stream environment.
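Putting the pieces together, one WUDCDD detection step per data block could be sketched as follows (Python; the one-dimensional distance over block error rates, the `state` dictionary, and the default thresholds are all simplifying assumptions rather than the authors' exact implementation):

```python
def process_block(block_err, n, state, eps=1.5, mu_alpha=1.645):
    """One WUDCDD detection step for a new data block.

    block_err - classification error rate of the ensemble on the block
    n         - number of samples in the block
    state     - dict with "err_history" (past block error rates) and
                "mu0" (baseline error rate from the first stable blocks)

    Returns True when both detectors agree (Result = Re1 ∩ Re2),
    which is the signal to retrain and reweight the base classifiers."""
    history = state["err_history"]
    if history:
        mean = sum(history) / len(history)
        var = sum((e - mean) ** 2 for e in history) / len(history) or 1e-12
        d_m = abs(block_err - mean) / var ** 0.5  # 1-D analogue of Eq. (3)
    else:
        d_m = 0.0
    denom = max((block_err * (1.0 - block_err) / n) ** 0.5, 1e-12)
    u = (block_err - state["mu0"]) / denom        # mu-test statistic
    history.append(block_err)
    return d_m > eps and u >= mu_alpha
```

When the function returns True, step (3) above — relearning and selectively replacing base classifiers — is triggered.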
5 Experimental results and discussion
5.1 Dataset
The artificial data stream simulates changing concepts with a rotating hyperplane. A hyperplane in d-dimensional space is the set of points x satisfying \( \sum \limits_{i=1}^d{w}_i{x}_i={w}_0 \), where x_{i} is the ith coordinate of x. Samples satisfying \( \sum \limits_{i=1}^d{w}_i{x}_i>{w}_0 \) are marked positive, and samples satisfying \( \sum \limits_{i=1}^d{w}_i{x}_i<{w}_0 \) are marked negative. Time-varying concepts are simulated by smoothly changing the orientation of the hyperplane through the weights w_{i}. In the experiment, the training set size is 10,000, the test set size is 1000, there are 10 dimensions in total, each dimension takes 4 distinct values, and the noise rate is 5%.
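A sketch of such a generator (Python; the function names and the continuous uniform attribute distribution are assumptions — the actual experiment discretizes each dimension into 4 values):

```python
import random

def hyperplane_sample(weights, w0, noise=0.05, rng=random):
    """Draw one labeled sample from a d-dimensional hyperplane stream:
    x ~ U[0, 1]^d, labeled +1 iff sum_i w_i * x_i > w0, with the label
    flipped with probability `noise`."""
    x = [rng.random() for _ in weights]
    label = 1 if sum(w * xi for w, xi in zip(weights, x)) > w0 else -1
    if rng.random() < noise:
        label = -label  # inject label noise
    return x, label

def drift_weights(weights, dims, delta):
    """Rotate the hyperplane by shifting the weights in `dims` by delta,
    simulating gradual concept drift between sample draws."""
    return [w + delta if i in dims else w for i, w in enumerate(weights)]
```

Calling `drift_weights` every fixed number of samples (e.g. every 1000, as in experiment 1) produces the gradual rotation used in the evaluation.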
6 Results and discussion
In the following section, we compare the accuracy of the proposed algorithm WUDCDD, G_{K} (a single classifier trained on a sliding window of size K), and EC4.5 (an integration of classifiers based on the single classifier C4.5) under different conditions. Accuracy is compared from four aspects: (1) the influence of the basic window size on classification accuracy, (2) the impact of the sliding window size on accuracy, (3) the effect of the number of drifting dimensions on accuracy, and (4) the influence of the dimensionality of the data stream on accuracy.
Experiment 1: testing the variation of classification accuracy with the basic window size (BW). The experiment sets 2 drifting dimensions and changes the weights every 1000 samples. Figure 1 shows the average classification accuracy of each algorithm as the basic window size ranges over [250, 1500].
As Fig. 1 shows, the proposed algorithm is more effective than the corresponding single classifier G_{K}. When BW ≤ 250 × 3, the accuracy of WUDCDD is higher than that of EC4.5; when 250 × 3 ≤ BW ≤ 250 × 6, the two are comparable. The accuracy decline of WUDCDD over [250 × 4, 250 × 6] is less pronounced than that of EC4.5 because each model is incrementally updated before the weight of each base classifier is calculated, which adapts better to concept drift.
When the basic window is small, every algorithm classifies well, because the window contains less concept drift and the data distribution is more stable. However, if the window is too small, accuracy drops because there is not enough data to train the base classifiers; in that case the base classifier can be improved by lowering the support threshold. When the window is too large, it becomes difficult to detect whether drift occurs, which also hurts performance.
Experiment 2: testing the variation of classification accuracy with the sliding window size; the dataset parameters are the same as in experiment 1. Figure 2 shows the average accuracy under different basic windows for sliding windows of 2, 4, 6, and 8 basic windows.
Figure 2 shows that as the sliding window grows, the accuracy of WUDCDD and EC4.5 increases continuously, and both outperform G_{K}, since G_{K} does not adapt well to concept drift. Moreover, the performance of WUDCDD and EC4.5 rises rapidly at first and then more slowly: once drift is being detected well, adding base classifiers has only a weak effect on classification performance. When SW < 8, WUDCDD is slightly better than EC4.5 because its single classifier performs better than C4.5; when K > 8, the two are close.
Experiment 3: testing the effect of the number of drifting dimensions on accuracy. Figure 3 shows the results when 2, 4, 6, and 8 dimensions drift. With increasing drift, the accuracy of every algorithm first drops sharply and then stabilizes. G_{K} is affected most because it has no mechanism for handling drift. When the number of drifting dimensions is in [2, 4], the performance difference between WUDCDD and EC4.5 is very small; in [4, 8], the accuracy of EC4.5 decreases more markedly, because its decision tree is not incrementally adjusted as new data arrive, whereas WUDCDD always maintains the most discriminative eEPs and constantly adjusts them to reflect the characteristics of the data.
Experiment 4: testing the effect of the total number of dimensions on accuracy, with BW = 250 and K = 6. The experimental results are shown in Fig. 4.
The curves in the figure show that accuracy decreases as the dimensionality of the dataset increases. More dimensions produce a large number of eEPs, but their support and growth rates generally decrease, which reduces the discriminative power of the eEPs and thus the classification ability, so the accuracy of WUDCDD declines. As the dimensionality grows, the number of classification rules of EC4.5 also increases, which likewise lowers its accuracy. When the accuracy of WUDCDD drops, it can be adjusted by lowering the support threshold.
7 Conclusions
How to train models from massive data so as to effectively predict future data streams has become a hot topic. Traditional data classification algorithms cannot be applied directly in the data stream environment; therefore, this paper introduces the eEP classification algorithm into the field of data stream classification and proposes an algorithm for detection and integration classification of data streams with concept drift. Comparison with two other algorithms shows that the proposed algorithm adapts better to data streams with concept drift and attains better classification accuracy, bearing comparison with the integration algorithm based on C4.5. Finally, the experimental results indicate that the update strategy of the algorithm within the sliding window needs further research and improvement before it can be applied to more specific data mining fields.
Abbreviations
eEP: essential emerging pattern
WUDCDD: weighted classification and update algorithm of data stream based on concept drift detection
References
P. Vorburger, A. Bernstein, in International Conference on Data Mining. Entropy-based concept shift detection (2006), pp. 1113–1118
W.N. Street, in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. A streaming ensemble algorithm (SEA) for large-scale classification (2001), pp. 377–382
J.Z. Kolter, M.A. Maloof, in Proceedings of the 3rd IEEE International Conference on Data Mining. Dynamic weighted majority: A new ensemble method for tracking concept drift (2003), pp. 123–130
Y. Sun, G.J. Mao, X. Liu, C.N. Liu, Concept drift mining in data stream based on multi-classifier. J. Autom. 34(1), 93–97 (2008)
L.L. Minku, A.P. White, X. Yao, The impact of diversity on online ensemble learning in the presence of concept drift. IEEE Trans. Knowl. Data Eng. 22(5), 730–742 (2010)
K. Nishida, K. Yamauchi, in Proceedings of International Conference on Discovery Science. Detecting concept drift using statistical testing (Springer-Verlag, Berlin Heidelberg, 2007), pp. 264–269
R. Elwell, R. Polikar, Incremental learning of concept drift in nonstationary environments. IEEE Trans. Neural Netw. 22(10), 1517–1531 (2011)
M. Fan, M.X. Liu, H.L. Zhao, A classification algorithm based on essential emerging patterns. Comput. Sci. 31(11), 211–214 (2004)
L. Duan, C.J. Tang, N. Yang, C. Gou, Research and application progress of contrast mining based on emerging patterns. J. Comput. Appl. 32(02), 304–308 (2012)
Acknowledgements
Not applicable
Funding
This paper is supported by Natural Youth Science Foundation of China (61401310) and Tianjin Science Foundation (18JCYBJC86400).
Availability of data and materials
All data generated or analyzed during this study are included in this published article.
Contributions
BJZ analyzed and proposed the significance of current data mining and data analysis and carried out the experimental verification. YDC analyzed the experimental data and was a major contributor to writing the manuscript. Both authors read and approved the final draft.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Zhang, B., Chen, Y. Research on detection and integration classification based on concept drift of data stream. J Wireless Com Network 2019, 86 (2019). https://doi.org/10.1186/s13638-019-1408-2