Data mining technology
In recent years, with the development of data collection and storage technologies, the data of the information society has exploded, creating a situation of “rich data, poor information.” Massive data not only makes useful data difficult to identify but also increases the complexity of data analysis. Data mining technology arose to solve this problem [18, 19]. Its aim is to transform the large amounts of data available across society into useful knowledge and information for market analysis, fraud and intrusion detection, production control, and scientific exploration.
In general, data mining can be divided into the following seven steps (Fig. 2); a minimal sketch in code follows the list:
(1) Data cleansing: eliminating noise and data unrelated to the mining theme.
(2) Data integration: combining data from multiple data sources.
(3) Data selection: selecting the data related to the mining topic.
(4) Data transformation: using techniques such as normalization to transform the data into forms suitable for mining.
(5) Data mining: the core step, in which knowledge is mined using methods such as classification, fusion, and association rules.
(6) Pattern evaluation: evaluating the effect of the model; commonly used indicators include accuracy and recall.
(7) Knowledge representation: presenting the mined knowledge to the user in an understandable form.
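For concreteness, the seven steps can be illustrated with a minimal sketch in Python; the file names, column names, and choice of classifier below are assumptions made for the example rather than part of the process described above.

```python
# A minimal sketch of the seven data mining steps using pandas/scikit-learn.
# File names, column names, and the model choice are illustrative assumptions.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# (1)-(2) Data cleansing and integration: merge two sources, drop incomplete rows.
a = pd.read_csv("source_a.csv")          # hypothetical data source A
b = pd.read_csv("source_b.csv")          # hypothetical data source B
data = a.merge(b, on="id").dropna()      # integrate, then remove incomplete rows

X, y = data.drop(columns=["id", "label"]), data["label"]

# (3) Data selection: keep the features most related to the mining topic.
X = SelectKBest(f_classif, k=10).fit_transform(X, y)

# (4) Data transformation: normalize features into [0, 1].
X = MinMaxScaler().fit_transform(X)

# (5) Data mining: train a classifier on the prepared data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = SVC().fit(X_tr, y_tr)

# (6)-(7) Pattern evaluation and knowledge representation: report the results.
print(classification_report(y_te, model.predict(X_te)))
```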
Time series data mining algorithm
The increasing volume of imbalanced data in recent years has brought new challenges to data mining research. When a classification model is built on an unbalanced data set, misclassifying a minority-class sample is more costly than misclassifying a majority-class sample, so traditional classification methods are not applicable. Current research on unbalanced data sets is concentrated on two main approaches [20]: undersampling the majority class, and manually generating additional minority-class samples; neither approach is applicable to time series data. Ojha set different classification penalty parameters for different categories of samples when building the classification model and proposed the weighted support vector machine WSVM [21]. In addition, Ram et al. [13] proposed a feature-weighted support vector machine FWSVM [22]: the information gain is first used to evaluate the importance of each feature for the classification task and to assign weights, and the weight coefficients are then applied in the calculation of the SVM kernel function. Building on the sample-weighted and feature-weighted support vector machines, a dual-weighted support vector machine (DWSVM) is proposed, which retains the advantages of both WSVM and FWSVM.
In the DWSVM algorithm proposed in this paper, different misclassification penalty coefficients are first assigned to the different sample categories when constructing the classification hyperplane; the weight of each feature is then calculated and the kernel function is reconstructed accordingly, which gives the algorithm better generalization and robustness.
Introducing different penalty factors
For the two-class problem, the majority class is taken as the negative class and the minority class as the positive class. Different types of samples are given different misclassification penalty coefficients:
$$ {\displaystyle \begin{array}{l}\min\ f(x)=\frac{1}{2}{\omega}^T\omega +{C}^{+}\sum \limits_{y_i=1}{\xi}_i+{C}^{-}\sum \limits_{y_i=-1}{\xi}_i\\ {}s.t.\kern0.5em {y}_i\left({\omega}^T\phi \left({x}_i\right)+b\right)\ge 1-{\xi}_i\\ {}{\xi}_i\ge 0,\kern0.5em i=1\dots T\end{array}} $$
(1)
Here, C+ is the penalty factor for misclassifying positive-class (minority) samples and C− is the penalty factor for misclassifying negative-class (majority) samples. Using convex quadratic programming from optimization theory, problem (1) is converted into its Wolfe dual:
$$ {\displaystyle \begin{array}{l}\underset{\alpha }{\max}\sum \limits_{i=1}^l{\alpha}_i-\frac{1}{2}\sum \limits_{i=1}^l\sum \limits_{j=1}^l{\alpha}_i{\alpha}_j{y}_i{y}_j\phi \left({x}_i,{x}_j\right)\\ {}s.t.\kern0.5em \sum \limits_{i=1}^l{\alpha}_i{y}_i=0\\ {}0\le {\alpha}_i\le {C}^{+},\kern0.5em {y}_i=1\\ {}0\le {\alpha}_i\le {C}^{-},\kern0.5em {y}_i=-1\end{array}} $$
(2)
Here, ϕ(xi, xj) is the kernel function.
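The two penalty factors C+ and C− in Eq. (1) can, for example, be realized through per-class penalty weights. A minimal sketch using scikit-learn's class_weight parameter is shown below; this implementation route and the toy data are assumptions for illustration, not the implementation used in this paper.

```python
# Minimal sketch: different misclassification penalties per class (Eq. 1),
# approximated with per-class weights: the effective penalty is C * weight.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Imbalanced toy data: 200 majority (-1) samples, 20 minority (+1) samples.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (20, 2))])
y = np.hstack([-np.ones(200), np.ones(20)])

# class_weight scales the penalty: here C+ (for +1) is ten times C- (for -1).
clf = SVC(kernel="rbf", C=1.0, class_weight={-1: 1.0, 1: 10.0})
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```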
Kernel function based on feature weighting
Feature weighting assigns a different weight coefficient to each feature of a sample according to its contribution to pattern recognition. Many feature weighting methods exist, such as the information-weight-based method proposed by Wang Yan et al.; here, a feature weighting method based on manifold learning is adopted. The weighting matrix P is defined as follows:
$$ P=\left(\begin{array}{ccc}{\chi}_{11}& {\chi}_{12}& {\chi}_{13}\\ {}{\chi}_{21}& {\chi}_{22}& {\chi}_{23}\\ {}{\chi}_{31}& {\chi}_{32}& {\chi}_{33}\end{array}\right) $$
(3)
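As an illustration of how such a weighting matrix might be constructed, the following sketch builds a diagonal P from per-feature weights; mutual information is used here as a stand-in for the manifold-learning (MBFS) weighting, purely as an assumption for the example.

```python
# Minimal sketch: build a diagonal feature-weighting matrix P from per-feature
# weights. Mutual information stands in for the paper's MBFS weighting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)

w = mutual_info_classif(X, y, random_state=0)   # weight of each feature
w = w / w.sum()                                  # normalize the weights
P = np.diag(w)                                   # weighting matrix P (cf. Eq. 3)
print(P)
```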
The role of the kernel function is to find nonlinear patterns by using linear functions in the feature space established by a nonlinear feature map [23]. According to Theorems 1 and 2, introducing the matrix P scales the geometry of the input space and of the feature space, thereby changing the weights assigned to the different linear functions in the feature space during modeling.
Theorem 1: Let K be a kernel function defined on X × X, let ϑ be the mapping from the input space X to the feature space F, and let P be a linear transformation matrix. Then
$$ \parallel \vartheta \left({\chi}_i\right)-\vartheta \left({\chi}_j\right)\parallel \ne \parallel \vartheta \left({\chi}_i\right)\parallel -\parallel \vartheta \left({\chi}_j\right)\parallel $$
(4)
Theorem 2: If ωk = 0 for some 1 ≤ k ≤ h, then the kth feature of the data set does not take part in the calculation of the weighted kernel function and has no effect on the output of the classifier. The smaller ωk is, the smaller its influence on the calculation of the kernel function and on the classification result.
Therefore, introducing P into the Gaussian radial basis kernel function gives the following formula.
$$ {\displaystyle \begin{array}{l}f(x)=\operatorname{sgn}\left(\sum {\alpha}_i{y}_i\ k\left(\parallel {x}_i^TP-{x}^TP\parallel \right)+b\right)\\ {}k\left(\parallel {x}_i^TP-{x}_j^TP\parallel \right)=\exp \left\{-\frac{{\parallel {x}_i^TP-{x}_j^TP\parallel}^2}{\sigma^2}\right\}\end{array}} $$
(5)
Using the weighted kernel function of Eq. (5) in the support vector machine classifier yields a classification model with weighted sample features.
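One possible realization of Eq. (5) is sketched below: the inputs are transformed by P and a Gaussian RBF kernel is evaluated on the transformed inputs, which is then supplied to the SVM as a custom kernel. The diagonal P and the value of σ are assumed for the example.

```python
# Minimal sketch of Eq. (5): a Gaussian RBF kernel computed on P-transformed
# inputs, plugged into an SVM as a custom kernel.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics.pairwise import euclidean_distances

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
P = np.diag([0.4, 0.3, 0.2, 0.1])   # feature-weighting matrix (assumed values)
sigma = 1.0

def weighted_rbf(A, B):
    """k(x_i, x_j) = exp(-||x_i^T P - x_j^T P||^2 / sigma^2)."""
    d = euclidean_distances(A @ P, B @ P, squared=True)
    return np.exp(-d / sigma ** 2)

clf = SVC(kernel=weighted_rbf).fit(X, y)
print("training accuracy:", clf.score(X, y))
```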
Dual-weighted support vector machine
Following the previous two sections, the construction steps of the dual-weighted support vector machine (DWSVM) are as follows (a compact sketch in code follows the steps):
Step 1. Collect the data X; its feature set is (fs, n), where n is the number of features, i.e., the features of X are fs: (f1, f2, …, fn).
Step 2. Calculate the weight coefficient ωi of each feature fi by the MBFS method, and generate the linear transformation weight matrix P based on the ωi.
Step 3. Transform the Gaussian kernel function with the linear transformation weighting matrix P to obtain the feature-weighted kernel function of Eq. (5).
Step 4. Construct the structural risk minimization function of Eq. (1) based on the weighting of the two sample classes, and incorporate the kernel function of Eq. (5) into the construction of the classification hyperplane to establish the support vector machine classification model.
Step 5. Evaluate the obtained classifier.
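A compact sketch of Steps 1 to 5 follows, under stated assumptions: mutual information stands in for the MBFS feature weights, P is taken to be diagonal, and the two misclassification penalties are supplied through per-class weights. It is an illustration of the procedure, not the paper's implementation.

```python
# Compact sketch of Steps 1-5 (assumptions: mutual information replaces MBFS,
# diagonal P, per-class weights supply the two misclassification penalties).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import recall_score

# Step 1: collect data with features (f1, ..., fn); here an imbalanced toy set.
X, y = make_classification(n_samples=500, n_features=6, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Step 2: weight each feature and build the linear transformation matrix P.
w = mutual_info_classif(X_tr, y_tr, random_state=0)
P = np.diag(w / (w.sum() + 1e-12))

# Step 3: feature-weighted Gaussian kernel (Eq. 5).
def weighted_rbf(A, B, sigma=1.0):
    return np.exp(-euclidean_distances(A @ P, B @ P, squared=True) / sigma ** 2)

# Step 4: class-weighted hyperplane (Eq. 1) combined with the weighted kernel.
clf = SVC(kernel=weighted_rbf, C=1.0, class_weight={0: 1.0, 1: 9.0})
clf.fit(X_tr, y_tr)

# Step 5: evaluate (sensitivity/specificity as defined in the next section).
y_hat = clf.predict(X_te)
sens = recall_score(y_te, y_hat, pos_label=1)
spec = recall_score(y_te, y_hat, pos_label=0)
print("G-mean:", np.sqrt(sens * spec))
```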
Unbalanced classification evaluation index
In the process of establishing the SVM model, we repeatedly tune the SVM parameters (including the kernel function type t, the penalty parameter c, the kernel parameter g, and the weighting coefficients w0 and w1) to obtain better prediction results. Because classification accuracy alone cannot measure classification performance on unbalanced data, appropriate model evaluation indicators must also be selected according to actual needs. In addition to the classification accuracy Acc, the evaluation indicators G-mean, F-measure, and AUC-ROC are also selected.
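As an illustration of this tuning, the following sketch searches over the penalty parameter, the kernel parameter, and the minority-class weight using a G-mean scorer; the grid values and the choice of scorer are assumptions for the example.

```python
# Minimal sketch: tuning SVM parameters (C, gamma, class weight) with a scorer
# suited to unbalanced data; grid values and the scorer are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, recall_score

def g_mean(y_true, y_pred):
    sens = recall_score(y_true, y_pred, pos_label=1)
    spec = recall_score(y_true, y_pred, pos_label=0)
    return np.sqrt(sens * spec)

X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
grid = {"C": [0.1, 1, 10],
        "gamma": ["scale", 0.1, 1.0],
        "class_weight": [{0: 1, 1: 5}, {0: 1, 1: 10}]}
search = GridSearchCV(SVC(), grid, scoring=make_scorer(g_mean), cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```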
The classification accuracy Acc is the proportion of correctly classified samples among all samples. Let M be the total number of samples and TM the number of correctly classified samples. The classifier's Acc is defined as:
$$ \mathrm{Acc}=\frac{TM}{M} $$
(6)
The TP rate and FP rate of the classifier are defined as follows. The TP rate of the classifier is:
$$ TP\ \mathrm{rate}=\frac{TP}{TP+ FN} $$
(7)
Define the FP rate of the classifier:
$$ FP\ \mathrm{rate}=\frac{FP}{FP+ TN} $$
(8)
Next, the sensitivity and specificity of the classifier are defined. Define the sensitivity of the classifier:
$$ \mathrm{sensitivity}= TP/\left( TP+ FN\right) $$
(9)
Define the specificity of the classifier:
$$ \mathrm{specificity}= TN/\left( TN+ FP\right) $$
(10)
By definition, sensitivity is the positive-class accuracy and specificity is the negative-class accuracy. Based on these two indicators, Kubat et al. proposed the G-mean metric to evaluate unbalanced classification, given by the following definition.
Define the classifier’s G-means:
$$ G\hbox{-} \mathrm{mean}=\sqrt{\mathrm{sensitivity}\times \mathrm{specificity}} $$
(11)
By definition, G-mean takes into account the accuracy on both the positive and negative classes and thus better reflects the overall performance of the classifier. Many researchers use G-mean as the measure when evaluating unbalanced classification performance.
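The following sketch computes Acc, TP rate, FP rate, sensitivity, specificity, and G-mean (Eqs. (6) to (11)) from a confusion matrix; the labels are made up for illustration.

```python
# Minimal sketch computing Eqs. (6)-(11) from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # made-up labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])   # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

acc = (tp + tn) / (tp + tn + fp + fn)    # Eq. (6): TM / M
tp_rate = tp / (tp + fn)                 # Eq. (7), equal to sensitivity (9)
fp_rate = fp / (fp + tn)                 # Eq. (8)
specificity = tn / (tn + fp)             # Eq. (10)
g_mean = np.sqrt(tp_rate * specificity)  # Eq. (11)
print(acc, tp_rate, fp_rate, specificity, g_mean)
```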
In some applications, more attention is paid to the classification performance on the positive class, for example credit card fraud detection, customer churn and arrears forecasting in telecommunications, abnormal states in intrusion detection, and disease monitoring in medical diagnosis. F-measure is mainly used to measure the classification effect on positive samples.
First, the definitions of the classifier precision and recall are given.
Define the classifier’s precision:
$$ \mathrm{precision}=\frac{TP}{TP+ FP} $$
(12)
Define the classifier’s recall:
$$ \mathrm{recall}=\frac{TP}{TP+ FN} $$
(13)
By definition, precision is the accuracy of the samples predicted as positive, and recall is the coverage of the positive-class samples. Based on these two indicators, F-measure is given by the following definition:
Define the F-measure of the classifier:
$$ F\hbox{-} \mathrm{measure}=\frac{\left(\beta +1\right)\times \mathrm{precision}\times \mathrm{recall}}{\beta \times \mathrm{precision}+\mathrm{recall}} $$
(14)
Usually β = 1. It can be seen from the above definition that F-measure reflects the classification performance on the positive class and is a trade-off between positive-class coverage and accuracy. Many researchers use F-measure as a measure when evaluating unbalanced classification performance.
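The sketch below computes precision, recall, and F-measure (Eqs. (12) to (14)) with β = 1 on made-up labels.

```python
# Minimal sketch of Eqs. (12)-(14) with beta = 1; the labels are illustrative.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)                          # Eq. (12)
recall = tp / (tp + fn)                             # Eq. (13)
beta = 1.0
f_measure = ((beta + 1) * precision * recall /
             (beta * precision + recall))           # Eq. (14), F1 when beta = 1
print(precision, recall, f_measure)
```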
Define the AUC of the classifier:
$$ \mathrm{AUC}(f)=\frac{\sum \limits_{i=1}^{n^{+}}\sum \limits_{j=1}^{n^{-}}I\left(f\left({x}_i^{+}\right)>f\left({x}_j^{-}\right)\right)}{n^{+}\times {n}^{-}} $$
(15)
Here, n+ is the number of samples in the minority class and n− is the number of samples in the majority class. For each pair consisting of a minority-class sample x+ and a majority-class sample x−, the indicator I(f(x+) > f(x−)) equals 1 if the classification algorithm f scores the minority sample higher than the majority sample, that is, if it is more inclined to assign that sample to the minority class, and 0 otherwise. Summing this indicator over all such pairs and dividing by the product n+ × n− gives the AUC value.
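The pairwise counting in Eq. (15) can be illustrated directly: in the sketch below the decision scores are made up, and the AUC is the fraction of minority/majority pairs in which the minority sample receives the higher score.

```python
# Minimal sketch of Eq. (15): AUC as the fraction of minority/majority pairs
# ranked correctly by the decision scores; the scores here are made up.
import numpy as np

scores_pos = np.array([0.9, 0.7, 0.4])        # f(x+) for minority samples
scores_neg = np.array([0.8, 0.3, 0.2, 0.1])   # f(x-) for majority samples

# Count pairs where the minority sample is scored above the majority sample.
wins = (scores_pos[:, None] > scores_neg[None, :]).sum()
auc = wins / (len(scores_pos) * len(scores_neg))
print("AUC:", auc)   # 10 of 12 pairs ranked correctly -> about 0.833
```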