A feature selection method based on synonym merging in text classification system
© The Author(s) 2017
Received: 31 July 2017
Accepted: 27 September 2017
Published: 5 October 2017
As an important task in natural language processing (NLP), text classification has been widely used in many fields, such as spam filtering, news classification, and web page detection. The vector space model (VSM) is generally used to extract the feature vectors that represent texts, which is crucial for text classification. In this paper, a feature selection algorithm based on synonym merging, named SM-CHI, is proposed. It uses an improved CHI formula together with synonym merging to select feature words, so that classification accuracy can be improved and the feature dimension reduced. In addition, for the feature words selected by SM-CHI, this paper presents three weight calculation algorithms to explore the best feature weight update method. Finally, we designed three comparative experiments and showed that classification accuracy is highest when the improved CHI formula (2) is chosen, the threshold α is set to 0.8, and the largest weight among the synonyms is used to update the feature weight.
With the development of the Internet, the amount of Chinese text information shows an exponential growth trend. How to effectively manage massive collections of Chinese documents and mine the information they contain has become a critical research problem. Automatic text classification can carry out this text processing effectively. It also plays an important role in natural language processing (NLP) and data mining.
From Fig. 1, we know that the first step in Chinese text classification is to preprocess the text, including word segmentation, part-of-speech tagging, and removal of stop words. The purpose is to remove useless words and keep only the nouns, adjectives, and verbs that carry category information. After this, the text can be represented as a vector to form the VSM. Then, we use the feature selection method to select the feature words that can characterize the text categories, and merge synonyms to reduce dimensions. Next, the TF-IDF method is used to calculate the weight of each feature of each text, transforming the text into a feature vector. Finally, by training the Bayesian classifier on the sample data, we obtain the final text classifier.
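The pipeline above can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: the toy pre-tokenized corpus stands in for jieba segmentation plus stop-word removal, and a frequency cutoff stands in for CHI-based selection.

```python
import math
from collections import Counter, defaultdict

# Toy pre-tokenized corpus (stands in for segmentation + stop-word removal).
train = [(["stock", "market", "fund"], "Finance"),
         (["cpu", "chip", "software"], "IT"),
         (["stock", "fund", "bank"], "Finance"),
         (["software", "code", "chip"], "IT")]

def select_features(docs, k=4):
    """Stand-in for CHI-based selection: keep the k most frequent words."""
    counts = Counter(w for words, _ in docs for w in words)
    return [w for w, _ in counts.most_common(k)]

class NaiveBayes:
    """Minimal multinomial Naive Bayes over the selected feature words."""
    def fit(self, docs, features):
        self.features = set(features)
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        for words, label in docs:
            self.class_counts[label] += 1
            for w in words:
                if w in self.features:
                    self.word_counts[label][w] += 1
        return self

    def predict(self, words):
        best, best_lp = None, float("-inf")
        total = sum(self.class_counts.values())
        for c, n_c in self.class_counts.items():
            lp = math.log(n_c / total)                      # class prior
            denom = sum(self.word_counts[c].values()) + len(self.features)
            for w in words:
                if w in self.features:                      # Laplace smoothing
                    lp += math.log((self.word_counts[c][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

features = select_features(train)
clf = NaiveBayes().fit(train, features)
print(clf.predict(["stock", "fund"]))   # prints: Finance
```

The real system additionally weights features with TF-IDF before classification; the sketch uses raw counts to stay short.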
We presented a new feature selection algorithm named SM-CHI based on an improved CHI  formula and synonym merging to achieve efficient feature selection and dimension reduction.
The choice of the threshold α (0≤α≤1) is critical. Only the most similar feature words are merged when α is close to 1, so we use grid search to find the optimal α. The results show that classification accuracy is highest when α is equal to 0.8.
We proposed three improved weight calculation methods based on TF-IDF. The experimental results show that using the maximum value of the synonym group as the feature weight is the best way.
The organization of this paper is as follows. Next section discusses some related works. The text classification system based on semantic similarity is introduced in Section 3. We present the details of SM-CHI and the corresponding weight calculation methods in Section 4. In Section 5, we show and discuss the experimental results. We conclude our work and describe future work in Section 6.
2 Related works
Text classification has been widely used in the classification and labeling of news and web pages, and there are many works on it. Classification methods based on synonym merging have attracted much attention in the past 2 years. Next, we briefly introduce some of the existing research results.
2.1 Classification model
Nowadays, most text classification methods are based on VSM, where texts are represented in the form (feature vector, label). In this way, we can transform each text into a feature vector to construct the training and testing datasets. Training a classifier for text classification is a long-standing machine learning problem. Many machine learning algorithms can be used as the classifier, such as Naive Bayes, k-Nearest Neighbor, Decision Trees, Support Vector Machines (SVM), Markov Models, and Neural Networks. Among them, the Naive Bayes classifier is simple, effective, and well suited to VSM, so we choose it as the classification method in this paper.
2.2 Feature selection algorithm
In order to construct a valid classifier, feature word selection is another main problem. Common feature selection methods include document frequency (DF), mutual information (MI), the χ² statistic (CHI), information gain (IG) [6, 13], and term strength (TS) [14, 15]. The works in [2, 3] show through contrast experiments that CHI is the best feature selection method. However, the CHI method also has shortcomings. For example, the CHI values of high-frequency words are very high even when those words contribute nothing to class distinction. Therefore, the authors of [4, 5] presented two improved CHI formulas from different perspectives to make up for these shortcomings of the original CHI method.
2.3 Synonym merging
With the development of text classification, some researchers have started to propose text classification systems based on synonym merging to improve classification accuracy. Word similarity calculation methods based on "Tong YiCi Cilin" and "HowNet" were proposed in [17, 18], respectively. The work in  proposed a text feature selection method based on "TongYiCi Cilin" to reduce the data's feature dimensions while preserving data integrity and classification accuracy. A semantic kernel is used with SVM for text classification to improve accuracy in [20, 21]. What is more, other works in [22–24] present excellent ideas that are worth learning from when dealing with large-scale text classification; they can help speed up the computation through big data technology.
The model proposed in this paper is a text classification model based on synonym merging, named SM-CHI. The difference from  is that we merge synonyms after CHI-based feature selection and propose three improved weighting methods for the merged feature words.
3 Text classification model based on semantic similarity
In this section, we introduce the text classification model based on semantic similarity. Section 3.1 describes the feature selection method based on the χ² statistic; Section 3.2 introduces the synonym merging method; Section 3.3 presents the traditional weight calculation method, TF-IDF.
3.1 Improved feature selection algorithm
where N is the size of the training set; A is the number of documents that belong to class c and contain the word t; B is the number of documents that do not belong to class c but contain the word t; C is the number of documents that belong to the class c but do not contain the word t; and D is the number of documents that do not belong to class c and do not contain the word t.
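The equation itself did not survive extraction; for reference, the standard χ² statistic assembled from these four counts, term-by-term consistent with the definitions above, is:

```latex
\[
\chi^2(t,c) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}
\]
```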
where A+B represents the number of documents that contain word t and N represents the total number of documents. In this case, the CHI values of high-frequency words that appear in all categories are close to zero, so they will not be selected as feature words.
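The improved formula is also missing from the extracted text. A form consistent with the surrounding description (a log factor that vanishes as the document frequency A+B approaches N, suppressing words that appear everywhere) would be, as an assumption rather than the authors' exact equation:

```latex
\[
\chi^2_{\mathrm{imp}}(t,c) = \chi^2(t,c)\,\log\!\frac{N}{A+B}
\]
```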
In the formula, m is the total number of categories and tf(t,c) is the frequency of the word t in the category c.
3.2 Synonym merging algorithm based on “Tong YiCi Cilin”
Some of the feature words selected by the CHI formula may be identical or have similar meanings; they have the same effect on class distinction. If synonyms are merged, not only is classification accuracy improved, but the dimension of the feature space is also reduced, which improves the efficiency of the algorithm. For example, "GanMao", "ZhaoLiang", and "ShangFeng" are all synonyms of "cold" in Chinese. If articles in the "Health Care" category contain these words separately, the feature words for text classification carry much redundant information. We use synonym merging to deal with this.
The concrete similarity calculation method is introduced in detail in . When the similarity of two words is greater than the threshold α, they are regarded as a pair of synonyms and merged. The optimal value of α will be discussed later in the experiments. In addition, all merged synonyms are stored in a list; nested lists are used to store feature words so that all synonym information remains in the feature vector. To calculate the feature vector of each document, we propose three improved methods, which are discussed in Section 3.3.
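A minimal sketch of this threshold-based merging with nested lists follows. The `similarity` function here is a hypothetical stand-in for the Cilin-based measure, and the greedy group-joining strategy is our assumption about the merge order, not the authors' exact procedure.

```python
def merge_synonyms(words, similarity, alpha=0.8):
    """Greedily merge words whose pairwise similarity exceeds alpha.

    Returns a nested list: each inner list is one group of merged
    synonyms, which together form one dimension of the feature vector.
    """
    groups = []
    for w in words:
        for group in groups:
            # Join the first existing group whose representative is similar enough.
            if similarity(group[0], w) > alpha:
                group.append(w)
                break
        else:
            groups.append([w])          # no similar group found: start a new one
    return groups

# Hypothetical toy similarity: words sharing a first letter count as synonyms.
toy_sim = lambda a, b: 1.0 if a[0] == b[0] else 0.0
print(merge_synonyms(["cold", "chill", "fever"], toy_sim))
# prints: [['cold', 'chill'], ['fever']]
```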
3.3 Weight calculation method
As we mentioned above, the TF-IDF method is used to calculate the weight of each feature word in each text.
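As a reminder of how TF-IDF assigns weights, here is a standard formulation in a few lines; the paper does not specify its exact normalization or smoothing, so those details are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency of each word
    weights = []
    for d in docs:
        tf = Counter(d)
        # weight = (term frequency, length-normalized) * log inverse document frequency
        weights.append({w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf})
    return weights

docs = [["stock", "fund", "stock"], ["chip", "code"], ["stock", "chip"]]
w = tfidf(docs)
```

Note that "fund" in the first document outweighs "stock" despite a lower term frequency, because it is rarer across the corpus; this is exactly the discriminative effect TF-IDF is meant to capture.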
4 Algorithm description
4.1 Feature selection method based on synonym merging
where LF(t) denotes whether the word t remains in the word bag, decided mainly by part of speech and the stop-word list: if t is a stop word, or its part of speech is not verb, noun, or adjective, then LF(t)=0; otherwise LF(t)=1. CHI(t) represents the CHI value of the word t and is calculated by Eq. (2). SM(t) indicates whether the word t has synonyms; if so, all of its synonyms are merged.
Firstly, all the texts in the training set are preprocessed, including Chinese word segmentation, part-of-speech tagging, and discarding of stop words. The remaining words constitute the word bag of the training set. Secondly, we calculate the CHI value of each word and choose the top 200 words from each category to form the candidate set of feature words. Note that the words selected for different categories may overlap, so the candidate set is stored in a HashSet and deduplicated. After obtaining the candidate set, the similarity between each pair of words is calculated according to "Tong YiCi Cilin", and synonym merging is performed only when the word similarity is greater than the threshold α. We will experimentally determine the optimal value of the hyper-parameter α. The pseudocode of SM-CHI is shown in Algorithm 1.
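The candidate-selection steps above can be sketched as follows. The toy data and the top-k cutoff of 2 (standing in for the paper's 200 words per category) are illustrative assumptions; `chi2` implements the standard statistic from Section 3.1.

```python
from collections import Counter

def chi2(N, A, B, C, D):
    """Standard chi-square statistic from the four document counts."""
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def select_candidates(docs, k=2):
    """docs: list of (word_set, label). Returns the deduplicated candidate set."""
    labels = {c for _, c in docs}
    vocab = {w for ws, _ in docs for w in ws}
    N = len(docs)
    candidates = set()                          # HashSet-style deduplication
    for c in labels:
        scored = []
        for t in vocab:
            A = sum(1 for ws, l in docs if l == c and t in ws)
            B = sum(1 for ws, l in docs if l != c and t in ws)
            C = sum(1 for ws, l in docs if l == c and t not in ws)
            D = N - A - B - C
            scored.append((chi2(N, A, B, C, D), t))
        # Keep the top-k words per category (200 in the paper).
        candidates.update(t for _, t in sorted(scored, reverse=True)[:k])
    return candidates

docs = [({"stock", "fund"}, "Finance"), ({"stock", "bank"}, "Finance"),
        ({"chip", "code"}, "IT"), ({"chip", "cpu"}, "IT")]
print(select_candidates(docs))
```

Synonym merging over the resulting candidate set then follows as described in Section 3.2.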
4.2 Improved method for calculating eigenvalue weight
1. Sum the weights of all items in the feature list of each dimension as the weight of the list;
2. Take the largest weight among the synonyms as the weight value of the feature;
3. Multiply the weight of the first item by 1.1 raised to the power of the number of items in the feature list.
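The three update rules can be written down directly. Here `group` holds the TF-IDF weights of the words in one merged synonym group; the reading of method 3 as "first weight times 1.1 to the power of the group size" follows the description in Section 5 and is our interpretation of the original wording.

```python
def weight_sum(weights):
    """Method 1: sum the weights of all synonyms in the group."""
    return sum(weights)

def weight_max(weights):
    """Method 2: take the largest weight among the synonyms."""
    return max(weights)

def weight_scaled(weights):
    """Method 3: first item's weight times 1.1 to the power of the group size."""
    return weights[0] * 1.1 ** len(weights)

group = [0.30, 0.45, 0.10]        # TF-IDF weights of one synonym group
print(weight_sum(group), weight_max(group), weight_scaled(group))
```

Experiment III in Section 5 compares exactly these three rules and finds method 2 (the maximum) best.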
5 Experiments and results
In this section, three groups of experiments are designed to evaluate the performance of three CHI formulas, the optimal threshold α for synonym merging, and the performance of feature weight update method. We use the whole news data set from Sogou Lab  to test the accuracy of the experiment.
5.1 Performance evaluation and data set
where TP is the number of documents correctly classified as class i, FP is the number of documents classified as class i but not actually i, and FN is the number of documents that is not classified as class i but is actually class i.
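The metric definitions themselves were lost in extraction; with TP, FP, and FN as defined above, the standard per-class precision, recall, and F1 formulas are:

```latex
\[
P_i = \frac{TP}{TP + FP}, \qquad
R_i = \frac{TP}{TP + FN}, \qquad
F1_i = \frac{2\,P_i R_i}{P_i + R_i}
\]
```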
This article uses the whole-network news dataset provided by Sogou Lab to test our experiments. The corpus includes nine kinds of news, such as Automobile, Finance, and IT, and each category contains thousands of documents. In this experiment, each category contributes 400 documents, of which 280 form the training set and 120 the test set. Therefore, the training set contains 2520 documents and the test set 1080 documents in total.
The preprocessing module uses a third-party Python library named jieba to complete the work of word segmentation, part-of-speech tagging, and discarding of stop words. In addition, we use the Naive Bayes classifier provided in Python's NLTK library as the classifier.
5.2 Experiments and results
In this section, the following three groups of experiments are carried out to test the three innovation points of SM-CHI, controlling one variable at a time. In Experiment I, we test the three feature selection algorithms without synonym merging. In Experiment II, we use the first improved CHI method to select features and use grid search to find the optimal threshold α. On the basis of Experiments I and II, we designed Experiment III to find the best weight update method.
5.2.1 Experiment I
From the results, we can see that the two improved CHI formulas greatly enhance the F1 score of each category compared to the original CHI formula, which means the improved formulas select more representative words. When the two improved formulas are compared with each other, the first has a slight advantage, showing better discrimination in the preceding categories. The result also shows that the log term successfully suppresses the CHI values of high-frequency words appearing in all classes, which achieves relatively good results. Therefore, we use the first improved CHI formula as our base feature selection method in the following experiments.
5.2.2 Experiment II
F1 score of each category when α takes different values
5.2.3 Experiment III
According to Fig. 6, the classification accuracy of method 1 is the lowest. The reason is that this method adds up the weights of all the synonyms as the weight of the feature, but a word and its synonyms may appear in more than one category, so this simple superposition makes the feature differentiate categories worse. In contrast, methods 2 and 3 use the merged synonym group as a one-dimensional feature and achieve better classification results and F1 scores. Of the two, method 2 is more effective, which shows that the maximum weight among the synonyms is the better choice because it represents the strongest ability of any synonym in the group to differentiate the text categories. By multiplying by 1.1 raised to the power of the group size n, method 3 incorrectly inflates the feature's ability to distinguish text categories.
In this paper, we propose the SM-CHI feature selection method based on the common methods used in Chinese text classification. This method mainly considers part-of-speech tagging, an improved CHI formula, and synonym merging. In addition, this paper proposes an update method for calculating the weights of feature words after synonym merging, to obtain a more accurate vector representation of the text for classifier processing. The experimental results prove that with this method the feature dimension can be reduced while the accuracy and effectiveness of text classification are improved.
In the future, we will focus on using a synonym similarity calculation method based on "HowNet" instead of "Tong YiCi CiLin", because "HowNet" uses more than 1500 sememes to build a unique knowledge description form with rich lexical semantic knowledge, allowing us to calculate word similarity more accurately. What is more, SVM and neural networks have become the mainstream classifiers in text classification systems because of their high accuracy, so we will use SVM or a neural network instead of the Naive Bayes classifier.
This work is supported in part by the Shandong Provincial Natural Science Foundation, China (Grant No. ZR2014FQ018), in part by the BUPT-SICE Excellent Graduate Students Innovation Fund, in part by the National Natural Science Foundation of China (Grant No. 61471056), and in part by the China research project on key technology strategy of infrastructure security for information network development. The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.
The idea arose from discussions between HY and PZ. CL did the simulation and code implementation and wrote the Chinese version of the paper under the guidance of PZ. HY wrote the Abstract and Conclusions. LW helped us translate the paper into English and made many changes. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- wikipedia, tf-idf. https://en.wikipedia.org/wiki/Tf%E2%80%93idf. Accessed 01 Sept 2017.
- HT Ng, WB Goh, KL Low, in International ACM SIGIR Conference on Research and Development in Information Retrieval. Feature selection, perceptron learning, and a usability case study for text categorization (ACM, Philadelphia, 1997), pp. 67–73.
- Y Yang, JO Pedersen, in Fourteenth International Conference on Machine Learning. A comparative study on feature selection in text categorization (Morgan Kaufmann Publishers Inc., 1997), pp. 412–420.
- GC Feng, S Cai, in Fourth International Conference on Computer, Mechatronics, Control and Electronic Engineering. An improved feature extraction algorithm based on CHI and MI (ICCMCEE, 2015).
- Y Tang, T Xiao, in International Conference on Computational Intelligence and Software Engineering. An improved χ² (chi) statistics method for text feature selection (IEEE, 2009), pp. 1–4.
- DD Lewis, M Ringuette, in Third Annual Symposium on Document Analysis & Information Retrieval. A comparison of two learning algorithms for text categorization (ISRI, 1994), pp. 81–93.
- Y Yang, An evaluation of statistical approaches to text categorization. Inf. Retr. J. 1(1), 69–90 (1999).
- E Velasco, LC Thuler, CA Martins, LM Dias, VM Gonçalves, Automated learning of decision rules for text categorization. ACM Trans. Inf. Syst. 12(3), 233–251 (1994).
- T Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features (Springer, Berlin, Heidelberg, 1998).
- J Bhimani, N Mi, M Leeser, Z Yang, in IEEE International Conference on Cloud Computing. FiM: Performance prediction model for parallel computation in iterative data processing applications (IEEE, 2017).
- E Wiener, J Pedersen, AS Weigend, A neural network approach to topic spotting. Proc. Fourth Ann. Symp. Document Anal. Inf. Retr. (SDAIR). 92(3), 482–487 (1995).
- KW Church, P Hanks, Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1989).
- JR Quinlan, Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986).
- WJ Wilbur, K Sirotkin, The automatic identification of stop words. J. Inf. Sci. 18(1), 45–55 (1992).
- Y Yang, in International ACM SIGIR Conference on Research and Development in Information Retrieval. Noise reduction in a statistical approach to text categorization (ACM, 1995), pp. 256–263.
- T Dunning, Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993).
- J Tian, W Zhao, Words similarity algorithm based on Tongyici Cilin in semantic web adaptive learning system. J. Jilin University. 28(06), 602–608 (2010).
- SJ Li, Word similarity computing based on How-net, The Third Chinese Mandarin Semantics Seminar (Taipei, 2002).
- YH Zheng, DZ Zhang, A text feature selection method based on Tongyici Cilin. J. Xiamen University. 51(2), 200–203 (2012).
- B Altınel, MC Ganiz, B Diri, A corpus-based semantic kernel for text classification by using meaning values of terms. Eng. Appl. Artif. Intell. 43(C), 54–66 (2015).
- B Altınel, B Diri, MC Ganiz, A novel semantic smoothing kernel for text classification with class-based weighting. Knowl. Based Syst. 89, 265–277 (2015).
- J Wang, T Wang, Z Yang, Y Mao, N Mi, B Sheng, in International Conference on Computing, Networking and Communications. SEINA: A stealthy and effective internal attack in Hadoop systems (IEEE, 2017).
- Z Yang, J Wang, D Evans, N Mi, in International Workshop on Communication, Computing, and Networking in Cyber-Physical Systems. AutoReplica: Automatic data replica manager in distributed caching and data processing systems (IEEE, 2016).
- J Wang, T Wang, Z Yang, N Mi, B Sheng, in IEEE International Performance Computing and Communications Conference. eSplash: Efficient speculation in large scale heterogeneous computing systems (IEEE, 2016).
- sogou, Sogou data. http://www.sogou.com/labs/resource/ca.php. Accessed 01 Sept 2017.
- S Qin, J Song, P Zhang, Y Tan, in International Conference on Fuzzy Systems and Knowledge Discovery. Feature selection for text classification based on part of speech filter and synonym merge (IEEE, 2015), pp. 681–685.