- Research
- Open Access
Data classification algorithm for data-intensive computing environments
- Tiedong Chen^{1},
- Shifeng Liu^{1},
- Daqing Gong^{1, 2} (corresponding author) and
- Honghu Gao^{1}
https://doi.org/10.1186/s13638-017-1002-4
© The Author(s). 2017
- Received: 29 August 2017
- Accepted: 1 December 2017
- Published: 20 December 2017
Abstract
Data-intensive computing has received substantial attention since the arrival of the big data era, but research on data mining in data-intensive computing environments is still in its initial stage. In this paper, a decision tree classification algorithm called MR-DIDC is proposed, based on the MapReduce programming framework and the SPRINT algorithm. MR-DIDC inherits the advantages of MapReduce, which make it more suitable for data-intensive computing applications. The performance of the algorithm is evaluated on an example data set. Experimental results show that MR-DIDC shortens operation time and improves accuracy in a big data environment.
Keywords
- Data-intensive
- Data mining
- MR-DIDC
- MapReduce
1 Introduction
Microblogs, personal blogs, and photo and video sharing based on Web 2.0 have produced large amounts of new data on the Internet and wireless networks; among these data, semi-structured XML and HTML documents and unstructured photos, audio, and video are becoming increasingly abundant. There are massive amounts of data in the fields of web search, commercial sales, biomedicine, natural observation, and scientific computing [1]. Although the data types are diverse, they share common characteristics: massive scale, rapid change, distributed storage, heterogeneity, and semi-structured or unstructured form. Outlier-mining algorithms based on data streams cannot satisfy these needs [2]; thus, data-intensive computing has emerged to satisfy the need to obtain, manage, analyze, and understand massive and rapidly changing data effectively [3].
In a data-intensive computing environment, massive amounts of data must be filtered, analyzed, and stored. Algorithm efficiency is not the only aim; distribution, effectiveness, and availability in a heterogeneous data environment must also be considered. Massive data sets that change rapidly require high data storage efficiency [4–10]. In addition to algorithm efficiency, we consider the effectiveness and feasibility of an algorithm in distributed and heterogeneous data environments [11–15]. Network transmission speed restricts the free transfer of data between computers. The bottleneck of data management and task analysis lies not only in computational capacity but also in data availability, namely, whether the network transfer speed can match the speed of system collection, processing, and analysis [16]. Obtaining the needed information is the current focus of data-intensive computing.
Clustering is one of the most important technologies in data mining. It aims to group physical or abstract objects into classes of similar objects. Jiang et al. applied the k-means clustering algorithm [17] to MapReduce [18] and realized a MapReduce-based transformation of the algorithm. Hong et al. presented DRICA (Dynamic Rough Increment Clustering Algorithm) [19] as an approach to the data-updating problem; they ensured the stability of the algorithm and avoided the inefficiency of rerunning the algorithm on all data. HDCH (High Dimensional Clustering on Hadoop), designed for clustering massive audio data by Liao et al. [20], uses a large cutting granularity in every dimension, thereby implementing clustering efficiently.
Most common classification algorithms are based on prior knowledge and forecast unknown data by extracting a data model [21]. Li et al. proposed classification analysis based on distributed data warehousing [22], addressing the limitation of GAC-RDB, which runs only on single machines. Because the Internet is becoming increasingly complicated, it is difficult to classify network traffic accurately; Li et al. therefore proposed an active collaborative method that combines feature selection with a link-filtering method [23]. To handle machine-learning data sets that cannot be processed in memory, Liu et al. presented a coordinate-descent l_{2}-norm LS-SVM classification algorithm [24], which modifies the objective function to transform a multi-objective optimization problem into a single-objective one.
Frequent-item-set mining is the most basic and important procedure in association rule mining, sequence pattern mining, relevance mining, and multi-layer mining [25, 26]. A cloud-based Apriori mining algorithm implements dual binary encoding of the transaction set and the item set under the MapReduce model; because the algorithm requires only Boolean AND and OR operations, its efficiency is high [27]. Li et al. proposed the CMFS algorithm [28], based on closed-item-set mining algorithms [29], which uses a constrained maximum-frequent-item-set depth-first strategy and a pruning algorithm. Hong et al. proposed the FIMDS algorithm [30], which can mine frequent item sets from distributed data; it stores data in a frequent-pattern tree structure, which facilitates obtaining item-set frequencies from the root of each partial FP-tree.
Outlier mining is one of the main approaches used in data mining. An outlier is a data object that differs from other data objects because it is produced by a different mechanism [31]. Since outlier mining was first proposed, it has been researched continuously. Most outlier mining algorithms in data-intensive computing expand and improve classic outlier mining algorithms. Recently, many outlier detection algorithms based on data streams have been proposed; however, research on applying them in data-intensive environments is still in its early stage: time complexity is high, reliable response speeds have not been achieved, result errors are large, and accuracy is unsatisfactory [32]. Therefore, research on mining algorithms for data-intensive computing environments is of great importance. Pan [33] parallelized the SPRINT algorithm on the Hadoop platform by employing a parallel method of constructing the decision tree. In this paper, systematic research along these lines is carried out. It focuses on integrating and improving an existing algorithm by proposing MR-DIDC, a decision-tree classification method based on the MapReduce coding framework and the SPRINT algorithm, which takes advantage of the outstanding features of MapReduce to make the approach more suitable for data-intensive environments.
2 Basic algorithm analysis
Shafer et al. proposed the SPRINT decision tree algorithm, based on SLIQ, in 1996 [34]. SPRINT combines the property list and the category list: the property list is used to store attribute values, a histogram is used to record the category distributions of the partitions before and after a given node is split, and a hash table is used, in place of SLIQ's class list, to record the sub-node assignment of each training tuple. The attribute list and histogram data structures are not required to reside in main memory, which removes the restriction that the training set fit in memory. The SPRINT algorithm is not only simple, accurate, and fast; it also improves on SLIQ's data structures, which makes mining problems easier to solve.
SPRINT algorithm flow
```
Function Partition(DataSet S)      {S: training set}
Begin
  If (all s ∈ S have the same class label) then
    Return
  Foreach attribute a ∈ A Do
    evaluate candidate splits on attribute a
  Split S into S1 and S2 using the best split attribute
  Partition(S1)
  Partition(S2)
End
```
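The recursive flow above can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: `split_finder` abstracts away the Gini-based split search, and all names are hypothetical.

```python
def partition(records, split_finder, min_size=1):
    """Recursively split a training set, SPRINT-style.

    records: list of (attributes_dict, label) tuples.
    split_finder: callable returning (attribute, threshold) for the best split.
    Returns a nested dict representing the binary decision tree.
    """
    labels = {label for _, label in records}

    def majority():
        # Label a leaf with the most common class at this node.
        return max(labels, key=lambda c: sum(1 for _, y in records if y == c))

    # Stop when the node is "pure" or too small to split further.
    if len(labels) == 1 or len(records) <= min_size:
        return {"leaf": True, "label": majority()}
    split = split_finder(records)
    if split is None:
        return {"leaf": True, "label": majority()}
    attr, threshold = split
    left = [r for r in records if r[0][attr] <= threshold]
    right = [r for r in records if r[0][attr] > threshold]
    return {
        "leaf": False, "attr": attr, "threshold": threshold,
        "left": partition(left, split_finder, min_size),
        "right": partition(right, split_finder, min_size),
    }
```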
The calculation of the best split point and the splitting of the property lists are the core tasks of the SPRINT algorithm. SPRINT uses the Gini index as the property measurement standard; the Gini index performs better and is easier to calculate than the information gain. For a data set T that contains n records from m classes, the Gini index is

Gini(T) = 1 − Σ_{j=1}^{m} p_{j}^{2},

where p_{j} is the relative frequency of class j in T. When a candidate split divides T into T_{1} (n_{1} records) and T_{2} (n_{2} records), the index of the split is

Gini_{split}(T) = (n_{1}/n) Gini(T_{1}) + (n_{2}/n) Gini(T_{2}).

The split with the minimum Gini_{split}(T) is the best split of set T. The inputs for calculating the Gini index are the histogram and the counting matrix.
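As a concrete illustration, the Gini index of a label set and the weighted index of a binary split can be computed as follows (a minimal sketch; the function names are ours, not from the paper):

```python
def gini(labels):
    """Gini index: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left_labels, right_labels):
    """Weighted Gini index of a binary split; the best split minimizes this."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)
```

A perfectly separating split scores 0, and a split that leaves both sides evenly mixed scores the same as the unsplit set.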
3 Modified algorithm
3.1 Data structure that is used by the MR-DIDC algorithm
The property list has the same function as in the SPRINT algorithm: it is used to record the distribution information. The training data set is decomposed into independent property lists based on the properties during the initialization period. Each property list consists of property values, category values, and index values. The number of property lists depends on the number of properties. Initially, the root node is used to maintain the whole property list. Each property list that corresponds to a continuous property is pre-ranked and the property lists are split as the number of nodes increases. The split property lists belong to the corresponding sub-nodes.
The piece histogram is a new data structure introduced by this algorithm to assist in calculating the split property and split points. Its form is similar to that of the histogram; however, the piece histogram records the number of records in each data chunk of the property list, as shown in part A of Fig. 6.
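The two data structures described above can be sketched as follows. This is an illustrative rendering under our own naming, not the paper's code: each property list holds (value, class, index) entries, continuous lists are pre-sorted, and the piece histogram counts records per fixed-size chunk of a list.

```python
def build_attribute_lists(records, continuous):
    """Decompose a training set into per-attribute lists of
    (attribute value, class label, record index) entries,
    pre-sorting the lists of continuous attributes."""
    attrs = records[0][0].keys()
    lists = {}
    for a in attrs:
        entries = [(x[a], y, i) for i, (x, y) in enumerate(records)]
        if a in continuous:
            entries.sort(key=lambda e: e[0])  # pre-rank continuous attributes
        lists[a] = entries
    return lists

def piece_histogram(attr_list, chunk_size):
    """Count the records that fall in each fixed-size data chunk of a list."""
    counts = {}
    for pos in range(len(attr_list)):
        chunk = pos // chunk_size
        counts[chunk] = counts.get(chunk, 0) + 1
    return counts
```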
3.2 MR-DIDC algorithm description
Build Classifier program flow
Require: NodeQueue NQ, TreeModel TM, training records (x, y) ∈ D, attribute set Att

```
1. T_root = new Node
2. Initiate(T_root, D, Att)
3. TM = T_root
4. NQ.push_back(T_root)
5. BuildTree(NQ)
```
Program flow of BuildTree
Require: NodeQueue NQ, TreeModel TM, training records (x, y) ∈ D

```
1.  For each T_curr ∈ NQ do
2.    If JudgeLeaf(T_curr) is false then
3.      bestSplit = FindBestSplit(T_curr)
4.      T_curr→splitAtt = bestSplit→splitAtt
5.      If bestSplit→splitAtt is categorical then
6.        T_curr→leftAttSet = bestSplit→leftAttSet
7.        T_curr→rightAttSet = bestSplit→rightAttSet
8.      Else
9.        T_curr→splitValue = bestSplit→splitValue
10.     partitionTrainingSet(T_curr→D, leftD, rightD)
11.     remove(T_curr→splitAtt)
12.     Create new nodes T_left, T_right
13.     Initiate(T_left, leftD, Att)
14.     Initiate(T_right, rightD, Att)
15.     T_curr→left = T_left
16.     T_curr→right = T_right
17.     NQ.push_back(T_left)
18.     NQ.push_back(T_right)
19.   Else
20.     T_curr→isLeaf = true
21.     T_curr→label = y   // y is the most common label
```
Each non-leaf node has only two branches, and only binary splitting is performed in the decision tree model. In the course of building the decision tree, T _{curr} denotes the tree nodes that are extracted from the global node queue and processed sequentially. First, we check whether T _{curr} is a leaf node: if the training data tuples at T _{curr} are “pure” (they all belong to the same class) or their number has reached the threshold value defined by the user, T _{curr} is labeled a leaf node. Under the first condition, the node is labeled with the class to which its tuples belong; under the second, the most common class among the node's tuples is used as the label. The third to tenth steps comprise the core of the decision tree modeling algorithm, which will be discussed in the next section: they calculate the split points and choose the best split property, and T _{curr} is labeled according to the best split information and split property. The training set is used to split the property list of T _{curr} into leftD and rightD, which are used to produce the sub-trees T _{left} and T _{right}; these are then added to the global node queue. Nodes in queue NQ are removed iteratively; when the queue is empty, the program ends and the decision tree model is complete.
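The queue-driven construction described above can be sketched in Python. This is a simplified rendering of the pseudocode under stated assumptions: `find_best_split` is a placeholder for the split search of the next section, and only continuous (threshold) splits are handled.

```python
from collections import deque, Counter

def build_tree(records, find_best_split, max_leaf_size=1):
    """Breadth-first BuildTree: nodes are popped from a global queue,
    tested for leaf status, split, and their children pushed back
    until the queue empties."""
    root = {"records": records}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        recs = node.pop("records")
        labels = [y for _, y in recs]
        # Leaf test: the node is "pure" or has reached the size threshold.
        if len(set(labels)) == 1 or len(recs) <= max_leaf_size:
            node["leaf"] = True
            node["label"] = Counter(labels).most_common(1)[0][0]
            continue
        attr, threshold = find_best_split(recs)
        node.update(leaf=False, attr=attr, threshold=threshold)
        node["left"] = {"records": [r for r in recs if r[0][attr] <= threshold]}
        node["right"] = {"records": [r for r in recs if r[0][attr] > threshold]}
        queue.append(node["left"])
        queue.append(node["right"])
    return root
```

Unlike the recursive SPRINT flow, the global queue makes each node expansion an independent unit of work, which is what allows the split search to be farmed out to MapReduce tasks.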
3.3 Calculation of the best split point
The algorithm for calculating the best split point is similar to that of SPRINT: the property lists are scanned to compute candidate split points for continuous properties, or the counting matrix for discrete properties. This algorithm uses MapReduce to parallelize the best-split-point computation, thereby improving efficiency. Assume that N Mapper tasks process the scan statistics, so that each Mapper task processes only 1/Nth of the training data set. The information aggregation and split-point calculation are performed by the Reducer, which returns the best split information and the split property.
Mapper task flow of FindBestSplit
Require: current node T_curr, attribute set Att, class set Y

```
1. For each A ∈ Att do
2.   Create class count array countY for Y
3.   Index = firstRecord(T_curr→D)
4.   For all (x, y) ∈ (T_curr→D, attribute list of A) do
5.     countY[findY(Y, y)]++
6.   Output((findA(A)), (Index, countY))
```
Reducer task flow of FindBestSplit
Require: key k, value set V, attribute set Att, class set Y

```
1. For all k do
2.   If Att[k] is continuous then
3.     For all distinct values ∈ V do
4.       If sameBlock(value[i]) then
5.         Output((k), (value[i], sumCount(value[i])))
```
The pseudocode above describes the Mapper tasks during the Map phase of FindBestSplit. Each Mapper gathers the category distribution information of its data chunk independently and outputs key–value pairs, in which the key is the subscript index of the property and the value consists of the data-chunk sequence number and the class counts.
In the Reduce phase, the outputs are collected into the histogram and block histogram. A second group of MapReduce tasks then calculates the Gini index: in its Map phase, the histogram and block histogram obtained in the previous Reduce phase are used to calculate the best split point of each available property; in its Reduce phase, the Map results are collected to select the best split point and split property.
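The two-phase scheme can be illustrated with a local, in-process simulation (not Hadoop API code; all names are ours): each "mapper" call builds a per-value class histogram for one data chunk, and the "reducer" merges the histograms and scans candidate thresholds for the minimal weighted Gini index.

```python
from collections import Counter, defaultdict

def map_class_counts(shard, attr):
    """Map phase (one call per data chunk): class histogram per attribute value."""
    hist = defaultdict(Counter)
    for x, y in shard:
        hist[x[attr]][y] += 1
    return hist

def reduce_best_split(partial_hists):
    """Reduce phase: merge the per-chunk histograms, then scan candidate
    thresholds and return (weighted Gini, threshold) for the best split."""
    merged = defaultdict(Counter)
    for hist in partial_hists:
        for value, counts in hist.items():
            merged[value].update(counts)
    values = sorted(merged)
    total = Counter()
    for v in values:
        total.update(merged[v])
    n = sum(total.values())

    def gini(counter, m):
        return 1.0 - sum((c / m) ** 2 for c in counter.values()) if m else 0.0

    best = None
    below = Counter()
    for v in values[:-1]:  # candidate split: records with value <= v go left
        below.update(merged[v])
        n_left = sum(below.values())
        above = total - below
        score = (n_left / n) * gini(below, n_left) \
              + ((n - n_left) / n) * gini(above, n - n_left)
        if best is None or score < best[0]:
            best = (score, v)
    return best
```

Because class counts are additive, the mappers can run on disjoint chunks and the reducer recovers exactly the statistics a single-machine scan would produce.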
4 Environment construction of the MapReduce algorithm
Hadoop clustering environment
A. Edit conf/masters, adding the master hostname (every host has a unique name). The specific commands are as follows:

```
~$ cd /home/administrator/hadoop-0.20.2
~$ gedit conf/masters
```

Write the following in the edit window, then save and close:

```
C0
```

B. Edit conf/slaves, adding the hostnames of all slaves:

```
~$ cd /home/administrator/hadoop-0.20.2
~$ gedit conf/slaves
```

Write the following in the edit window, then save and close:

```
C1
C2
C3
C4
C5
C6
C7
C8
```

C. Edit conf/hadoop-env.sh, setting the variable JAVA_HOME to the JDK installation directory:

```
~$ cd /home/administrator/hadoop-0.20.2
~$ gedit conf/hadoop-env.sh
```

Write the following in the edit window, then save and close:

```
export JAVA_HOME=/usr/local/java/jdk1.7.0_03
```

D. Configure the core-site.xml file:

```
~$ cd /home/administrator/hadoop-0.20.2
~$ gedit conf/core-site.xml
```

Write the configuration in the edit window, then save and close.

E. Configure the hdfs-site.xml file:

```
~$ cd /home/administrator/hadoop-0.20.2
~$ gedit conf/hdfs-site.xml
```

Write the configuration in the edit window, then save and close.

F. Configure the mapred-site.xml file:

```
~$ cd /home/administrator/hadoop-0.20.2
~$ gedit conf/mapred-site.xml
```

Write the configuration in the edit window, then save and close.
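The XML bodies for core-site.xml, hdfs-site.xml, and mapred-site.xml are not reproduced above. A typical minimal configuration for Hadoop 0.20.x looks like the following; this is our sketch, assuming C0 is the master node, and the ports and replication factor are conventional defaults, not values taken from the original setup.

```xml
<!-- core-site.xml: HDFS namenode address (C0 = master; port 9000 is a common choice) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://C0:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication factor for HDFS blocks -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

<!-- mapred-site.xml: JobTracker address (port 9001 is a common choice) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>C0:9001</value>
  </property>
</configuration>
```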
The installation and configuration of Master and Slave nodes in Hadoop are the same, so it is only necessary to complete the installation and configuration on the Master. The results are then transferred to the corresponding folder of every Slave through SSH passwordless public-key authentication (in this paper, the archives are unpacked in the administrator folder). Hadoop's configuration files are stored in /hadoop/conf. Configuring core-site.xml, hdfs-site.xml, and mapred-site.xml is necessary for version 0.20 and above. The masters and slaves files are configured by writing the hostnames of the Master and Slave node machines into them. Finally, hadoop-env.sh is configured by setting the environment variable JAVA_HOME to the root of the node machine's JDK installation. Steps A–F above constitute the Hadoop configuration process for the Master node.
After the file configuration process is finished, the Hadoop configuration of the Master node is complete. The Hadoop nodes on the Slaves do not need to be configured independently; it is only necessary to copy the Hadoop files from the Master into the corresponding directories of the Slave nodes through SSH passwordless public-key authentication. The specific command is as follows:
~$ scp -r /home/administrator/hadoop-0.20.2 ubuntu@C1:/home/administrator/
The above command copies the configured Hadoop directory from the Master to Slave C1; the configured Hadoop directories are copied to the other Slave machines similarly. Ideally, the installation locations of Hadoop and the JDK on each Slave correspond one-to-one with those on the Master; however, they are not always the same, so the JAVA_HOME value must be set according to the actual location on each machine.
5 Case study
The UCI repository is one of the main data sources for data mining. The KDD Cup 1999 data set from the UCI repository is used in this experiment; it comprises 4,000,000 records and 40 properties, of which 34 are continuous, 5 are discrete, and 1 is a class-label property (discrete).
To analyze the performance of the MR-DIDC algorithm, we evaluate the time efficiency, scalability, parallelism, and accuracy. The experimental data are the mean values of repeated experiments. The operation time consists of the algorithm operation time, I/O communication time, and data pre-processing time.
In experiment 1, the computation times of SPRINT and MR-DIDC are compared to evaluate the time performance and scalability of the MR-DIDC algorithm.
Operation times for training data sets of different sizes (unit: seconds)
Sample size | SPRINT algorithm | MR-DIDC algorithm
---|---|---
4,000,000 | 102.3 | 176.5
 | 197.4 | 310.3
 | 305.5 | 391.8
 | 412.5 | 452.6
 | 504.8 | 516.9
 | 611.1 | 552.4
 | 766.4 | 594.7
 | 977.2 | 619.8
In experiment 2, the parallelization performance of the MR-DIDC algorithm is evaluated.
Total operation time trends of the MR-DIDC algorithm
Sample size | Number of data nodes | Total operation time
---|---|---
4,000,000 | 2 | 857.6
4,000,000 | 3 | 664.0
4,000,000 | 4 | 526.8
4,000,000 | 5 | 431.7
4,000,000 | 6 | 385.1
4,000,000 | 7 | 356.9
4,000,000 | 8 | 327.4
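The node-count scaling in the table above can be summarized as speedup relative to the 2-node run, computed directly from the table's own figures (a small convenience script, not part of the original experiments):

```python
# Total operation times from the table (4,000,000 records, 2 to 8 data nodes).
times = {2: 857.6, 3: 664.0, 4: 526.8, 5: 431.7, 6: 385.1, 7: 356.9, 8: 327.4}

def speedup(times, base=2):
    """Speedup of each cluster size relative to the smallest (base-node) run."""
    return {n: round(times[base] / t, 2) for n, t in times.items()}
```

Quadrupling the nodes (2 to 8) yields roughly a 2.6x speedup, i.e., sub-linear scaling, which is consistent with the paper's later remark that the algorithm's structural complexity restricts performance.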
In experiment 3, the trends of the algorithm's property-list splitting time and split-point calculation time are examined.
Trends of algorithm property-list splitting time and split-point calculation time
Sample size | Property-list splitting time | Split-point calculation time
---|---|---
4,000,000 | 825.44 | 102.5
 | 643.57 | 174.54
 | 504.45 | 234.49
 | 421.86 | 295.32
 | 342.26 | 328.95
 | 281.61 | 366.81
 | 209.74 | 394.71
 | 156.65 | 427.49
Experiment 4 compares the test accuracies of the MR-DIDC and SPRINT algorithms.
Compared accuracies of MR-DIDC and SPRINT
Sample size | Number of data nodes | SPRINT | MR-DIDC
---|---|---|---
3,000,000 | 2 | 0.56 | 0.54
3,000,000 | 3 | 0.65 | 0.63
3,000,000 | 4 | 0.7 | 0.69
3,000,000 | 5 | 0.73 | 0.74
3,000,000 | 6 | 0.76 | 0.78
3,000,000 | 7 | 0.79 | 0.8
3,000,000 | 8 | 0.81 | 0.83
3,000,000 | 9 | 0.82 | 0.84
To conclude from these results: with accuracy comparable to SPRINT, MR-DIDC improves time efficiency as the data volume grows. The algorithm has good scalability and can satisfy the needs of massive-data processing. However, the structure of the algorithm is complicated, and that complexity becomes the main performance restriction.
6 Conclusions
With the development of big data, mining useful information has become a subject of great interest, and data-intensive environments have become an important context for big data mining research. Current research on data mining algorithms in data-intensive computing environments has concentrated on improving traditional large-scale algorithms. In this paper, a decision tree classification algorithm called MR-DIDC is introduced that is based on the SPRINT algorithm and the MapReduce computation framework and that is suitable for data-intensive computation. We tested the performance of the MR-DIDC algorithm experimentally. The results show that MR-DIDC has good scalability and a high level of data availability, and its running time is reduced when there are large amounts of data.
Declarations
Acknowledgements
We gratefully acknowledge the International Center for Informatics Research, Beijing Jiaotong University, China, which provided the simulation platform.
Availability of data and materials
Data were collected from the UCI database.
Funding
The study is supported by a project funded by the Beijing Natural Science Foundation (041501108), the China Postdoctoral Science Foundation (2016M591194), and the National Natural Science Foundation (71132008, 71390334). We greatly appreciate their support.
Authors’ contributions
The proposed algorithm was designed to be suitable for data-intensive calculations. We tested the performance of the MR-DIDC algorithm experimentally and proved that the MR-DIDC algorithm has good scalability and a high level of data availability. CT collected the data, LS planned and conducted the experiments, and GH analyzed the results. GD wrote the paper and we all approved the paper.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- R.E. Bryant, Data-intensive supercomputing: the case for DISC. Technical report CMU-CS-07-128. Available at http://repository.cmu.edu/compsci/258/
- C. Liu, H. Jin, W. Jiang, H. Li, Performance optimization based on MapReduce. J. Wuhan Univ. Technol. 20(32) (2010)
- Pacific Northwest National Laboratory, Data intensive computing project overview. https://www.pnnl.gov/publications/results.asp
- Z. Yang, M. Awasthi, M. Ghosh, N. Mi, A fresh perspective on total cost of ownership models for flash storage in datacenters, in Proc. 2016 IEEE Int. Conf. on Cloud Computing Technology and Science (CloudCom) (IEEE, 2016), pp. 245–252
- Z. Yang, J. Tai, J. Bhimani, J. Wang, N. Mi, B. Sheng, GReM: dynamic SSD resource allocation in virtualized storage systems with heterogeneous IO workloads, in Proc. 2016 IEEE 35th Int. Performance Computing and Communications Conf. (IPCCC) (IEEE, 2016), pp. 1–8
- J. Roemer, M. Groman, Z. Yang, Y. Wang, C.C. Tan, N. Mi, Improving virtual machine migration via deduplication, in Proc. 2014 IEEE 11th Int. Conf. on Mobile Ad Hoc and Sensor Systems (MASS) (IEEE, 2014), pp. 702–707
- J. Tai et al., Improving flash resource utilization at minimal management cost in virtualized flash-based storage systems. IEEE Trans. Cloud Comput. 5(3), 537–549 (2017)
- Z. Yang, J. Wang, D. Evans, N. Mi, AutoReplica: automatic data replica manager in distributed caching and data processing systems, in Proc. 2016 IEEE 35th Int. Performance Computing and Communications Conf. (IPCCC) (IEEE, 2016), pp. 1–6
- D. Gong, S. Liu, A holographic-based model for logistics resources integration. Stud. Inform. Control 22(4), 367–376 (2013)
- J. Bhimani, N. Mi, M. Leeser, Z. Yang, FiM: performance prediction for parallel computation in iterative data processing applications, in Proc. 2017 IEEE 10th Int. Conf. on Cloud Computing (CLOUD) (IEEE, 2017), pp. 359–366
- J. Bhimani, Z. Yang, M. Leeser, N. Mi, Accelerating big data applications using lightweight virtualization framework on enterprise cloud, in Proc. 2017 IEEE High Performance Extreme Computing Conf. (HPEC) (IEEE, 2017), pp. 1–7
- J. Wang et al., eSplash: efficient speculation in large scale heterogeneous computing systems, in Proc. 2016 IEEE 35th Int. Performance Computing and Communications Conf. (IPCCC) (IEEE, 2016), pp. 1–8
- J. Wang et al., SEINA: a stealthy and effective internal attack in Hadoop systems, in Proc. 2017 Int. Conf. on Computing, Networking and Communications (ICNC) (IEEE, 2017), pp. 525–530
- H. Gao et al., AutoPath: harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks, in Proc. 2017 26th Int. Conf. on Computer Communication and Networks (ICCCN) (IEEE, 2017), pp. 1–9
- T. Wang et al., EA2S2: an efficient application-aware storage system for big data processing in heterogeneous clusters, in Proc. 2017 26th Int. Conf. on Computer Communication and Networks (ICCCN) (IEEE, 2017), pp. 1–9
- R.T. Kouzes, G.A. Anderson, S.T. Elbert, et al., The changing paradigm of data-intensive computing. Computer 42(1), 26–34 (2009)
- R. Dash, D. Mishra, A.K. Rath, M. Acharya, A hybridized K-means clustering approach for high dimensional dataset. Int. J. Eng. Sci. Technol. 2(2), 59–66 (2010)
- J. Dean, S. Ghemawat, MapReduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)
- L. Hong, K. Luo, Rough incremental dynamic clustering method. Comput. Eng. Appl. 47(24), 106–110 (2011)
- S. Liao, Z. He, HDCH: audio data clustering system in the MapReduce platform. Comput. Res. Dev. 48(Suppl.), 472–475 (2011)
- S.D. Lee, B. Kao, R. Cheng, Reducing UK-means to K-means, in Proc. Seventh IEEE Int. Conf. on Data Mining Workshops (ICDM Workshops) (IEEE, 2007)
- W. Li, M. Li, Y. Zhang, et al., Classification analysis based on distributed data warehouse. Appl. Res. Comput. 30(10), 2936–2943 (2013)
- L. Li, J. Ouyang, D. Liu, et al., Active collaboration classification combining feature selection and link filtering. Comput. Res. Dev. 50(11), 2349–2357 (2013)
- J. Liu, J. Fu, S. Wang, et al., Coordinate-descent l2-norm LS-SVM classification algorithm. Pattern Recognition and Artificial Intelligence
- R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in Proc. ACM SIGMOD Int. Conf. on Management of Data (Washington, DC, 1993), pp. 207–216
- Y. Yan, Z. Li, H. Chen, Frequent item set mining algorithm. Computer Science 31(3), 112–114 (2004)
- Q. Wu, Apriori mining algorithm based on cloud computing. Comput. Meas. Control 20(6), 1653–1165 (2012)
- Y. Li, Q. Li, Maximal frequent itemsets mining algorithm based on constraints. Comput. Eng. Appl. 43(17), 160–163 (2007)
- M.J. Zaki, C.-J. Hsiao, CHARM: an efficient algorithm for closed itemset mining, in Proc. 2002 SIAM Int. Conf. on Data Mining (SDM'02) (Arlington, VA, 2002), pp. 457–473
- Y. Hong, Distributed sensor network data flow mining algorithm of frequent itemsets. Computer Science 40(2), 58–94 (2013)
- L. Su, W. Han, P. Zou, et al., Continuous kernel-based outlier detection over distributed data streams (Springer, Berlin, 2007), pp. 74–85
- P. Wang, D. Meng, J. Yan, B. Tu, Research development of computer programming model of data intensive computing. Comput. Res. Dev. 47(11) (2010)
- T.M. Pan, The performance improvements of SPRINT algorithm based on the Hadoop platform, in Advances in Future Computer and Control Systems (Springer, Berlin Heidelberg, 2012), pp. 63–68
- J. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in Proc. 22nd Int. Conf. on Very Large Data Bases (1996), pp. 544–555