Open Access

An efficient privacy protection in mobility social network services with novel clustering-based anonymization

EURASIP Journal on Wireless Communications and Networking 2016, 2016:275

https://doi.org/10.1186/s13638-016-0767-1

Received: 31 May 2016

Accepted: 2 November 2016

Published: 29 November 2016

Abstract

Social networks have grown rapidly in the last few years, and social network services have become a popular means of communication for online users. Facebook, Myspace, Twitter, LinkedIn, and others have created huge amounts of data about social network interactions. Meanwhile, the same trend holds offline with the rapid growth of mobile devices such as smartphones, tablets, and laptops used for social interaction. These mobile devices enlarge the traditional social network services platform and lead to an even greater amount of mobile social network data. Such data contain more private information about individuals, such as location, habits, and health condition. However, many analytical, sociological, and economic questions can be answered using these data, so mobility data managers are expected to share the data with researchers, governments, and/or companies.

Therefore, mobile social network data are badly in need of anonymization before they are shared or analyzed widely. k-anonymization is a well-known clustering-based anonymization approach. However, implementing this basic approach is challenging because much mobile social network data involves categorical attribute values. In this paper, we propose an approach for categorical data clustering that combines a rough entropy method with the DBSCAN clustering algorithm to improve the performance of the k-anonymization approach. It can deal with uncertainty in the clustering process and can effectively find arbitrarily shaped clusters. We present the proposed approach and discuss its credibility through theoretical studies and examples. Experimental results on two benchmark data sets obtained from the UCI Machine Learning Repository show that our approach outperforms Fuzzy Centroids, MMeR, SDR, ITDR, and related methods with respect to the local and global purity of clusters. Since the clustering algorithm is a key component of k-anonymization for mobile social network data, our experimental results show that the proposed algorithm can more effectively balance the utility of the mobile social network data and the performance of anonymization.

Keywords

DBSCAN algorithm; Rough entropy; Cluster purity; Categorical data clustering; k-anonymization; Mobile social network

1 Introduction

Social network services are an evolving platform focused on building and maintaining social relations among people who share common activities or interests. Popular social network services, such as Facebook, Twitter, Myspace, and LinkedIn, are the basic carriers of a multi-dimensional space that shapes a virtual society reflecting people's real life and daily status. Moreover, with the rapid growth of mobile devices, huge processing power is available to software applications. These devices can supply valuable user information to social networks and share the user's geographical location coordinates. However, the mobile social network data uploaded by these devices are very sensitive: they contain a great deal of private information, which can lead to privacy leaks. Meanwhile, high-quality mobile social network data are of interest to researchers and companies in many disciplines, such as sociology, psychology, marketing, or habit research. Therefore, these data need to be effectively anonymized to protect private information before they are published.

In recent years, simple and practical privacy-preserving anonymization approaches [1–7] were proposed to prevent privacy leaks and to guard against identifying individuals. However, a serious issue is that anonymization substantially decreases data quality (information loss). k-anonymization assigns all records to groups so that each group contains at least k records, and the observations in the same group are similar or identical in the values of their quasi-identifier. Hence, the efficiency and accuracy of this assignment affect the information loss and the performance [6] of an anonymization algorithm. Clustering is a useful technique that partitions a set of instances into subsets (called clusters) so that observations in the same cluster are similar to each other, and a good clustering algorithm yields a highly accurate assignment. Hence, clustering is a key component of k-anonymization: if we improve the performance of the clustering algorithm, we obtain efficient k-anonymization that reduces the information loss of the data [6]. Most of the literature adopts k-means-based clustering algorithms for k-anonymization [4, 6, 7]. However, k-means and most other clustering algorithms are designed for clustering numerical data using some distance function, so clustering mobile social network data, which mostly involve categorical values, remains a big challenge. Moreover, these data often have no sharp boundary between clusters, so an algorithm must be designed to handle uncertainty in the clustering process. Huang [8] and Kim et al. [9] applied fuzzy sets to clustering categorical data to address the uncertainty issue. Shortly afterward, in 2009, Kumar et al. [10] proposed an algorithm (MMeR) that uses basic rough set concepts to handle categorical attribute values as well as the uncertainty of data sets. In 2011, Panda et al. [5] used MMeR in place of the clustering stage of OKA [6] and showed that MMeR greatly improves the performance of k-anonymization for mobile social network data. Also in 2011, Tripathy et al. proposed a method termed standard deviation roughness (SDR) [11] for clustering categorical data. In 2015, Park et al. proposed ITDR [12], and their experimental results demonstrate higher performance than previous research.

The DBSCAN clustering algorithm was proposed by Ester et al. in 1996 [13]. Although it employs a distance function for clustering numerical data, it requires no initial points or cluster number, and it can detect clusters of arbitrary shape. Therefore, to improve the performance of categorical data clustering for mobile social networks, our work builds on the basic theory of the DBSCAN algorithm.

Rough set theory, proposed by Pawlak in 1982 [14, 15], has been applied successfully to data mining, machine learning, pattern recognition, and feature selection [16–19]. The entropy of information theory, proposed by Shannon [20], is a useful mechanism for measuring uncertainty in rough sets. Therefore, many papers combine rough set theory with Shannon's entropy for data labeling and outlier detection [21–23]. In particular, Reddy et al. present a data labeling method based on cluster purity using relative rough entropy for categorical data clustering [23]. They apply an arbitrary clustering algorithm to cluster categorical data into several clusters and then use their proposed method to cluster unlabeled data; their experimental results demonstrate satisfactory performance. We employ their cluster purity theory in a different way to design a novel clustering algorithm for mobile social network categorical data.

In this paper, we propose a rough entropy method combined with the DBSCAN algorithm for clustering mobile social network data, to improve the performance of the k-anonymization approach. We employ the DBSCAN algorithm with a rough entropy method that calculates the purity of a cluster, which handles uncertainty and improves the performance of categorical data clustering. After tentatively adding one data point, if the cluster purity decreases only to an acceptable level (threshold λ), we add the point to the cluster. Subsequently, the DBSCAN algorithm recognizes the next core point and generates new clusters. Finally, after merging clusters that share common data points, we obtain the final clusters. We show that the proposed method achieves higher local and global purity than the Fuzzy Centroids, MMeR, SDR, and ITDR techniques [9–12]. Therefore, our clustering algorithm can improve the efficiency and accuracy of assignment for k-anonymization.

The rest of this paper is organized as follows: Section 2 introduces mobile social network data, rough set theory, rough entropy, and the DBSCAN algorithm. Section 3 introduces our proposed method. In Section 4, a data set adapted from [23] is used to illustrate our algorithm. Section 5 compares the performance of our proposed algorithm with related algorithms using the concepts of local and global purity to show that our algorithm is more efficient for mobile social network categorical data clustering. Finally, Section 6 concludes this paper.

2 Related Work

2.1 Mobile social network data

With the rapid development of mobile devices, users upload personal comments and share their location, habits, and emotions on social network services, reflecting their real lives more conveniently. These mobile devices therefore enlarge the traditional social network service platform and lead to huge amounts of mobile social network data. These data are of interest to governments and companies for big data analysis. The comments, locations, and publication times in mobile social network data can be used to analyze the popularity of tourist attractions across the seasons so as to improve the quality of the tourism environment. People's interests and habits can be used to design popular products, and governments can analyze users' comments or emotions to adjust policies, etc. However, publishing these huge amounts of data leads to privacy leaks, and records can even be linked to individuals. Hence, how to balance data quality and data privacy has become a challenging issue for mobility data managers before they publish mobile social network data.

k-anonymization assigns records to groups so that each group contains at least k records. The observations in the same group are indistinguishable in their privacy-related attributes (the quasi-identifier), so they cannot be linked to an individual, thereby protecting private information. Consider the data in Table 1, where live location, birth year, citizenship, and nickname are regarded as quasi-identifiers. It is easy to see that there is exactly one person whose nickname is chen, who was born in 1988, lives in the Gwangjin district of Seoul, and is originally from China. If the table contained more details, such as workplace, educational background, religion, etc., or if other published data could be linked through the quasi-identifiers, we could learn more about chen and even identify the individual. Consequently, this user's private information, such as the time, location, and contents of posted comments, habits, and emotions, would be leaked.
Table 1

Mobile social network data

User | Live location     | Birth year | Citizenship | Nickname | ...
1    | Seoul, Gangnam    | 1980       | Korean      | kim      | ...
2    | Seoul, Dongdaemun | 1985       | Korean      | park     | ...
3    | Seoul, Gangnam    | 1982       | Korean      | lee      | ...
4    | Seoul, Dongdaemun | 1985       | Korean      | jin      | ...
5    | Seoul, Jongno     | 1982       | Korean      | jin      | ...
6    | Seoul, Gwangjin   | 1988       | China       | chen     | ...

Privacy information (for every user): the time, location, and contents of posted comments, habit, emotion, etc.

The purpose of k-anonymization is to hide or generalize the values of the quasi-identifier within each cluster before mobility data managers publish the data, so as to protect personal privacy. Therefore, clustering can be treated as a key component of k-anonymization for mobile social network data.
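The hide-or-generalize step can be sketched as follows (a minimal illustration, not the paper's exact procedure; the record layout and the `"*"` suppression symbol are our own assumptions, and real systems use domain-specific generalization hierarchies instead of plain suppression):

```python
def generalize(cluster, quasi_ids):
    """Generalize quasi-identifiers within one cluster.

    cluster: list of dict records; quasi_ids: keys to generalize.
    A value is kept only if every record in the cluster shares it;
    otherwise it is suppressed with '*'.
    """
    out = []
    for rec in cluster:
        anon = dict(rec)
        for q in quasi_ids:
            values = {r[q] for r in cluster}
            anon[q] = values.pop() if len(values) == 1 else "*"
        out.append(anon)
    return out
```

With a cluster of at least k such records, the published rows become indistinguishable on their quasi-identifiers, which is exactly the k-anonymity requirement.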

2.2 Rough set theory

In rough set theory, a set of data points is stored in a table referred to as an information system. An information system is defined as a quadruple IS = (U, A, V, f), where U is a non-empty finite set of data points, A is a non-empty finite set of attributes, V is the union of the attribute value sets, and f : U × A → V is an information function that associates a unique value of each attribute with every object belonging to U, such that for any a ∈ A and x ∈ U, f(x, a) ∈ V_a, where V_a is called the value set of attribute a [24–26].

Definition 1

For an information system IS = (U, A, V, f) and any P ⊆ A, there is an associated equivalence relation IND(P), defined as follows [24]:
$$ \text{IND}(P)=\{{(x,y)\in U^{2}: \forall a \in P, f(x,a)=f(y,a)}\} $$
(1)

The relation IND(P) is called a P-indiscernibility relation. The partition of U into the family of all equivalence classes of IND(P) is denoted by U/IND(P). If (x, y) ∈ IND(P), then objects x and y are indiscernible from each other by the attributes in P. The equivalence class of x under the P-indiscernibility relation is denoted \([x]_{P}^{U}\).

2.3 Rough entropy

The entropy put forward by Shannon [20], as an effective measure of uncertainty, has been widely used for characterizing information content in many fields.

Rough entropy is an extension of entropy to measure the uncertainty in rough sets.

Definition 2

Given an information system IS = (U, A, V, f) and any P ⊆ A, let IND(P) be the equivalence relation with partition U/IND(P) = {P 1, P 2, …, P m }. The rough entropy RE(P) of the equivalence relation IND(P) is defined by [23]:
$$ \text{RE}(P)=-\sum_{i=1}^{m}\frac{\left | P_{i}\right |}{\left | U \right |}\log\frac{1}{\left | P_{i} \right |} $$
(2)

Here, |P i |/|U| denotes the probability of any x ∈ U being in the equivalence class P i (1 ≤ i ≤ m), and |M| denotes the cardinality of a set M.

2.4 DBSCAN algorithm

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Ester et al. in 1996 [13]. Consider a set of points in some space to be clustered. For the purpose of DBSCAN clustering, the points are classified as core points, border points, and noise [27, 28], as follows:
  • A point p is a core point if it has at least a specified number of points (MinPts) within distance ε (Eps).

  • A border point has fewer than MinPts within ε (Eps), but is in the neighborhood of a core point p.

  • A noise point is any point that is not a core point or a border point.

In Fig. 1 [27], we set minPts = 3 and ε (Eps) = 1. Red points are core points, since at least 3 points lie within an ε radius of each. Because the core points are all reachable from one another, they form a single cluster. Points x 7 and x 8 are border points, reachable from points x 5 and x 6, and thus also belong to the cluster. Point x 9 is a noise point.
Fig. 1

Example of DBSCAN algorithm

To summarize, if clusters share a common data point (are reachable from each other), they can be merged into a new cluster (x 9 is not reachable from any other point, so it is a noise point).
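The three point classes above can be sketched for numerical data as follows (a minimal sketch with Euclidean distance; here a point counts as its own neighbor, a convention that varies between formulations):

```python
from math import dist  # Euclidean distance, Python 3.8+

def classify_points(points, eps, min_pts):
    """Label each point index as 'core', 'border', or 'noise' (DBSCAN definitions)."""
    # eps-neighborhoods; each point is counted as its own neighbor
    neigh = {i: [j for j, q in enumerate(points) if dist(p, q) <= eps]
             for i, p in enumerate(points)}
    core = {i for i, n in neigh.items() if len(n) >= min_pts}
    labels = {}
    for i in range(len(points)):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neigh[i]):
            labels[i] = "border"  # not dense itself, but next to a core point
        else:
            labels[i] = "noise"
    return labels
```

A full DBSCAN implementation would then merge core points reachable from one another into clusters, as described above.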

3 DBSCAN based on rough entropy

The rough entropy method is usually used for data labeling and outlier detection; little attention has been given to its use in categorical data clustering. In the following, we introduce the procedure for categorical data clustering using rough entropy with the DBSCAN algorithm.

Definition 3

The ratio between the total rough entropy (TotalRE) and the maximum total rough entropy (MaxRE) of a cluster is defined as the cluster purity [23]:
$$ \text{Purity}(C_{i})=(\text{TotalRE}(C_{i}))/\text{MaxRE} $$
(3)

The total rough entropy of a cluster is the sum of the rough entropies over all of its attributes.

Definition 4

The maximum total rough entropy (MaxRE) of any cluster C i over all attributes is defined as MaxRE(C i ) = K × log(m), where K is the number of attributes and m is the number of data points in cluster C i [23].
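Definitions 2–4 can be sketched directly in code (a minimal sketch; base-10 logarithms are assumed, since the paper's worked values, e.g. RE = 0.3010 for two identical values, match log₁₀):

```python
from collections import Counter
from math import log10

def rough_entropy(values):
    """RE over one attribute (formula 2): partition the values by equality."""
    n = len(values)
    return -sum((c / n) * log10(1.0 / c) for c in Counter(values).values())

def cluster_purity(cluster):
    """Purity(C_i) = TotalRE / MaxRE (formula 3 and Definition 4).

    cluster: list of records, each a tuple of categorical attribute values.
    Defined only for m >= 2 points (MaxRE = 0 when m = 1).
    """
    k = len(cluster[0])   # K: number of attributes
    m = len(cluster)      # m: number of data points
    total_re = sum(rough_entropy([rec[a] for rec in cluster]) for a in range(k))
    return total_re / (k * log10(m))
```

For the two records compared in the worked example of Section 4, this yields TotalRE = 0.602, MaxRE = 0.903, and purity 0.667.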

In this section, we introduce our algorithm, called the RE-based DBSCAN algorithm; the related definitions were discussed above. The DBSCAN algorithm can automatically find core points and generate clusters based on these points. Since a distance function is not suitable for categorical data, we employ the variation of cluster purity (calculated by rough entropy) in place of the parameter ε (Eps).

Some social network data may contain a few numerical values. We can therefore group numerical values into several equal intervals between the minimum and maximum value, converting the numerical values to categorical values.
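One simple way to do this (a sketch; the bin count and interval labels are our own choices, not specified in the paper) is equal-width binning between the minimum and maximum value:

```python
def to_intervals(values, bins=3):
    """Convert numeric values to categorical interval labels by equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1                 # avoid zero width when all values are equal
    labels = []
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        labels.append(f"[{lo + i * width:g},{lo + (i + 1) * width:g})")
    return labels
```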

After the DBSCAN algorithm recognizes a core point, an information system IS = (U, A, V, f) is obtained, although this system initially has only one data point in U (regarded as cluster C i ). After sorting the remaining data points by the corresponding cluster purity (calculated between the core point and each remaining point), we obtain a sorted sequence (x 1, x 2, …, x k, …, x n ). Since clustering partitions a set of instances into groups whose observations are most similar to each other, objects in a cluster need not be identical. Therefore, for each data point x k of the sorted sequence, we add x k to cluster C i if the cluster purity decreases only to an acceptable level (threshold λ). Otherwise, the DBSCAN algorithm recognizes the next core point and generates a cluster by the same method. Finally, after merging clusters that share common data points, we obtain the final clusters. If the number of points in a cluster is smaller than a threshold v, we merge this cluster into the new cluster with which it has the highest cluster purity.

The RE-based DBSCAN algorithm treats every point as a core point and estimates the cluster purity to build clusters. The evaluation criterion λ (analogous to ε in the DBSCAN algorithm) is the same for every point. As in DBSCAN, if the value of λ is too large, we suffer from overfitting. However, our algorithm uses the cluster purity to partition the clusters. We can set the threshold λ = (N × log(m))/(K × log(m)) with N ≤ K. If N = K, λ equals 1, meaning all values of all attributes are identical; conversely, all values of all attributes are different when N = 0. Since we do not know the actual number of clusters of an unseen data set, we can estimate the distribution of clusters to control the threshold value λ so as to avoid overfitting and obtain the target clusters.
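The cluster-growing step of the procedure above can be sketched as follows (a minimal sketch; base-10 logarithms are assumed to match the paper's worked values, records are tuples of categorical attribute values, and the stopping rule is the one applied in Section 4: stop when purity falls below λ or increases after an addition):

```python
from collections import Counter
from math import log10

def purity(cluster):
    """Purity(C_i) = TotalRE / MaxRE for a list of attribute tuples (formulas 2-3)."""
    k, m = len(cluster[0]), len(cluster)
    total = 0.0
    for a in range(k):
        counts = Counter(rec[a] for rec in cluster).values()
        total += -sum((c / m) * log10(1.0 / c) for c in counts)
    return total / (k * log10(m))

def grow_cluster(core, rest, lam):
    """Grow a cluster around one core point.

    Rank the remaining points by pairwise purity with the core, then keep
    adding points until purity drops below lam or starts to increase.
    """
    ranked = sorted(rest, key=lambda p: purity([core, p]), reverse=True)
    cluster, prev = [core], None
    for p in ranked:
        pur = purity(cluster + [p])
        if pur < lam or (prev is not None and pur > prev):
            break
        cluster.append(p)
        prev = pur
    return cluster
```

On the records of Table 2 with λ = 0.67 and core point x 1, this reproduces the cluster {x 1, x 8, x 2, x 3} derived step by step in Section 4.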

4 Certification by example

In this section, we work through an example of mobile social network data to facilitate understanding and to confirm the reliability of our clustering algorithm. Table 2 shows such a data set, where attribute a is the live location, b is the birth year, and c is the citizenship.
Table 2

A data set with 12 data points

User | Live location (a) | Birth year (b) | Citizenship (c)
x 1  | Korea, Seoul      | 1988           | Korean
x 2  | Korea, Busan      | 1988           | Korean
x 3  | Korea, Seoul      | 1988           | Chinese
x 4  | Korea, Seoul      | 1988           | Chinese
x 5  | Germany, Berlin   | 1988           | German
x 6  | Germany, Munich   | 1988           | German
x 7  | Germany, Berlin   | 1999           | German
x 8  | Korea, Seoul      | 1988           | Korean
x 9  | Korea, Busan      | 2000           | American
x 10 | Germany, Munich   | 1999           | German
x 11 | Korea, Incheon    | 2000           | American
x 12 | Germany, Munich   | 1988           | German

Privacy information (for every user): the time, location, and contents of posted comments, habit, emotion, etc.

Given an information system IS = (U, A, V, f), where U = {x 1, x 2, …, x 12} and A = {a, b, c}, let the thresholds be λ = 0.67 and v = 2.

For core data point x 1 of DBSCAN, calculate the cluster purity between x 1 and each of the remaining points, then sort those points by cluster purity value.

In Table 3, for data points x 1 and x 2, the partitions induced by all singleton subsets a ∈ A are calculated by formula (1):
$$U/\text{IND}(a)=\{\{x_{1}\}, \{x_{2}\}\} $$
$$U/\text{IND}(b) = \{\{x_{1}, x_{2}\}\} $$
$$U/\text{IND}(c) = \{\{x_{1}, x_{2}\}\} $$
Table 3

Data points in cluster with two unsorted data points

User | Live location | Birth year | Citizenship
x 1  | Korea, Seoul  | 1988       | Korean
x 2  | Korea, Busan  | 1988       | Korean
According to formula 2, the rough entropy of each attribute of attribute set A can be calculated as follows:
$$\text{RE}(a)=-\sum_{i=1}^{m}\frac{\left | P_{i}\right |}{\left | U \right |}\log\frac{1}{\left | P_{i} \right |} = -\left(\frac{1}{2}\log\frac{1}{1}+\frac{1}{2}\log\frac{1}{1}\right) = 0 $$
$$\text{RE}(b)=-\sum_{i=1}^{m}\frac{\left | P_{i}\right |}{\left | U \right |}\log\frac{1}{\left | P_{i} \right |} = -\left(\frac{2}{2}\log\frac{1}{2}\right) = 0.3010 $$
$$\text{RE}(c)=-\sum_{i=1}^{m}\frac{\left | P_{i}\right |}{\left | U \right |}\log\frac{1}{\left | P_{i} \right |} = -\left(\frac{2}{2}\log\frac{1}{2}\right) = 0.3010 $$
The maximum total rough entropy of cluster C i is calculated by Definition 4:
$$\text{MaxRE} = K\times \text{log}(m) = 3\text{log}(2) = 0.9031 $$
The total rough entropy of the two data points is calculated as follows:
$$\text{TotalRE} = \text{RE}(a)+\text{RE}(b)+\text{RE}(c) = 0.602 $$
Therefore, the cluster purity of the two data points is calculated using formula (3) as follows:
$$\text{Purity}(C_{i})=(\text{TotalRE}(C_{i}))/\text{MaxRE} = 0.667 $$

Analogously, we find that the cluster purity values between x 1 and the remaining data points (x 2, …, x 12) are (0.667, 0.667, 0.667, 0.333, 0.333, 0, 1.0, 0, 0, 0, and 0.333). Sorting the data points by these cluster purity values gives (x 8, x 2, x 3, x 4, x 5, x 6, x 12, x 7, x 9, x 10, and x 11).

Take x 1 as the core point of DBSCAN; after adding data point x 8, the data points of cluster C i are shown in Table 4:
Table 4

Data points in cluster with two sorted data points

User | Live location | Birth year | Citizenship
x 1  | Korea, Seoul  | 1988       | Korean
x 8  | Korea, Seoul  | 1988       | Korean
Analogously, we calculate the cluster purity as 1.0, which means all points in cluster C i are identical to each other; the cluster is “clean”. Therefore, we add data point x 8 to cluster C i.

After adding data point x 2, cluster C i is U = {x 1, x 8, x 2}. Correspondingly, the cluster purity is 0.807, which is larger than the threshold λ; therefore, we add data point x 2 to cluster C i. Similarly, we continue to add sorted data points until the cluster purity falls below the threshold λ or the cluster purity increases. After adding data point x 3, the data points of cluster C i are shown in Table 5:
Table 5

Data points in cluster with 4 sorted data points

User | Live location | Birth year | Citizenship
x 1  | Korea, Seoul  | 1988       | Korean
x 8  | Korea, Seoul  | 1988       | Korean
x 2  | Korea, Busan  | 1988       | Korean
x 3  | Korea, Seoul  | 1988       | Chinese

The partitions induced by all singleton subsets a ∈ A are calculated by formula (1):
$$U/\text{IND}(a)=\{\{x_{1},x_{8},x_{3}\}, \{x_{2}\}\} $$
$$U/\text{IND}(b) = \{\{x_{1},x_{8},x_{3},x_{2}\}\} $$
$$U/\text{IND}(c)=\{\{x_{1},x_{8},x_{2}\}, \{x_{3}\}\} $$
Analogously, according to formula 2, the rough entropy of each attribute of attribute set A can be calculated as follows:
$$\text{RE}(a)=-\sum_{i=1}^{m}\frac{\left | P_{i}\right |}{\left | U \right |}\log\frac{1}{\left | P_{i} \right |} = -\left(\frac{3}{4}\log\frac{1}{3}+\frac{1}{4}\log\frac{1}{1}\right) = 0.358 $$
$$\text{RE}(b)=-\sum_{i=1}^{m}\frac{\left | P_{i}\right |}{\left | U \right |}\log\frac{1}{\left | P_{i} \right |} = -\left(\frac{4}{4}\log\frac{1}{4}\right) = 0.602 $$
$$\text{RE}(c)=-\sum_{i=1}^{m}\frac{\left | P_{i}\right |}{\left | U \right |}\log\frac{1}{\left | P_{i} \right |} = -\left(\frac{3}{4}\log\frac{1}{3}+\frac{1}{4}\log\frac{1}{1}\right) = 0.358 $$
The maximum total rough entropy of cluster C i is calculated by Definition 4:
$$\text{MaxRE} = K\times \text{log}(m) = 3\text{log}(4) = 1.8062 $$
The total rough entropy of the four data points is calculated as follows:
$$\text{TotalRE} = \text{RE}(a)+\text{RE}(b)+\text{RE}(c) = 1.318 $$
Therefore, the cluster purity is calculated using formula (3) as follows:
$$\text{Purity}(C_{i})=(\text{TotalRE}(C_{i}))/\text{MaxRE} = 0.729 > \lambda $$

Therefore, add data point x 3 into cluster C i .

After adding data point x 4, cluster C i is U = {x 1, x 8, x 2, x 3, x 4}. Correspondingly, the cluster purity is 0.757. Although this value is above the threshold λ, it increased after adding x 4, which means this point is more similar to other points (such as x 3) than to the core point x 1. Therefore, we stop the procedure and obtain one cluster containing {x 1, x 8, x 2, x 3}. Similarly, the cluster of each core point can be found by repeating the above method.

Finally, following the DBSCAN algorithm, we merge the small clusters that share common data points. The result of our proposed algorithm is C 1 = {x 1, x 2, x 3, x 4, x 8}, C 2 = {x 5, x 6, x 7, x 10, x 12}, and C 3 = {x 9, x 11}.

The data set in Table 2 illustrates our mobile social network data clustering algorithm. We use the rough entropy method to calculate the similarity between data points and employ the basic idea of the DBSCAN method, merging clusters that share common data points to generate the target clusters. Our algorithm analyzes the change in cluster purity after attempting to add each new data point: if the reduction in cluster purity stays within the acceptable level λ, the data point is added to the cluster. Likewise, if the number of instances in a cluster is below the minimum threshold v, that cluster is merged into another cluster by comparing cluster purities.

5 Experimental results

To show that our proposed clustering algorithm performs well enough to balance the utility of data and the efficiency of anonymization, we implemented the algorithm in Java and tested it on several data sets obtained from the UCI Machine Learning Repository that were used in previous clustering works [12]. The local and global purity of clusters is used to measure cluster quality. The global purity is defined as:
$$\mathrm{Global\; purity}=\frac{\text{number of data points occurring in their corresponding class}}{\text{number of data points in the data set}} $$
The local purity is defined as:
$$\mathrm{Local \; purity}=\frac{\sum_{i=1}^{m}\text{Purity}(i)} {m} $$
m is the number of clusters obtained from the proposed algorithm.
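Both measures can be computed from a confusion matrix such as the ones in Tables 6 and 7 (a sketch under our reading that rows are predicted clusters, columns are actual classes, and each cluster's purity is its largest class share):

```python
def purity_scores(matrix):
    """Return (local purity, global purity) as defined above.

    matrix[i][j]: number of points of actual class j placed in predicted cluster i.
    """
    local = [max(row) / sum(row) for row in matrix]                    # per-cluster purity
    glob = sum(max(row) for row in matrix) / sum(sum(row) for row in matrix)
    return sum(local) / len(local), glob
```

For the confusion matrix of Table 7, this yields local purity ≈ 88.4% and global purity ≈ 92%, the values reported below for the Zoo data set.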

According to these measures, higher global and local purity indicate better clustering performance; if both values are 1, the algorithm clusters every data point into its corresponding cluster perfectly. In this section, our proposed algorithm is compared with six algorithms on the Soybean and Zoo data sets to demonstrate its superiority and efficiency.

Experiment 1. The Soybean data set contains 47 objects. Each data point describes soybean disease information using 35 categorical attributes. The data set represents four kinds of soybean diseases: 17 objects describe Phytophthora Rot, 10 describe Diaporthe Stem Canker, 10 describe Charcoal Rot, and 10 describe Rhizoctonia Root Rot [10–12]. Since there are four kinds of diseases, our algorithm generates four clusters. Table 6 summarizes the results of our algorithm on the Soybean data set.
Table 6

Local purity of clusters in Soybean data

Clusters | C 1 | C 2 | C 3 | C 4 | Sum | Purity
1        | 10  | 0   | 0   | 0   | 10  | 1
2        | 0   | 10  | 0   | 0   | 10  | 1
3        | 0   | 0   | 10  | 0   | 10  | 1
4        | 0   | 0   | 0   | 17  | 17  | 1

Table 6 shows the actual and predicted data distribution produced by the clustering algorithm. Each column of the table (C 1, C 2, ..., C 4) represents the instances in an actual class, and each row (1, 2, ..., 4) represents the instances in a cluster predicted by the clustering algorithm on the Soybean data set.

As shown in Table 6, all 47 objects are correctly classified into their corresponding clusters. Thus, the local and global purity of the clusters is 100%. Kim et al. [9], Kumar et al. [10], Tripathy et al. [11], and Park et al. [12] have evaluated different categorical data clustering algorithms on this data set. We apply our algorithm to the same data set, so we can compare against these algorithms to check the superiority and efficiency of our proposed approach.

Figure 2 shows that our algorithm outperforms the Fuzzy Centroids, MMeR, and SDR methods on the Soybean data set [12]. Our algorithm achieves the highest local and global purity, 1.0, which means all objects are correctly classified. Since the distribution of clusters in this data set is not dense, even a quite large threshold value λ does not cause overfitting. Therefore, we can conclude that our algorithm is superior to and more efficient than the MMeR and SDR algorithms.
Fig. 2

Comparison of Soybean data set with other algorithm [12]

Experiment 2. The Zoo data set contains 101 objects. Each data point describes an animal using 18 categorical attributes, and the data set represents seven kinds of animals [10–12]. Therefore, our algorithm needs to generate seven clusters for comparison with the other algorithms. Table 7 summarizes the results of running our algorithm on the Zoo data set.
Table 7

Local purity of clusters in Zoo data

Clusters | C 1 | C 2 | C 3 | C 4 | C 5 | C 6 | C 7 | Sum | Purity
1        | 27  | 0   | 0   | 0   | 0   | 0   | 0   | 27  | 1
2        | 14  | 0   | 0   | 0   | 0   | 0   | 0   | 14  | 1
3        | 0   | 0   | 2   | 13  | 0   | 0   | 0   | 15  | 0.867
4        | 0   | 0   | 0   | 0   | 0   | 0   | 7   | 7   | 1
5        | 0   | 20  | 0   | 0   | 0   | 0   | 1   | 21  | 0.952
6        | 0   | 0   | 0   | 0   | 0   | 8   | 2   | 10  | 0.8
7        | 0   | 0   | 3   | 0   | 4   | 0   | 0   | 7   | 0.571

Similar to Table 6, each column of Table 7 (C 1, C 2, ..., C 7) represents the instances in an actual class, and each row (1, 2, ..., 7) represents the instances in a cluster predicted by the clustering algorithm on the Zoo data set.

Of the 101 objects, 93 are correctly classified into their corresponding clusters. Thus, the global purity of the clusters is 92%.

The comparison of global and local purity is shown below.

From Fig. 3, it is clear that our algorithm achieves the highest purities: local purity 88.4%, global purity 92%, and total purity 90.2%. Therefore, our algorithm performs better than the Fuzzy Centroids, MMeR, SDR, and ITDR algorithms on the Zoo data set [12]. It achieves not only higher local purity but also higher global purity, which means less misclassification during clustering.
Fig. 3

Comparison of Zoo data set with other algorithm [12]

From the above two experiments, we conclude that our algorithm is the most efficient of the compared algorithms for categorical data clustering. In particular, it is more efficient than MMeR, SDR, and ITDR, which also employ rough set theory for clustering. Therefore, our clustering algorithm is suitable for k-anonymization: it reduces information loss and provides more useful mobile social network data to governments or companies.

6 Conclusions

Social networks are growing rapidly with the development of mobile devices, which supply valuable user information to social networks, including the user's geographical location coordinates. Meanwhile, mobile social network data provide interesting opportunities to researchers and companies in many disciplines, such as sociology, psychology, marketing, or habit research. Therefore, these data are badly in need of anonymization before they are published by mobility data managers. To balance the utility of these data and the performance of anonymization, we propose a new clustering method for anonymizing mobile social network data. Our algorithm needs no initial points or cluster number from users, and it can handle the imprecision of data. Theoretical studies and experimental results show that our method is more efficient than most of the related algorithms, including MMeR, SDR, and ITDR, which are based on rough set theory. We believe our approach is useful and credible for anonymization. Future work will focus on gathering high-dimensional mobile social network data to assess the feasibility of the proposed approach.

Declarations

Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2016R1D1A1B02011964).

Competing interests

The authors declare that they have no competing interests.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

(1) Department of Internet and Multimedia Engineering, Konkuk University

References

  1. L Sweeney, k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002)
  2. L Sweeney, Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 571–588 (2002)
  3. G Aggarwal, T Feder, K Kenthapadi, S Khuller, R Panigrahy, D Thomas, A Zhu, in Proceedings of the Twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Achieving anonymity via clustering (ACM, New York, 2006), pp. 153–162
  4. X Xu, M Numao, in 2015 Third International Symposium on Computing and Networking (CANDAR). An efficient generalized clustering method for achieving k-anonymization (IEEE, Sapporo, 2015), pp. 499–502
  5. G Panda, B Tripathy, S Jha, Security aspects in mobile cloud social network services. Int. J. 2(1) (2011)
  6. J-L Lin, M-C Wei, in Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society. An efficient clustering method for k-anonymization (ACM, New York, 2008), pp. 46–50
  7. J-W Byun, A Kamra, E Bertino, N Li, in Advances in Databases: Concepts, Systems and Applications. Efficient k-anonymization using clustering techniques (Springer, Berlin, 2007), pp. 188–200
  8. Z Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2(3), 283–304 (1998)
  9. DW Kim, KH Lee, D Lee, Fuzzy clustering of categorical data using fuzzy centroids. Pattern Recognit. Lett. 25(11), 1263–1271 (2004)
  10. P Kumar, B Tripathy, MMeR: an algorithm for clustering heterogeneous data using rough set theory. Int. J. Rapid Manuf. 1(2), 189–207 (2009)
  11. B Tripathy, A Ghosh, in 2011 IEEE Recent Advances in Intelligent Computational Systems (RAICS). SDR: an algorithm for clustering categorical data using rough set theory (IEEE, Trivandrum, 2011), pp. 867–872
  12. I-K Park, G-S Choi, Rough set approach for clustering categorical data using information-theoretic dependency measure. Inf. Syst. 48, 289–295 (2015)
  13. M Ester, H-P Kriegel, J Sander, X Xu, in KDD-96. A density-based algorithm for discovering clusters in large spatial databases with noise (AAAI, Portland, 1996), pp. 226–231
  14. Z Pawlak, Rough sets. Int. J. Comput. Inf. Sci. 11(5), 341–356 (1982)
  15. Z Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, vol. 9 (Springer Science & Business Media, Berlin, 2012)
  16. TY Lin, N Cercone, Rough Sets and Data Mining: Analysis of Imprecise Data (Springer Science & Business Media, Berlin, 2012)
  17. Y Kaya, M Uyar, A hybrid decision support system based on rough set and extreme learning machine for diagnosis of hepatitis disease. Appl. Soft Comput. 13(8), 3429–3438 (2013)
  18. BB Nair, V Mohandas, N Sakthivel, A decision tree–rough set hybrid system for stock market trend prediction. Int. J. Comput. Appl. 6(9), 1–6 (2010)
  19. Y Qian, J Liang, W Pedrycz, C Dang, Positive approximation: an accelerator for attribute reduction in rough set theory. Artif. Intell. 174(9), 597–618 (2010)
  20. CE Shannon, A mathematical theory of communication. ACM SIGMOBILE Mob. Comput. Commun. Rev. 5(1), 3–55 (2001)
  21. F Jiang, Y Sui, C Cao, An information entropy-based approach to outlier detection in rough sets. Expert Syst. Appl. 37(9), 6338–6344 (2010)
  22. X Li, F Rao, A rough entropy based approach to outlier detection. J. Comput. Inf. Syst. 8(24), 10501–10508 (2012)
  23. HV Reddy, S Viswanadha Raju, P Agrawal, in 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI). Data labeling method based on cluster purity using relative rough entropy for categorical data clustering (IEEE, Mysore, 2013), pp. 500–506
  24. L Polkowski, S Tsumoto, TY Lin, Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems, vol. 56 (Physica/Springer Science & Business Media, Berlin, 2012)
  25. F Jiang, Y Sui, C Cao, A rough set approach to outlier detection. Int. J. Gen. Syst. 37(5), 519–536 (2008)
  26. F Jiang, Z Zhao, Y Ge, A supervised and multivariate discretization algorithm for rough sets, in Rough Set and Knowledge Technology (Springer, Berlin, Heidelberg, 2010)
  27. Visualizing DBSCAN Clustering. http://www.naftaliharris.com/blog/visualizing-dbscan-clustering/. Accessed Jan 2016
  28. V Panchami, N Radhika, in The Fifth International Conference on Applications of Digital Information and Web Technologies (ICADIWT 2014). A novel approach for predicting the length of hospital stay with DBSCAN and supervised classification algorithms (IEEE, Bangalore, 2014), pp. 207–212

Copyright

© The Author(s) 2016