Local differential privacy for human-centered computing

Human-centered computing in cloud, edge, and fog environments raises serious privacy concerns. Edge and fog nodes continuously generate huge amounts of data, and analyzing these data provides valuable information, but it also increases privacy risks. Personal sensitive data may be disclosed by untrusted third-party service providers, and current privacy protection solutions are inefficient and costly, which makes it difficult to obtain usable statistics. To solve these problems, we propose a local differential privacy protocol for sensitive data collection in human-centered computing. Firstly, to maintain high data utility, the optimal number of hash functions and the mapping length are selected based on the size of the collected data. Secondly, on the client side, we hash the sensitive data, add appropriate Laplace noise, and send the reports to the server side. Thirdly, on the server side, we construct the count sketch matrix to obtain privacy-preserving statistics. Finally, the utility of the proposed protocol is verified on synthetic datasets and a real dataset. The experimental results demonstrate that the protocol can achieve a balance between data utility and privacy protection.


Introduction
With the development of Internet of Things technology [1], edge and fog nodes [2], mobile phones, smart cars, wearable devices, and sensor networks have increasingly become the sources of big data [3]. Human-centered computing in cloud [4], edge, and fog has become a necessary task for enterprises and governments [5]. On the one hand, big data collection and analysis can be used to train machine learning models and to understand user group characteristics to improve user experience [6]; on the other hand, deriving sensitive data, such as user preferences, lifestyle habits, and location information [7], can result in privacy leaks. Researchers have conducted many studies on how to prevent the disclosure of personal sensitive information [8] and have proposed many privacy protocols [9].
Differential privacy (DP) [10] is a widely studied privacy-preserving model; it requires that the addition or deletion of any one record does not affect the query results. The traditional differential privacy model is deployed on the central server, and data collected from different sources are transformed into aggregate response queries for privacy protection, i.e., the central server publishes query information that satisfies differential privacy. Therefore, differential privacy is widely applied in all aspects of big data collection. For example, the US Census Bureau uses differential privacy for demographics [11].
However, in the data collection phase, there is little oversight over third-party service providers; consequently, privacy leaks occur frequently, as has happened at Facebook [12] and Snapchat [13]. Such frequent privacy disclosures have attracted the public's attention, but in practice, it is very difficult to find a trusted third-party aggregator. This difficulty limits the application of traditional differential privacy to a certain extent. Therefore, it is necessary to consider how to ensure that private information is not disclosed when there is no trusted third-party service provider.
Extensive research on differential privacy has led to local differential privacy (LDP), which builds on traditional differential privacy protection [14]. LDP can obtain valuable information by aggregating clients' perturbed reports without obtaining the real data, and it can prevent untrusted third parties from revealing private information. LDP can be applied to various data collection scenarios, such as frequency estimation, heavy hitters identification, and frequent itemset mining. Companies such as Google [15] and Apple [16] have used LDP protocols to collect users' default browser homepages and search engine settings, which helps identify harmful or malicious hijacking of user settings [17], and to find the most frequently used emojis or words.
However, the LDP model still has shortcomings with respect to big data collection, such as low accuracy, high space-time complexity, and statistical errors. Since different tasks require different LDP protocols in actual applications, determining the appropriate parameters for each protocol is difficult, which increases the cost of using LDP to protect sensitive data.
To solve these problems, we propose using the count sketch [18] and the Laplace mechanism [10] to reduce space-time complexity and computational overhead and to obtain high data utility under different distributions. The main contributions of this paper are as follows: (i) We design an LDP protocol that provides a controllable privacy guarantee level on the client side and does not require trusted third-party servers; (ii) The proposed protocol addresses the problems of large space-time overhead and low data utility and can be applied to different data distributions; (iii) Experiments show that the proposed protocol can provide usable statistical information while protecting user data privacy.
This paper is organized as follows. First, we describe related works in Section 2 and the background knowledge for this paper in Section 3. Next, we introduce the current protocols for big data collection in Section 4 and propose our method in Section 5. Then, we evaluate our method in Section 6 and analyze the results in Section 7. Finally, we summarize our work in Section 8.

Related works
Many scholars and enterprises have studied how to apply LDP in cloud, edge, and fog scenarios and how to improve the performance of LDP protocols. For example, Erlingsson et al. [15] propose the RAPPOR protocol, which uses a Bloom filter and random response to implement LDP frequency estimation in the Chrome browser. In reference [16], Apple's Differential Privacy team proposes using one-hot encoding to encode sensitive data and deploys a CMS algorithm for analyzing the most popular emojis and media playback preferences in Safari. Fanti et al. [19] propose the unknown-RAPPOR protocol to estimate frequency without a data dictionary. Ding et al. [20] propose an algorithm to solve the problem of privacy disclosure when repeatedly collecting telemetry data and apply the algorithm to Microsoft-related products. Wang et al. [14] propose the Harmony protocol to achieve LDP protection. These authors compute the means of numerical attributes, and the protocol is deployed in Samsung's system software. Wang et al. [21] use an LDP protocol to answer private multidimensional queries on Alibaba's e-commerce transaction records. LDP has thus been deployed in industry, where it has created practical benefits.
One of LDP's important applications is frequency estimation. Wang et al. [22] propose the OLH algorithm to obtain lower estimation error and to reduce the communication overhead in a larger domain; the algorithm can also be used in heavy hitters identification. The Hadamard response [23] uses the Hadamard transform instead of a hash function; because Hadamard entries are easier to compute, the server side aggregates reports faster. Joseph et al. [24] propose a technique for repeatedly recomputing a statistic with an error that depends on the number of times the statistic changes significantly rather than on the number of times its current value is recomputed. Wang et al. [25] introduce a method that adds postprocessing steps to frequency estimations to make them consistent while achieving high accuracy for a wide range of tasks. Most of the LDP protocols for frequency estimation are implemented by random response, which results in low accuracy of the estimation results. Thus, the goal of this paper is to investigate mechanisms that can achieve LDP with high data utility.
Recent LDP studies have also focused on other applications [26,27]. Bassily et al. [28] propose the S-HIST algorithm for histogram release and utilize random projection to further reduce the communication cost. Wang et al. [22] propose identifying heavy hitters in datasets under LDP protection. Ren et al. [29] apply the LDP model to solve the privacy problem in collecting high-dimensional crowdsourced data. Wang et al. [30] propose privacy amplification by multiparty differential privacy, which introduces an auxiliary server between the client side and the server side. Ye et al. [31] propose PrivKV, which can estimate the mean and frequency of key-value data, and PrivKVM, which can improve estimation accuracy through multiple iterations. In summary, many LDP protocols have been proposed to solve privacy issues in cloud, edge, and fog computing scenarios.

Differential privacy
Differential privacy requires that any single tuple in the dataset have only a limited impact on the output. For example, for two datasets D and D′ that differ by only one tuple, the attacker cannot infer the sensitive information of a specific tuple from the query results, so it is impossible to know whether the data of a certain user exist in the dataset. The definition of differential privacy is as follows:
Definition 1 ε-differential privacy [10]. For ε > 0, a randomized mechanism M satisfies ε-differential privacy if, for all datasets D and D′ that differ in at most one tuple and all S ⊆ Range(M), we have
$$\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]. \tag{1}$$

Local differential privacy
Local differential privacy requires any two tuples to be indistinguishable. For example, for any two tuples x and x′, the attacker cannot infer the sensitive information of a specific tuple from the query results, so it is impossible to know which tuple was the input. ε-local differential privacy is defined as follows:
Definition 2 ε-local differential privacy [14]. For ε > 0, a randomized mechanism M satisfies ε-local differential privacy if and only if, for any two input tuples x and x′ in the domain of M and for any possible output x* of M, we have
$$\Pr[M(x) = x^*] \le e^{\varepsilon} \cdot \Pr[M(x') = x^*]. \tag{2}$$
It follows from the definition of ε-local differential privacy that the output distributions of the randomized mechanism for any pair of input tuples are similar, so the specific input tuple cannot be inferred from the output. A smaller privacy budget ε ensures a higher privacy level, but repeating queries on the same tuple consumes ε, thereby decreasing the level of privacy. Therefore, the choice of ε needs to be determined according to the specific scenario.

Sensitivity and Laplace mechanism
Differential privacy implements privacy protection by adding noise to the query results, and the amount of noise added should not only protect user privacy but also maintain data utility. Therefore, sensitivity becomes a key parameter of noise control. In local differential privacy, sensitivity is based on a query function of any two tuples; the following definitions are given.
Definition 3 Local sensitivity [32]. For f: D^n → R^d and x ∈ D^n, the local sensitivity of f at x (with respect to the ℓ1 metric) is
$$LS_f(x) = \max_{y:\, d(x,y)=1} \lVert f(x) - f(y) \rVert_1, \tag{3}$$
where d(x, y) = 1 means that x and y differ in exactly one entry. The notion of local sensitivity is a discrete analog of the Laplacian (or maximum magnitude of the partial derivative in different directions).
The Laplace mechanism of differential privacy [10] adds noise drawn from the Laplace distribution to the original data to implement differential privacy protection; therefore, we have the following definition.
Definition 4 Laplace mechanism [33]. An algorithm A takes as input a dataset D, some ε > 0, and a query Q with computing function f: D^n → R^d, and outputs
$$A(D) = f(D) + (Y_1, \ldots, Y_d),$$
where the Y_i are drawn i.i.d. from Lap(LS_f(x)/ε), the Laplace distribution with scale parameter LS_f(x)/ε. For ease of expression, we denote Δs as the local sensitivity and Lap(λ) as a random variable that obeys the Laplace distribution of scale λ. The corresponding probability density function is
$$\mathrm{pdf}(y) = \frac{1}{2\lambda} \exp\!\left(-\frac{|y|}{\lambda}\right).$$
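As an illustration of Definition 4, the following minimal sketch adds Laplace noise of scale Δs/ε to a query answer with NumPy. The function names laplace_pdf and laplace_mechanism and the example values are assumptions chosen for illustration; they do not come from the paper's implementation.

```python
import numpy as np

def laplace_pdf(y, scale):
    """Density of Lap(scale) at y: exp(-|y| / scale) / (2 * scale)."""
    return np.exp(-np.abs(y) / scale) / (2.0 * scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Add i.i.d. Lap(sensitivity / epsilon) noise to every coordinate."""
    true_answer = np.asarray(true_answer, dtype=float)
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=true_answer.shape)
    return true_answer + noise

# Fixed seed only so that the illustration is reproducible.
rng = np.random.default_rng(0)
noisy = laplace_mechanism([10.0, 3.0, 7.0], sensitivity=2.0, epsilon=1.0, rng=rng)
print(noisy)
```

A smaller ε gives a larger scale Δs/ε and therefore noisier answers, which matches the privacy-utility trade-off described above.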

System structure
Local differential privacy can be seen as a special case of differential privacy [34]. Compared with the perturbation process in DP, the perturbation process in LDP shifts from the server side to the client side. The privacy leakage threat from untrusted third-party servers is eliminated because trusted third-party servers are not required. The collection process consists of the following main parts: (i) The encoding is performed by the client side; each tuple is encoded into a suitable vector so that it can be perturbed; (ii) The perturbation is performed by the client side; each piece of encoded data is turned into a perturbed report by a random function that satisfies the definition of ε-local differential privacy, and the client side then sends these perturbed reports to the server; (iii) The aggregation is performed by the server side, which aggregates the reports from the client side and generates usable statistics, as shown in Fig. 1.

Problem setting
To use the LDP model to analyze and protect the collected data, scholars have proposed many privacy protection schemes for frequency estimation. However, these solutions still have deficiencies, such as high computational overhead and low data utility. Therefore, we propose a modified solution that addresses these deficiencies to further improve data utility and algorithm accuracy.
The random response technique was proposed by Warner [35]. For each piece of collected private data v ∈ D, the user sends the true value of v with probability p and sends a value v′ selected uniformly at random from D\{v} with probability 1 − p. Assuming that domain D contains d = |D| values, the perturbation function is as follows:
$$\Pr[\text{report} = v'] = \begin{cases} p = \dfrac{e^{\varepsilon}}{e^{\varepsilon} + d - 1}, & v' = v, \\[2ex] q = \dfrac{1}{e^{\varepsilon} + d - 1}, & v' \ne v. \end{cases}$$
Since p/q = e^ε, the definition of ε-local differential privacy is satisfied.
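The random response step can be sketched as follows, using only the Python standard library. The helper names (grr_probabilities, grr_perturb, grr_estimate) are hypothetical, and the debiasing formula in grr_estimate is the standard unbiased estimator for this perturbation, which the text above does not state explicitly; treat it as an assumption.

```python
import math
import random
from collections import Counter

def grr_probabilities(epsilon, d):
    """Return (p, q): probability of reporting the true value / one specific other value."""
    denom = math.exp(epsilon) + d - 1
    return math.exp(epsilon) / denom, 1.0 / denom

def grr_perturb(v, domain, epsilon, rng):
    """Report v with probability p, otherwise a uniformly chosen other value."""
    p, _ = grr_probabilities(epsilon, len(domain))
    if rng.random() < p:
        return v
    return rng.choice([x for x in domain if x != v])

def grr_estimate(reports, domain, epsilon):
    """Debias observed counts: E[count_x] = f_x * p + (n - f_x) * q."""
    n = len(reports)
    p, q = grr_probabilities(epsilon, len(domain))
    counts = Counter(reports)
    return {x: (counts[x] - n * q) / (p - q) for x in domain}

rng = random.Random(0)                      # fixed seed for reproducibility
domain = list(range(5))
data = [0] * 600 + [1] * 300 + [2] * 100    # illustrative private values
reports = [grr_perturb(v, domain, 1.0, rng) for v in data]
estimates = grr_estimate(reports, domain, 1.0)
print(estimates)
```

Because the estimator subtracts the expected contribution of false reports, an individual estimate can be negative for rare values, which is one of the deficiencies discussed later in this section.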

Optimal local hash (OLH)
The OLH protocol was proposed in [22] to address the problem of large category attribute domains. First, the client-side algorithm uses a hash function to map the user's true value v to a smaller hash value domain of size g. Then, the algorithm applies random response to the hash value in this smaller domain. The parameter g controls the trade-off between the information lost in the hashing step and in the randomization step; the trade-off is optimal when g = e^ε + 1. The time complexity of the algorithm is O(log n), and the space complexity is O(n log|D|).
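A minimal sketch of the two OLH steps is given below. It makes several assumptions that are not in the text: g is rounded to an integer, SHA-256 from hashlib stands in for the hash family (the experiments in this paper use xxhash), and the server-side estimator is the usual support-count debiasing for local hashing. All function names are hypothetical.

```python
import hashlib
import math
import random

def hash_to_range(value, seed, g):
    """Stand-in seeded hash mapping value to {0, ..., g-1}."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % g

def olh_g(epsilon):
    """g = e^eps + 1, rounded to an integer (rounding is an assumption)."""
    return int(round(math.exp(epsilon))) + 1

def olh_perturb(v, epsilon, rng):
    """Hash v into {0, ..., g-1}, then apply random response on the hash value."""
    g = olh_g(epsilon)
    seed = rng.getrandbits(32)               # identifies the user's hash function
    x = hash_to_range(v, seed, g)
    p = math.exp(epsilon) / (math.exp(epsilon) + g - 1)
    if rng.random() < p:
        y = x
    else:
        y = rng.choice([c for c in range(g) if c != x])
    return seed, y

def olh_estimate(reports, domain, epsilon):
    """Count reports that support each candidate value, then debias."""
    g = olh_g(epsilon)
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + g - 1)
    estimates = {}
    for d in domain:
        support = sum(1 for seed, y in reports if hash_to_range(d, seed, g) == y)
        estimates[d] = (support - n / g) / (p - 1.0 / g)
    return estimates

rng = random.Random(1)
data = ["a"] * 1200 + ["b"] * 600 + ["c"] * 200
reports = [olh_perturb(v, 2.0, rng) for v in data]
estimates = olh_estimate(reports, ["a", "b", "c"], 2.0)
print(estimates)
```

Each report consists only of the hash seed and one value in {0, ..., g-1}, which is why the communication cost stays small even when the attribute domain is large.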

Randomized aggregatable privacy-preserving ordinal response (RAPPOR)
The RAPPOR protocol [15] is deployed in Google's Chrome browser. In the RAPPOR protocol, the user's real value v is encoded into a bit vector B. When the attribute has numerous categories, a direct encoding causes problems such as a high communication cost and low accuracy. Therefore, RAPPOR uses a Bloom filter for encoding: the value v is mapped to positions in the bit vector B by k hash functions, the corresponding positions are set to 1, and the remaining positions are set to 0. After encoding, RAPPOR applies a perturbation function to obtain the perturbed bit vector B′.
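The Bloom filter encoding can be illustrated with the short sketch below. SHA-256 is used as a stand-in for the k hash functions, and flip_bits is a deliberately simplified single-stage bit flip with a caller-supplied flip probability; it is not RAPPOR's actual perturbation function, whose details are not given in this section, and it makes no privacy claim on its own.

```python
import hashlib
import random

def bloom_encode(value, k, m):
    """Set the positions chosen by k stand-in hash functions in a length-m bit vector."""
    bits = [0] * m
    for j in range(k):
        digest = hashlib.sha256(f"{j}:{value}".encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % m] = 1
    return bits

def flip_bits(bits, flip_prob, rng):
    """Flip each bit independently with probability flip_prob (illustrative only)."""
    return [1 - b if rng.random() < flip_prob else b for b in bits]

rng = random.Random(0)
B = bloom_encode("https://example.com", k=2, m=32)
B_perturbed = flip_bits(B, flip_prob=0.25, rng=rng)
print(B)
print(B_perturbed)
```

Two hash functions may collide on the same position, so the encoded vector contains between 1 and k ones.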

Hadamard Count Mean Sketch (HCMS)
The Hadamard Count Mean Sketch protocol was proposed by Apple's Differential Privacy Team [16] in 2016 to complete large-scale data collection with LDP and to obtain accurate counts. By applying the Hadamard transform to the sparse vector, each user needs to send only a single private bit. A tuple x sent by a given user belongs to a set of values D; j is an index selected at random from the k hash functions, and l is an index selected at random from the m bits of the hash map domain. Algorithm 1 shows the client's perturbation process. First, each user initializes the vector v and sets to 1 the position to which the j-th hash function maps the attribute value d, so that v is a one-hot vector. Second, the algorithm takes the l-th bit of the transformed vector, denoted as w_l ∈ {−1, 1}, and flips it with a probability of 1/(e^ε + 1). Finally, the client side sends the report s = {w_l, j, l} to the server. The time complexity of the algorithm is O(n + km log(m) + |D|k), and the space complexity is O(log(k) + log(m) + 1).
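The client-side steps described above can be sketched as follows. The sketch assumes that m is a power of two, that the Hadamard matrix is the Sylvester construction (so that its entry at row l and column c is (−1) raised to the number of common 1-bits of l and c), and that SHA-256 stands in for the k hash functions. The names hadamard_entry and hcms_client are hypothetical.

```python
import hashlib
import math
import random

def hash_to_range(value, j, m):
    """Stand-in for the j-th hash function, mapping value to {0, ..., m-1}."""
    digest = hashlib.sha256(f"{j}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m

def hadamard_entry(row, col):
    """Entry (+1 or -1) of the Sylvester Hadamard matrix at (row, col)."""
    return -1 if bin(row & col).count("1") % 2 else 1

def hcms_client(d, epsilon, k, m, rng):
    """Return the report (w_l, j, l) for one user's value d."""
    if m & (m - 1):
        raise ValueError("m must be a power of two for the Hadamard transform")
    j = rng.randrange(k)                 # randomly selected hash function index
    l = rng.randrange(m)                 # randomly selected coordinate
    pos = hash_to_range(d, j, m)         # position of the 1 in the one-hot vector v
    w_l = hadamard_entry(l, pos)         # l-th coordinate of the transformed v
    if rng.random() < 1.0 / (math.exp(epsilon) + 1.0):
        w_l = -w_l                       # flip with probability 1 / (e^eps + 1)
    return w_l, j, l

rng = random.Random(0)
report = hcms_client("emoji_42", epsilon=2.0, k=128, m=1024, rng=rng)
print(report)
```

Because v is one-hot, the transformed vector equals one column of the Hadamard matrix, so the client never needs to build the full matrix or the full vector.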
Algorithm 2 shows the server aggregation process. First, it takes each report w^(i) and transforms it to x^(i). Then, the server constructs the sketch matrix M_H and adds x^(i) to row j^(i), column l^(i) of M_H. Next, it uses the transposed Hadamard matrix to transform the rows of the sketch back. Finally, the server estimates the count of each entry d ∈ D by debiasing the count and averaging over the corresponding hash entries in M_H.

Deficiencies of current protocols
Many protocols have been proposed to protect privacy, but they still have deficiencies. First, LDP protocols have very strict requirements on parameter selection and on the size of the data. For example, the choice of parameters k and m in the HCMS algorithm greatly influences data variance and utility, and different tasks need different suitable parameters. Second, the RAPPOR and CMS algorithms have large space-time complexity and a high communication cost. For data collection in cloud, edge, and fog scenarios, this problem makes computation highly inefficient. Third, due to the use of random response techniques, the frequency of a rare data value can even be estimated as negative. Finally, when privacy-preserving data come from different distributions, data utility varies greatly, and it is difficult to fit these data to different tasks.

Design method to address deficiencies
Because of these shortcomings of current LDP protocols, we use the Laplace mechanism to relax the strict data size requirement of random response techniques, and we construct a count sketch matrix for aggregation to reduce space-time complexity. The protocol consists of a local perturbator and an aggregator.
The local perturbator is designed on the client side to perturb the raw data. When a user generates data, the local perturbator selects a random hash function to encode the data as a one-hot vector and adds Laplace noise to the mapping location. Then, the local perturbator sends a report containing the selected hash function index and the noised mapping location to the central server. Since the client-side algorithm satisfies the LDP definition, even if the adversary has relevant background knowledge and acquires another user's data, the adversary cannot infer which data belong to the user. The process is shown in Fig. 2 a.
The aggregator is designed on the server side to aggregate the reports. When the central server has received all the perturbed reports from the client side, it aggregates them through the aggregator. The aggregator constructs the count sketch matrix and accumulates the number of reports at the mapping position of each attribute value under the different hash functions. The server side obtains the frequency estimate of each data value from the matrix counts. Taking the data entry d as an example, the aggregator reads the count at the mapping position of d under each hash function and sums these counts, as shown in Fig. 2.

Methods
For human-centered computing in cloud, edge, and fog scenarios, there are generally many users and one data service provider in the LDP model. Therefore, the proposed protocol designs the client and server algorithm for the users and service providers, respectively. The client-side algorithm perturbs the user's raw data and sends an incomplete report; each user sends the perturbed report to the unique service provider. When the service provider receives the user's perturbed reports, the server-side algorithm aggregates the reports to obtain the available statistics.

Client-side algorithm design
The client-side algorithm is designed to prevent user data leakage by ensuring that the perturbed data obey local differential privacy. The raw data are first encoded by a randomly selected hash function, and then the Laplace mechanism is applied to perturb the hash map location. The server passes the parameters to the client before the algorithm is deployed; these parameters include the privacy budget ε, the number of hash functions k, and the length m of the hash mapping vector. According to equation (3), any two different pieces of data differ in at most two positions of the one-hot encoding vector v, so the local sensitivity used for the Laplace noise is 2. The report sent by the algorithm includes two parts: a randomly selected hash function index j and a noised hash map position l′. For convenience, this algorithm is named the Laplace Count Sketch (LCS) client-side algorithm.
The specific steps are shown in Algorithm 3. Line 1 of Algorithm 3 initializes an all-zero vector of length m, and in lines 2-4, the algorithm randomly selects the hash function and sets the hash mapping position on vector v, where h_j(d) denotes hashing data d with the j-th hash function. Then, Laplace noise is added to the mapping position in lines 5-7. Since the mapping position is an integer, the noise value is rounded. Because the length of the mapping vector is m, the noised position l′ is wrapped around: l′ is replaced by l′ − m if l′ is greater than or equal to m, and by l′ + m if l′ is less than 0. The algorithm sends the perturbed report s_i in line 8. Each user sends the perturbed report with O(1) time complexity and O(k + m) space complexity; therefore, the time complexity of the client-side algorithm is O(n), and its space complexity is O(n(k + m)).
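The client-side steps described above might be sketched as follows. This is an illustrative reading of the description rather than the paper's Algorithm 3: SHA-256 stands in for the hash functions h_j, the names hash_to_range and lcs_client are hypothetical, and the wrap-around is written with the modulo operator, which coincides with the add-or-subtract-m rule whenever the rounded noise is smaller than m in absolute value and is an assumption otherwise.

```python
import hashlib
import numpy as np

def hash_to_range(value, j, m):
    """Stand-in for the j-th hash function h_j, mapping value to {0, ..., m-1}."""
    digest = hashlib.sha256(f"{j}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m

def lcs_client(d, epsilon, k, m, rng):
    """Return the perturbed report (j, l_noisy) for one user's value d."""
    j = int(rng.integers(k))                  # uniformly chosen hash function index
    l = hash_to_range(d, j, m)                # position of the 1 in the one-hot vector
    scale = 2.0 / epsilon                     # sensitivity 2, as stated in the text
    noise = int(np.rint(rng.laplace(0.0, scale)))  # integer offset after rounding
    l_noisy = (l + noise) % m                 # wrap around into {0, ..., m-1}
    return j, l_noisy

rng = np.random.default_rng(0)                # fixed seed for reproducibility
reports = [lcs_client("bachelor", epsilon=2.0, k=128, m=128, rng=rng)
           for _ in range(1000)]
print(reports[:5])
```

Only the pair (j, l′) leaves the client, so the report size does not depend on the size of the attribute domain.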

Server-side algorithm design
After collecting the perturbed reports from different users, the server side constructs the count sketch matrix using the same parameters as those used by the client side. First, the server-side algorithm constructs an all-zero matrix of size k × m. Second, the algorithm increments the matrix entry indexed by each report. Third, after the count sketch matrix is complete, the algorithm looks up, for each data value, its mapping position in the row of each hash function and adjusts the sketch counts according to the Laplace distribution. Finally, the algorithm combines the counts at these positions to obtain the frequency statistics of each attribute value.
The specific steps are shown in Algorithm 4. Line 1 of Algorithm 4 initializes an all-zero matrix of size k × m. In line 2, the algorithm processes the n collected reports and adds 1 to the entry at the corresponding row and column of the matrix. Then, in line 3, the algorithm uses the count sketch to read the matrix value of each data value at its position under each hash function and estimates the frequency of each attribute value. The time and space complexities of the server-side algorithm are O(n + |D|k) and O(km), respectively.
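The accumulation and lookup steps can be sketched as below. The sketch covers only the construction of the k × m matrix and the raw per-value sums; the adjustment of the counts according to the Laplace distribution is not specified in enough detail in this section to reproduce, so it is deliberately left out. The hash helper is a hypothetical stand-in and must be the same one used on the client side.

```python
import hashlib
import numpy as np

def hash_to_range(value, j, m):
    """Stand-in for the j-th hash function h_j; must match the client side."""
    digest = hashlib.sha256(f"{j}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m

def lcs_build_sketch(reports, k, m):
    """Add 1 at (row j, column l_noisy) for every report (j, l_noisy)."""
    sketch = np.zeros((k, m), dtype=np.int64)
    for j, l_noisy in reports:
        sketch[j, l_noisy] += 1
    return sketch

def lcs_raw_counts(sketch, domain):
    """Sum, over all hash functions, the count at each value's mapped position."""
    k, m = sketch.shape
    return {d: int(sum(sketch[j, hash_to_range(d, j, m)] for j in range(k)))
            for d in domain}

# Noise-free toy input: one report per hash function, all for the value "a".
k, m = 4, 16
reports = [(j, hash_to_range("a", j, m)) for j in range(k)]
sketch = lcs_build_sketch(reports, k, m)
raw = lcs_raw_counts(sketch, ["a"])
print(raw)
```

Building the matrix takes one pass over the n reports, and the lookup visits k entries per attribute value, which agrees with the O(n + |D|k) time complexity stated above.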

Privacy and utility analysis
This section discusses the privacy and utility of the proposed protocol. We first prove that the LCS protocol satisfies the definition of local differential privacy and then analyze the variance of the algorithm theoretically; a smaller variance ensures higher data utility.
Theorem 1 The LCS protocol satisfies the definition of ε-local differential privacy.
Proof. Given any pair of input tuples x and x′ and any possible output x*, let p_x denote the probability density function of A(x) and p_x′ denote the probability density function of A(x′). Comparing the two densities at x* gives
$$\frac{p_x(x^*)}{p_{x'}(x^*)} = \frac{\exp\left(-\varepsilon\,|f(x) - x^*| / \Delta s\right)}{\exp\left(-\varepsilon\,|f(x') - x^*| / \Delta s\right)} \le \exp\!\left(\frac{\varepsilon\,|f(x) - f(x')|}{\Delta s}\right).$$
Since the sensitivity Δs = 2, the maximum difference between the values of the functions f(x) and f(x′) is 2; that is, |f(x) − f(x′)| has a value range of [0, 2], so |f(x) − f(x′)|/Δs ≤ 1 and the ratio is at most e^ε. Therefore, the definition of local differential privacy is satisfied.
We now derive the variance of the LCS algorithm. Denote the estimated frequency by f′(d) and the real frequency by f(d), with E(f′(d)) = f(d).
The larger the parameters k and m are, the smaller the variance and the higher the utility are. However, space-time complexity increases when k and m are very large.

Experimental section

Experimental datasets
The experiments use three datasets: two synthetic datasets and one real dataset; each dataset consists of a one-dimensional categorical attribute. The real dataset is taken from the 2017 Integrated Public Use Microdata Series [36] (US), from which we select the education level (EDU) attribute with 25 data categories; we extract 1% of the dataset and take the first million records as the experimental dataset. The two synthetic datasets follow the uniform distribution and the Zipf distribution, respectively. The parameter a of the Zipf distribution is set to 1.2, and each synthetic dataset contains 100,000 records.
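One possible way to generate comparable synthetic datasets with NumPy is sketched below. The text does not say how the Zipf distribution is restricted to a finite attribute domain, so the sketch assumes that the probability of the value with rank r is proportional to r^(−a); the function names and the domain size of 50 are illustrative.

```python
import numpy as np

def uniform_dataset(n, domain_size, rng):
    """n values drawn uniformly from {0, ..., domain_size - 1}."""
    return rng.integers(0, domain_size, size=n)

def zipf_dataset(n, domain_size, a, rng):
    """n values with P(rank r) proportional to r ** (-a), r = 1..domain_size."""
    ranks = np.arange(1, domain_size + 1, dtype=float)
    probs = ranks ** (-a)
    probs /= probs.sum()                      # normalize to a probability vector
    return rng.choice(domain_size, size=n, p=probs)

rng = np.random.default_rng(0)                # fixed seed for reproducibility
uniform_data = uniform_dataset(100_000, 50, rng)
zipf_data = zipf_dataset(100_000, 50, 1.2, rng)
print(np.bincount(zipf_data, minlength=50)[:5])
```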

Experimental competitors
We choose OLH as a comparison protocol because it gives near-optimal utility when the communication bandwidth is reasonable. In addition, we choose HCMS as another comparison protocol, which reduces the communication overhead by sending a single private bit at a time.

Experimental implementation
These protocols were implemented in Python 3.7 with the NumPy and xxhash libraries and were run on a PC with an Intel Core i7-7700HQ CPU and 16 GB RAM. Each experiment was repeated ten times to reduce the influence of randomness on the experimental results.

Experimental metrics
To analyze the utility of our protocol for different parameters and scenarios, we compare the error between the true distribution and the estimated distribution for frequency using the mean absolute percentage error (MAPE). For each data value, we calculate the absolute difference between the estimated and true frequency, divide it by the true frequency, then sum these values and divide by the size of the data value domain. The definition of MAPE is as follows:
$$\mathrm{MAPE} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|x_i - y_i|}{y_i},$$
where |D| is the category attribute domain size, y_i is the real frequency of the i-th attribute value, and x_i is the estimated frequency of the i-th attribute value. The smaller the MAPE value is, the closer the estimated distribution is to the real distribution, and the better the data utility is.
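The metric described above can be computed with a few lines of Python. The function name mape is an assumption, and the sketch presumes that every true frequency is non-zero, since the metric divides by it.

```python
def mape(true_freq, est_freq):
    """Mean absolute percentage error between true and estimated frequencies."""
    if len(true_freq) != len(est_freq):
        raise ValueError("both sequences must cover the same attribute domain")
    total = sum(abs(x - y) / y for x, y in zip(est_freq, true_freq))
    return total / len(true_freq)

# Two attribute values, each estimated with a 10% relative error.
error = mape([100, 200], [110, 180])
print(error)
```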

Effects of privacy budgets
We validate the effects of the privacy budget ε on the data utility of the LCS protocol through experiments. We select the HCMS and OLH protocols as the control group and use the uniform and Zipf synthetic datasets, in which the categorical attribute domain size is 50.
For the uniform dataset, we set the number of hash functions k = 128 and the hash map length m = 128 and measure the MAPE values of the three protocols as the privacy budget ε varies. As shown in Fig. 3 a, as the privacy budget ε increases, the MAPE value of all three protocols decreases; that is, the data utility increases when the privacy budget increases. The data utility of the LCS protocol is significantly better than that of HCMS and slightly lower than that of OLH when ε < 2.5, and it is higher than that of the other protocols when ε > 2.5. Then, we adjust the parameters of the previous experiment and verify the effects of the privacy budget ε on the synthetic dataset satisfying the Zipf distribution; we select k = 256 and m = 512, as shown in Fig. 3 b. The data utility of LCS is superior to that of the HCMS protocol throughout the experiments, and the data utility of LCS is marginally lower than that of the OLH protocol when ε > 3.
Next, we validate the effects of the privacy budget ε on utility in a real dataset and perform experiments on the IPUMS dataset; we choose parameters k = 256 and m = 128. In Fig. 3 c, the experimental result shows that the utility of the LCS protocol is higher than that of HCMS and lower than that of OLH. This verifies that the LCS protocol is also feasible in practical applications and that its utility is better than that of HCMS.

Effects of data sizes
Current LDP protocols exhibit dramatic changes in data utility when collecting data of different sizes. To verify the utility of the LCS protocol at different data sizes, we compare the LCS protocol with the HCMS and OLH protocols on uniformly distributed synthetic data while varying the data size. The parameters are set to m = 128, k = 1024, and ε = 2. As shown in Fig. 4 a, when the data size n is small (for example, n = 1000), the OLH and HCMS protocols have large errors and very low utility, so they are not usable. In contrast, LCS maintains better data utility at the different data sizes. Next, we verify the utility variation under different data sizes on the synthetic dataset satisfying the Zipf distribution; we adjust the parameter settings to k = 256, m = 512, and ε = 2 and again select the HCMS and OLH protocols for comparison. In Fig. 4 b, HCMS and OLH are not usable when the data size is small, and the utility of LCS is better across the tested data sizes.

Frequency estimation
When the privacy budget parameter ε is small, too much noise is added during the perturbation process, which results in the estimated frequency being much lower than the original frequency. We set up experiments to observe the frequency estimation under different datasets. To clearly show the frequency distribution trend and facilitate observation, we calculate the estimated frequency of each attribute value and multiply it by the original data size to obtain the estimated count. The parameters of LCS and HCMS are both set to k = 128, m = 1024, and ε = 2; in addition, we choose the Zipf synthetic dataset and the IPUMS dataset as experimental datasets. The domain sizes are 15 and 25, respectively. Figure 5 a shows that the estimated frequencies of the LCS and OLH protocols are close to the true value, whereas the estimates of the HCMS protocol fluctuate strongly overall. Figure 5 b shows that all protocols fluctuate, but LCS has a small overall fluctuation.
When the LDP protocols process different datasets, changes in the size of the data domain |D| also affect the estimated frequency. We generate uniformly distributed datasets with different data domain sizes and test the protocol utility under these domain sizes; we select k = 256, m = 512, and ε = 2. As shown in Fig. 6, the utility of the LCS protocol increases when the data domain size increases, whereas the utility of the HCMS and OLH protocols decreases when the data domain size increases.

Runtime
To verify the time complexity of the proposed protocol, we record its runtime in seconds on subsets of the real dataset of different sizes. We choose k = 256, m = 128, and ε = 2. In Table 1, the experimental result shows that the LCS protocol has a shorter runtime than OLH and HCMS at the different sizes of the real dataset. These measurements indicate that the LCS protocol has a much lower time cost in practice.

Conclusion
This paper focuses on human-centered computing in cloud, edge, and fog and analyzes ε-local differential privacy models without a trusted server. Current LDP protocols suffer from low utility and strict data size requirements. We discuss the reasons for these deficiencies and propose the Laplace Count Sketch protocol, which can not only protect sensitive data on the client side but also ensure high accuracy and utility. The experimental results show that the proposed protocol has high utility, is suitable for different sizes of datasets, and maintains its utility under different distributions and data domain sizes. The data dictionary for the datasets used in this paper is known; however, the proposed protocol cannot handle datasets with unknown data dictionaries. The next step is to study how to solve this problem, achieve better privacy protection, and protect sensitive data in human-centered computing.