Private and Rateless Adaptive Coded Matrix-Vector Multiplication

Edge computing is emerging as a new paradigm to allow processing data near the edge of the network, where the data is typically generated and collected. This enables critical computations at the edge in applications such as Internet of Things (IoT), in which an increasing number of devices (sensors, cameras, health monitoring devices, etc.) collect data that needs to be processed through computationally intensive algorithms with stringent reliability, security and latency constraints. Our key tool is the theory of coded computation, which advocates mixing data in computationally intensive tasks by employing erasure codes and offloading these tasks to other devices for computation. Coded computation is recently gaining interest, thanks to its higher reliability, smaller delay, and lower communication costs. In this paper, we develop a private and rateless adaptive coded computation (PRAC) algorithm for distributed matrix-vector multiplication by taking into account (i) the privacy requirements of IoT applications and devices, and (ii) the heterogeneous and time-varying resources of edge devices. We show that PRAC outperforms known secure coded computing methods when resources are heterogeneous. We provide theoretical guarantees on the performance of PRAC and its comparison to baselines. Moreover, we confirm our theoretical results through simulations and implementations on Android-based smartphones.

the edge applications, where connectivity to remote servers can be lost or compromised, which makes edge computing crucial.
Edge computing advocates that computationally intensive tasks in a device (master) could be offloaded to other edge or end devices (workers) in close proximity. However, offloading tasks to other devices leaves the IoT and the applications it is supporting at the complete mercy of an attacker. Furthermore, exploiting the potential of edge computing is challenging mainly due to the heterogeneous and time-varying nature of the devices at the edge. Thus, our goal is to develop a private, dynamic, adaptive, and heterogeneity-aware cooperative computation framework that provides both privacy and computation efficiency guarantees. Note that the application of this work can be extended to cloud computing at remote data-centers. However, we focus on edge computing as heterogeneity and time-varying resources are more prevalent at the edge as compared to data-centers.
We assume the following setup throughout the paper. A master device referred to as master wishes to offload intensive computations to n helper devices referred to as workers. The workers cannot be trusted with the privacy of the master's data but will run the required computations. The workers have different computation and network resources that change with time which creates a time-varying and heterogeneous computing environment. Workers that are slow or unresponsive are termed as stragglers. The goal of the master is to assign tasks that are proportional to the overall speed of the workers and tolerate the presence of straggling workers. This will be done by assigning a new task to a worker when it is expected to have finished its previous task.
Example 1 Consider the setup where a master device wishes to offload a task to 3 workers. The master has a large data matrix $A$ and wants to compute the matrix-vector product $Ax$. The master device divides $A$ row-wise into two equal-sized smaller matrices $A_1$ and $A_2$, which are then encoded using a (3,2) Maximum Distance Separable (MDS) code¹ to give $B_1 = A_1$, $B_2 = A_2$ and $B_3 = A_1 + A_2$, and sends each to a different worker. The master also sends $x$ to the workers and asks them to compute $B_i x$, $i \in \{1, 2, 3\}$. When the master receives the computed values (i.e., $B_i x$) from at least two out of the three workers, it can decode its desired result $Ax$. The power of coded computation is that $B_3 = A_1 + A_2$ acts as a "joker" redundant task that can replace either of the other two tasks if one of them straggles or fails.
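The decoding step of Example 1 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it uses integer NumPy arrays instead of a finite field, and the function names `encode_3_2` and `decode_3_2` are hypothetical.

```python
import numpy as np

def encode_3_2(A):
    """Split A row-wise into A1, A2 and return the coded tasks [A1, A2, A1 + A2]."""
    A1, A2 = np.split(A, 2, axis=0)  # assumes an even number of rows
    return [A1, A2, A1 + A2]

def decode_3_2(results):
    """Recover Ax from any two worker results, given as {worker_index: B_i x}."""
    if len(results) < 2:
        raise ValueError("need results from at least two workers")
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:            # workers 1 and 3 answered: A2 x = B3 x - A1 x
        y1 = results[0]
        y2 = results[2] - y1
    else:                         # workers 2 and 3 answered: A1 x = B3 x - A2 x
        y2 = results[1]
        y1 = results[2] - y2
    return np.concatenate([y1, y2])

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(4, 3))
x = rng.integers(0, 10, size=3)
tasks = encode_3_2(A)
# Suppose worker 1 (index 0) straggles, so only workers 2 and 3 respond.
partial = {i: tasks[i] @ x for i in (1, 2)}
recovered = decode_3_2(partial)
```

Any two of the three results suffice, which is exactly the "joker" behavior of the redundant task $B_3$.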
The above example demonstrates the benefit of coding for edge computing. However, the very nature of task offloading from a master to worker devices makes the computation framework vulnerable to attacks. One such attack, which is the focus of this work, is the eavesdropper adversary, where one or more workers behave as eavesdroppers and spy on the coded data sent to them for computation.² For example, $B_3 = A_1 + A_2$ in Example 1 can be processed and spied on by worker 3. Even though $A_1 + A_2$ is coded, the attacker can infer some information from this coded task. Privacy against eavesdropper attacks is extremely important in edge computing [15][16][17]. Thus, it is crucial to develop a private coded computation mechanism against an eavesdropper adversary who can gain access to offloaded tasks.
In this paper, we develop a private and rateless adaptive coded computation (PRAC) mechanism. PRAC is (1) private as it is secure against eavesdropper adversary, (2) rateless, because it uses fountain codes [18][19][20] instead of Maximum Distance Separable (MDS) codes 3 [21,22], and (3) adaptive as the master device offloads tasks to workers by taking into account their heterogeneous and time-varying resources. Next, we illustrate the main idea of PRAC through an illustrative example.
Example 2 We consider the same setup as in Example 1, where a master device offloads a task to 3 workers. The master has a large data matrix $A$ and wants to compute the matrix-vector product $Ax$. The master device divides $A$ row-wise into 3 sub-matrices $A_1$, $A_2$, $A_3$ and encodes these matrices using a fountain code⁴ [18][19][20]. An example set of coded packets is $A_2$, $A_3$, $A_1 + A_3$, and $A_2 + A_3$. However, prior to sending a coded packet to a worker, the master generates a random key matrix $R$ with the same dimensions as $A_i$ and with entries drawn uniformly from the same alphabet as the entries of $A$. The key matrix is added to the coded packets to provide privacy, as shown in Table 1. Since PRAC is adaptive, it requires the master to divide the tasks into smaller sub-tasks and send them sequentially (over time) to the workers. In particular, at the start of time slot 1, a key matrix $R_1$ is created and combined with $A_1 + A_3$ and $A_3$, and the resulting packets are transmitted to workers 2 and 3, respectively. $R_1$ is also transmitted to worker 1 in order to obtain $R_1 x$, which will help the master in the decoding process. The computation of $(A_1 + A_3 + R_1)x$ is completed at the end of time slot 1. Thus, at that time the master generates a new matrix, $R_2$, and sends it to worker 2. At the end of time slot 2, worker 1 finishes its computation; therefore the master adds $R_2$ to $A_2 + A_3$ and sends it to worker 1. A similar process is repeated at the end of time slot 3. Now the master waits for worker 2 to return $R_2 x$ and for any other worker to return its uncompleted task in order to decode $Ax$. Thanks to using key matrices $R_1$ and $R_2$, and assuming that workers do not collude, privacy is guaranteed. On a high level, privacy is guaranteed because the observation of the workers is statistically independent of the data $A$.

Table 1 Example PRAC operation in heterogeneous and time-varying setup. The master distributes tasks to the workers in parallel. When a worker finishes its current task, the master assigns it a new task. The table depicts the behavior of the algorithm as a function of the time instances at which a worker finishes its task at hand.
This example shows that PRAC can take advantage of coding for computation, and provide privacy.
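The masking and unmasking used in Example 2 can be illustrated with a small sketch for the non-colluding case. The prime field size `Q`, the matrix dimensions, and the helper names `mask` and `unmask` are assumptions made here for illustration only; they are not taken from the paper.

```python
import numpy as np

Q = 65537  # prime field size, chosen here only for illustration

def mask(packet, key):
    """Secure packet s = nu + R over F_Q."""
    return (packet + key) % Q

def unmask(masked_result, key_result):
    """Recover nu x from (nu + R) x and R x over F_Q."""
    return (masked_result - key_result) % Q

rng = np.random.default_rng(1)
A1, A3 = (rng.integers(0, Q, size=(2, 3)) for _ in range(2))
x = rng.integers(0, Q, size=3)
R1 = rng.integers(0, Q, size=(2, 3))      # key with entries uniform over F_Q

s = mask((A1 + A3) % Q, R1)               # packet sent to worker 2
from_worker2 = (s @ x) % Q                # worker 2 returns (A1 + A3 + R1) x
from_worker1 = (R1 @ x) % Q               # worker 1 returns R1 x
decoded = unmask(from_worker2, from_worker1)
```

Worker 2 only ever sees `s`, which is uniformly distributed whatever the value of $A_1 + A_3$, while the master recovers $(A_1 + A_3)x$ by linearity.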
Organization The structure of the rest of this paper is as follows. We summarize our contributions on a high level in Sect. 2. We give a brief overview of the related work in Sect. 3. We present the system model in Sect. 4. Section 5 presents the design of private and rateless adaptive coded computation (PRAC). We characterize and analyze PRAC in Sect. 6. We present evaluation results in Sect. 7. Section 8 concludes the paper.

Results and discussion
We design PRAC for heterogeneous and time-varying private coded computing with colluding workers. In particular, PRAC codes sub-tasks using fountain codes, and determines dynamically over time how many coded packets and keys each worker should compute. We provide a theoretical analysis of PRAC and show that it (1) guarantees the privacy conditions, (2) uses the minimum number of keys needed to satisfy the privacy requirements, and (3) maintains the desired rateless property of non-private fountain codes. Furthermore, we provide a closed-form task completion delay analysis of PRAC. Finally, we evaluate the performance of PRAC against baselines via simulations as well as in a test bed consisting of real Android-based smartphones.
Recently there has been significant interest in applying coding-theoretic techniques to speed up machine learning algorithms, as detailed in the next section. While most of the literature has focused on the mitigation of slow workers, several recent works consider security on top of it, e.g., [10, 23-26]. In a recent work, the authors of [27] show that, under certain model assumptions, there are regimes, in terms of data splitting and number of workers used, where offloading tasks to the workers can be faster than doing the computations locally for large-dimensional data. In our paper, we assume that we are operating in these large-scale regimes where offloading is needed. In this context, we view the contribution of this paper as posing the problem of bringing and adapting coded computation tools to applications on the edge such as IoT. As such, our contribution includes two parts:
(1) In IoT, the resource availability experiences high fluctuation over time. Adapting rateless codes, e.g., [28], to privacy is not immediate due to the need for MDS codes to satisfy the privacy constraints (cf. "Appendix 2"). This paper shows that even though an MDS code is needed for privacy, the encoding of the data itself can be arbitrary.
(2) We provide an implementation on Android devices (phones and tablets) to substantiate the suitability of our proposed algorithm for edge computing.

Related work
Mobile cloud computing is a rapidly growing field that aims to provide better quality of experience and extensive computing resources to mobile devices [29,30]. The main solution in mobile computing is to offload tasks to the cloud or to neighboring devices by exploiting the connectivity of the devices. Task offloading comes with several challenges such as heterogeneity of the devices, time-varying communication channels and energy efficiency, see e.g., [31][32][33][34]. We refer the interested reader to [2] and references therein for a detailed survey of the literature on edge computing and mobile cloud computing.
The problem of stragglers in distributed systems is initially studied by the distributed computing community, see e.g., [35][36][37][38]. Research interest in using coding theoretical techniques for straggler mitigation in distributed content download and distributed computing is rapidly growing. The early body of work focused on content download, see e.g., [39][40][41][42][43]. Using codes for straggler mitigation in distributed computing started in [12] where the authors proposed the use of MDS codes for distributed linear machine learning algorithms in homogeneous workers setting.
Following the work of [12], coding schemes for straggler mitigation in distributed matrix-matrix multiplication, coded computing and machine learning algorithms are introduced and the fundamental limits between the computation load and the communication cost are studied, see e.g., [8,44] and references within for matrix-matrix multiplication, see [4, 7, 10-13, 24, 28, 45-53] for machine learning algorithms and [5,6,9,54] and references within for other topics.
Codes for privacy and straggler mitigation in distributed computing are first introduced in [3,26], where the authors consider a homogeneous setting and focus on matrix-vector multiplication. The problem of private distributed matrix-matrix multiplication and private polynomial computation with straggler tolerance is studied in [23, 55-59]. In the private matrix-matrix multiplication setting, the master wants to simultaneously maintain the privacy of both matrices, which is a generalization of the matrix-vector multiplication setting. The former works are designed for the homogeneous static setting in which the master has prior knowledge of the computation capacities of the workers and pre-assigns the sub-tasks equally to them. In addition, the master sets a threshold on the number of stragglers that it can tolerate throughout the whole process. In contrast, PRAC is designed for the heterogeneous dynamic setting in which workers have different computation capacities that can change over time. PRAC assigns the sub-tasks to the workers in an adaptive manner based on the estimated computation capacity of each worker. Furthermore, PRAC can tolerate a varying number of stragglers as it uses an underlying rateless code, which gives the master higher flexibility in adaptively assigning the sub-tasks to the workers. These properties of PRAC allow a better use of the workers over the whole process. On the other hand, PRAC is restricted to matrix-vector multiplication. Although coded computation is designed for linear operations, there is a recent effort to apply coded computation to nonlinear operations. For example, [25] applied coded computation to logistic regression, and the framework of gradient coding started in [10] generalizes to any gradient-descent algorithm. Our work is complementary to these works. For example, our work can be directly used alongside [25] to provide privacy and adaptive task offloading for logistic regression.
Secure multi-party computation (SMPC) [60] can be related to our work as follows. The setting of secure multi-party computing schemes assumes the presence of several parties (masters in our terminology) who want to compute a function of all the data owned by the different parties without revealing any information about the individual data of each party. This setting is a generalized version of the master/worker setting that we consider. More precisely, an SMPC scheme reduces to our master/worker setting if we assume that only one party owns data and the others have no data to include in the function to be computed. SMPC schemes use threshold secret sharing schemes, and therefore restrict the master to a fixed number of stragglers. Thus, showing that PRAC outperforms Staircase codes (which are the best known family of threshold secret sharing schemes) implies that PRAC outperforms SMPC schemes that are reduced to this setting.
Works on privacy-preserving machine learning algorithms are also related to our work. However, the privacy constraint in this line of work is computational privacy and the proposed solutions do not take stragglers into account, see e.g., [61][62][63].
Private information retrieval [64] is also related to distributed coded computing as noted in [65]. A scheme designed for private matrix-vector multiplication can be used as a PIR scheme where the data of interest is replicated among the servers. Therefore, under some manipulation of the stored data, this scheme can be seen as a PIR scheme with flexible rate.
We restrict the scope of this paper to eavesdropping attacks, which are important on their own merit. We do not consider security against Byzantine attacks, where the malicious (adversarial) workers send corrupted data to the master in order to corrupt the whole computation process. Privacy and security can be achieved by using Maximum Distance Separable (MDS)-like codes which restrict the master to a fixed maximum number of stragglers [23,58]. Our solution on the other hand addresses the privacy problem in an adaptive coded computation setup without such a restriction. In this setup, security cannot be addressed by expanding the results of [23,58]. In fact, we developed a secure adaptive coded computation mechanism in our recent paper [66] against Byzantine attacks. The mechanism in [66] allows the master to detect with high probability the presence of malicious workers, yet it does not ensure privacy of the data. The private and secure adaptive coded computation obtained by combining this paper and [66] is out of scope of this paper.

System model
Setup We consider a master/workers setup at the edge of the network, where the master device $M$ offloads its computationally intensive tasks to workers $w_i$, $i \in [n]$, via device-to-device (D2D) links such as Wi-Fi Direct and/or Bluetooth. The master device divides a task into smaller sub-tasks, and offloads them to workers that process these sub-tasks in parallel.
Task Model We focus on the computation of linear functions, i.e., matrix-vector multiplication. We suppose the master wants to compute the matrix-vector product $Ax$, where $A \in \mathbb{F}_q^{m \times \ell}$ can be thought of as the data matrix and $x \in \mathbb{F}_q^{\ell}$ can be thought of as an attribute vector. We assume that the entries of $A$ and $x$ are drawn independently and uniformly at random⁵ from $\mathbb{F}_q$. The motivation stems from machine learning applications where computing linear functions is a building block of several iterative algorithms [67,68]. For instance, the main computation of a gradient descent algorithm with squared error loss function is

$$x^{+} = x - \alpha A^{T}(Ax - y), \tag{1}$$

where $x$ is the value of the attribute vector at a given iteration, $x^{+}$ is the updated value of $x$ at this iteration and the learning rate $\alpha$ is a parameter of the algorithm. Equation (1) consists of computing two linear functions, $Ax$ and $A^{T}w \triangleq A^{T}(Ax - y)$.
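To make the decomposition into two linear computations concrete, the sketch below performs gradient descent steps of the form $x^{+} = x - \alpha A^{T}(Ax - y)$ using two matrix-vector products per iteration. It is an illustration under stated assumptions: it works over the reals rather than $\mathbb{F}_q$, and the function name `gradient_step`, the toy data, and the learning rate are hypothetical.

```python
import numpy as np

def gradient_step(A, y, x, alpha):
    """One iteration x+ = x - alpha * A^T (A x - y), as two matrix-vector products."""
    w = A @ x - y          # first linear computation: A x
    grad = A.T @ w         # second linear computation: A^T w
    return x - alpha * grad

# Toy least-squares problem whose exact solution is x = [1, 1].
A = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([1.0, 2.0])
x = np.zeros(2)
for _ in range(200):
    x = gradient_step(A, y, x, alpha=0.1)
```

Each of the two products is a candidate for offloading to workers, which is why matrix-vector multiplication is the task of interest.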
Worker and attack model The workers incur random delays while executing the tasks assigned to them by the master device. The workers have different computation and communication specifications, resulting in a heterogeneous environment that includes workers that are significantly slower than others, known as stragglers. Moreover, the workers cannot be trusted with the master's data. We consider an eavesdropper adversary in this paper, where one or more workers are compromised by an adversary who wants to spy on the coded data sent to these devices for computation. In a comprehensive solution for IoT networks, we assume that standard security protocols (such as Datagram Transport Layer Security, DTLS) will be in place, in addition to our proposed scheme, in order to protect the wireless master-worker links from eavesdropping by nodes other than the intended workers. We assume that up to $z$, $z < n$, workers can collude, i.e., $z$ workers can share the data they received from the master in order to obtain information about $A$. The parameter $z$ can be chosen based on the desired privacy level; a larger $z$ means a higher privacy level and vice versa. One would want to set $z$ to the largest possible value, $z = n - 1$, for maximum security. However, this has the drawback of increasing the complexity and the runtime of the algorithm. In our setup we assume that $z$ is a fixed and given system parameter.
Coding and secret keys The matrix $A$ can be divided into $b$ row blocks (we assume that $b$ divides $m$; otherwise all-zero rows can be added to the matrix to satisfy this property) denoted by $A_i$, $i = 1, \dots, b$. The master applies fountain coding [18][19][20] across row blocks to create information packets $\nu_j \triangleq \sum_{i=1}^{b} c_{i,j} A_i$, $j = 1, 2, \dots$, where $c_{i,j} \in \{0, 1\}$. An information packet is a matrix of dimension $m/b \times \ell$, i.e., $\nu_j \in \mathbb{F}_q^{m/b \times \ell}$. Note that information packets can be encoded using any linear code. Fountain codes enable a fluid encoding of the information and allow the master to obtain $Ax$ as long as enough packets are collected from the workers, irrespective of their origin, which makes them a better fit for the heterogeneous and time-varying setting considered in this paper [2]. In order to maintain privacy of the data, the master device generates random matrices $R_i$ of dimension $m/b \times \ell$ called keys. The entries of the $R_i$ matrices are drawn uniformly at random from the same field as the entries of $A$. Each information packet $\nu_j$ is padded with a linear combination of $z$ keys $f_j(R_{i,1}, \dots, R_{i,z})$ to create a secure packet $s_j \in \mathbb{F}_q^{m/b \times \ell}$ defined as $s_j \triangleq \nu_j + f_j(R_{i,1}, \dots, R_{i,z})$. We show in "Appendix 2" that encoding the random keys using an MDS code is necessary to guarantee the perfect privacy of the data. Therefore, even though information packets are encoded using fountain codes, the linear combinations of the random matrices are created using MDS codes.
The master device sends $x$ to all workers, then it sends the keys and the $s_j$'s to the workers according to our PRAC scheme described later. Each worker multiplies the received packet by $x$ and sends the result back to the master. Since the encoding is rateless, the master keeps sending packets to the workers until it can decode $Ax$. The master then sends a stop message to all the workers.
Privacy conditions Our primary requirement is that any collection of $z$ (or fewer) workers will not be able to obtain any information about $A$, in an information-theoretic sense.
In particular, let $P_i$, $i = 1, \dots, n$, denote the collection of packets sent to worker $w_i$. For any set $\mathcal{B} \subseteq \{1, \dots, n\}$, let $P_{\mathcal{B}} \triangleq \{P_i, i \in \mathcal{B}\}$ denote the collection of packets given to worker $w_i$ for all $i \in \mathcal{B}$. The privacy requirement⁶ can be expressed as

$$H(A \mid P_{\mathcal{Z}}) = H(A), \quad \forall \mathcal{Z} \subseteq \{1, \dots, n\} \text{ s.t. } |\mathcal{Z}| \le z, \tag{2}$$

where $H(A)$ denotes the entropy, or uncertainty, about $A$ and $H(A \mid P_{\mathcal{Z}})$ denotes the uncertainty about $A$ after observing $P_{\mathcal{Z}}$.
Delay model We focus on the delays incurred by distributing the tasks to the workers and overlook the time spent on encoding and decoding the tasks⁷ at the master. Each packet transmitted from the master to a worker $w_i$, $i = 1, 2, \dots, n$, experiences the following delays: (1) transmission delay for sending the packet from the master to the worker, (2) computation delay for computing the multiplication of the packet by the vector $x$, and (3) transmission delay for sending the computed packet from the worker $w_i$ back to the master. We denote by $\beta_{t,i}$ the computation time of the $t$-th packet at worker $w_i$, and by $\mathrm{RTT}_i$ the average round-trip time spent to send and receive a packet from worker $w_i$.

Overview
We present the detailed explanation of PRAC. Let $p_{t,i} \in \mathbb{F}_q^{m/b \times \ell}$ be the $t$-th packet sent to worker $w_i$. This packet can be either a key, $p_{t,i} = R_{t,i}$, or a secure packet, $p_{t,i} = s_{t,i} = \nu_{t,i} + f_j(R_{t,1}, \dots, R_{t,z})$. For each value of $t$, the master sends $z$ keys denoted by $R_{t,1}, \dots, R_{t,z}$ to $z$ different workers and up to $n - z$ secure packets $s_{t,1}, \dots, s_{t,n-z}$ to the remaining workers. The master needs the results of $b + \epsilon$ information packets, i.e., $\nu_{t,i} x$, to decode the final result $Ax$, where $\epsilon$ is the overhead required by fountain coding.⁸ To obtain the results of $b + \epsilon$ information packets, the master needs the results of $b + \epsilon$ secure packets, $s_{t,i} x = (\nu_{t,i} + f_j(R_{t,1}, \dots, R_{t,z}))x$, together with all the corresponding⁹ $R_{t,i} x$, $i = 1, \dots, z$. Therefore, only the results $s_{t,i} x$ for which all the computed keys $R_{t,i} x$, $i = 1, \dots, z$, are received by the master can count toward the total of $b + \epsilon$ information packets.

Dynamic rate adaptation
The dynamic rate adaptation part of PRAC is based on [2]. In particular, the master offloads coded packets gradually to workers and receives two acknowledgements (ACKs) from each worker for each transmitted packet: one confirming the receipt of the packet by the worker, and a second one (piggybacked on the computed packet) showing that the packet has been computed by the worker. Then, based on the frequency of the received ACKs, the master decides to transmit more or fewer coded packets to that worker. In particular, each packet $p_{t,i}$ is transmitted to worker $w_i$ before or right after the computed packet $p_{t-1,i} x$ is received at the master. For this purpose, the average per-packet computing time $E[\beta_{t,i}]$ is calculated for each worker $w_i$ dynamically based on the previously received ACKs. Each packet $p_{t,i}$ is transmitted after waiting $E[\beta_{t,i}]$ from the time $p_{t-1,i}$ is sent or right after packet $p_{t-1,i} x$ is received at the master, thus reducing the idle time at the workers. This policy is shown in [2] to approach the optimal task completion delay, to maximize the workers' efficiency, and to improve task completion time significantly compared with the literature.

⁷ It is worth noting that fountain codes enjoy linear-time encoding and decoding complexity. Other codes such as Reed-Solomon codes require inverting a $k \times k$ matrix and have decoding complexity of at least $O(k \log k)$ if the generator matrix is a Vandermonde matrix and $O(k^2)$ in general.
⁸ We do not account for the probability of decoding failure because we assume that the master receives enough packets to decode with probability one. The overhead of packets required by fountain coding for the master to decode is typically as low as 5% [20], i.e., $\epsilon = 0.05b$.
⁹ Recall that $f_j(R_{t,1}, \dots, R_{t,z})$ is a linear function, thus it is easy to extract $(R_{t,i})x$, $i = 1, \dots, z$, from $(f_j(R_{t,1}, \dots, R_{t,z}))x$.
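The pacing rule of this subsection can be sketched as follows. This is only one possible reading: the class name `WorkerPacer`, the use of a plain running mean as the estimate of $E[\beta_{t,i}]$, and the interpretation of "before or right after" as "whichever event comes first" are assumptions, and how the per-packet computing time is measured from the two ACKs is left to the caller.

```python
class WorkerPacer:
    """Tracks one worker's average per-packet computing time and schedules the next send."""

    def __init__(self):
        self.samples = 0
        self.mean_beta = 0.0          # running estimate of E[beta_{t,i}]

    def record(self, beta):
        """Update the running average with a newly measured computing time."""
        self.samples += 1
        self.mean_beta += (beta - self.mean_beta) / self.samples

    def next_send_time(self, last_send_time, result_time=None):
        """Send after waiting E[beta] from the previous send, or as soon as the
        previous result has arrived, whichever happens first."""
        timer = last_send_time + self.mean_beta
        return timer if result_time is None else min(timer, result_time)

pacer = WorkerPacer()
pacer.record(2.0)
pacer.record(4.0)
scheduled = pacer.next_send_time(last_send_time=10.0)
early = pacer.next_send_time(last_send_time=10.0, result_time=11.5)
```

A fast worker accumulates a small `mean_beta` and is therefore fed packets more often, which is how the load ends up proportional to each worker's speed.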

Coding
We explain the coding scheme used in PRAC. We start with an example to build an intuition and illustrate the scheme before going into details.
Example 3 Assume there are $n = 4$ workers out of which any $z = 2$ can collude. Let $A$ and $x$ be the data owned by the master and the vector to be multiplied by $A$, respectively. The master sends $x$ to all the workers. For the sake of simplicity, assume $A$ can be divided into $b = 6$ row blocks, i.e., $A = [A_1^T \; A_2^T \; \dots \; A_6^T]^T$. The master encodes the $A_i$'s using a fountain code. We call a round the event in which the master sends a new packet to a worker. For example, we say that worker 1 is at round 3 if it has received 3 packets so far. For every round $t$, the master generates $z = 2$ random matrices $R_{t,1}, R_{t,2}$ (with the same size as $A_1$) and encodes them using an $(n, z) = (4, 2)$ systematic maximum distance separable (MDS) code by multiplying $R_{t,1}, R_{t,2}$ by a generator matrix $G$ as follows:

$$G \begin{bmatrix} R_{t,1} \\ R_{t,2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} R_{t,1} \\ R_{t,2} \end{bmatrix}.$$

This results in the encoded matrices $R_{t,1}$, $R_{t,2}$, $R_{t,1} + R_{t,2}$, and $R_{t,1} + 2R_{t,2}$. Now let us assume that workers can be stragglers. At the beginning, the master initializes all the workers at round 1. Afterwards, when a worker $w_i$ finishes its task, the master checks how many packets this worker has received so far and how many other workers are at this round. If this worker $w_i$ is the first or second to be at round $t$, the master generates $R_{t,1}$ or $R_{t,2}$, respectively, and sends it to $w_i$. Otherwise, if $w_i$ is the $j$-th worker ($j > 2$) to be at round $t$, the master multiplies $[R_{t,1} \; R_{t,2}]^T$ by the $j$-th row of $G$, adds the result to a generated fountain coded packet, and sends it to $w_i$. The master keeps sending packets to the workers until it can decode $Ax$. We illustrate the idea in Table 2.
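The key encoding of Example 3 can be checked numerically. The sketch below takes the generator matrix with rows (1, 0), (0, 1), (1, 1), (1, 2), which yields exactly the four coded keys listed in the example, and verifies the MDS property (every pair of rows is invertible). The field size `Q = 5` is an assumption chosen as the smallest prime exceeding $n = 4$, and the helper names are hypothetical.

```python
import numpy as np
from itertools import combinations

Q = 5                                              # prime field size with q > n = 4
G = np.array([[1, 0], [0, 1], [1, 1], [1, 2]])     # (4, 2) systematic generator matrix

def encode_keys(R1, R2):
    """Return the four coded keys g_j [R1; R2] over F_Q."""
    return [(g[0] * R1 + g[1] * R2) % Q for g in G]

def is_mds(G, q):
    """Check that every pair of rows of G is invertible over F_q (2x2 determinants)."""
    for i, j in combinations(range(len(G)), 2):
        det = (G[i, 0] * G[j, 1] - G[i, 1] * G[j, 0]) % q
        if det == 0:
            return False
    return True

rng = np.random.default_rng(2)
R1 = rng.integers(0, Q, size=(2, 3))
R2 = rng.integers(0, Q, size=(2, 3))
coded = encode_keys(R1, R2)
```

The same matrix fails the check over a field of size 2, which illustrates why the field must be large enough for the chosen code.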
We now explain the details of PRAC in the presence of z colluding workers.
1. Initialization The master divides $A$ into $b$ row blocks $A_1, \dots, A_b$ and sends the vector $x$ to the workers. Let $G \in \mathbb{F}_q^{n \times z}$, $q > n$, be the generator matrix of an $(n, z)$ systematic MDS code. For example, one may use systematic Reed-Solomon codes that use a Vandermonde matrix as generator matrix, see for example [69]. The master generates $z$ random matrices $R_{1,1}, \dots, R_{1,z}$ and encodes them using $G$. The generation of truly random numbers can be done as in [70,71], where the master harnesses entropy from internal and external sources to guarantee true randomness. Each coded key can be denoted by $g_i R_1$, where $g_i$ is the $i$-th row of $G$ and $R_1 \triangleq [R_{1,1} \; \dots \; R_{1,z}]^T$. The master sends the $z$ keys $R_{1,1}, \dots, R_{1,z}$ to the first $z$ workers, generates $n - z$ fountain coded packets of the $A_i$'s, adds to each packet an encoded random key $g_i R_1$, $i = z + 1, \dots, n$, and sends them to the remaining $n - z$ workers.

2. Encoding and adaptivity When the master wants to send a new packet to a worker (noting that a packet $p_{t,i}$ is transmitted to worker $w_i$ before, or right after, the computed packet $p_{t-1,i} x$ is received at the master according to the strategy described in Sect. 5.2), it checks at which round this worker is, i.e., how many packets this worker has received so far, and checks how many other workers are at least at this round. Assume worker $w_i$ is at round $t$ and $j - 1$ other workers are at least at this round. If $j \le z$, the master generates and sends $R_{t,j}$ to the worker. However, if $j > z$, the master generates a fountain coded packet of the $A_i$'s (e.g., $A_1 + A_2$), adds to it $g_j R_t$ and sends the packet ($A_1 + A_2 + g_j R_t$) to the worker. Each worker computes the multiplication of the received packet by the vector $x$ and sends the result to the master.

3. Decoding and speed Let $\tau_i$ denote the number of packets sent to worker $i$. We define $\tau_{\max}$ as the largest value of $\tau_i$ for which the master has received all the $R_{t,i} x$ for all $i = 1, \dots, z$ and $t = 1, \dots, \tau_{\max}$. The master can therefore subtract $R_{t,i} x$, $t = 1, \dots, \tau_{\max}$ and $i = 1, \dots, z$, from the results of all received secure information packets, and thus can decode the $A_i x$'s using the fountain code decoding process. The number of secure packets that can be used for decoding is dictated by the $(z + 1)$st fastest worker, i.e., the master can only use the results of secure information packets computed at a given round if at least $z + 1$ workers have completed that round. Let $u_t$ denote the number of workers which completed round $t$; then the master can use $\max\{u_t - z, 0\}$ information packets from round $t$ in the decoding process. In order to decode $Ax$, the master needs $\sum_{t=1}^{\tau_{\max}} \max\{u_t - z, 0\} = b + \epsilon$ information packets in total. For example, if the $z$ fastest workers have completed round 100 and the $(z + 1)$st fastest worker has completed round 20, the master can only use the packets belonging to the first 20 rounds. The reason is that the master needs all the keys corresponding to a given round in order to use the secure information packets of that round for decoding. In Lemma 2 we prove that this scheme is optimal, i.e., in private coded computing the master cannot use the packets computed at rounds finished by fewer than $z + 1$ workers, irrespective of the coding scheme.

Table 2 Depiction of PRAC in the presence of stragglers. The master keeps generating packets using fountain codes until it can decode $Ax$. The master estimates the average task completion time of each worker. When a worker is expected to finish the task at hand, the master sends a new task to that worker. This mechanism avoids idle time at the workers. Thus, at every time instance in which a worker is expected to finish its task, the master creates a new packet and sends it to that worker. Each new packet sent to a worker must be secured with a new random key. The master can decode $A_1 x, \dots, A_6 x$ after receiving all the packets not having $R_{4,1}$ or $R_{4,2}$ in them.
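The counting rule of the decoding step, in which round $t$ contributes $\max\{u_t - z, 0\}$ usable information packets, can be sketched as below. The function names and the representation of progress as a list of completed rounds per worker are assumptions made for illustration; the sketch also treats reaching $b + \epsilon$ packets as sufficient for decoding, as the text does.

```python
def usable_information_packets(completed, z):
    """Count information packets the master can use for decoding.

    completed[i] is the number of rounds worker i has finished; a round t
    contributes max(u_t - z, 0) packets, where u_t workers finished round t.
    """
    total = 0
    for t in range(1, max(completed, default=0) + 1):
        u_t = sum(1 for rounds in completed if rounds >= t)
        total += max(u_t - z, 0)
    return total

def can_decode(completed, z, b, overhead):
    """True once the usable packets reach b plus the fountain-code overhead."""
    return usable_information_packets(completed, z) >= b + overhead

# n = 4 workers, z = 2: the two fastest are at round 100, the third at 20, the fourth at 5.
progress = [100, 100, 20, 5]
usable = usable_information_packets(progress, z=2)
```

In this example only rounds 1 to 20 contribute, because rounds 21 to 100 were completed by just the two key-carrying workers.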

Privacy
In this section, we provide theoretical analysis of PRAC by particularly focusing on its privacy properties.

Theorem 1 PRAC is a rateless real-time adaptive coded computing scheme that allows a master device to run distributed linear computation on private data $A$ via $n$ workers while satisfying the privacy constraint given in (2) for a given $z < n$.
Proof Since the random keys are generated independently at each round, it is sufficient to study the privacy of the data in one round; the privacy then generalizes to the whole algorithm. We show that for any subset $\mathcal{Z} \subset \{1, \dots, n\}$, $|\mathcal{Z}| = z$, the collection of packets $p_{\mathcal{Z}} \triangleq \{p_{t,i}, i \in \mathcal{Z}\}$ sent at round $t$ reveals no information about the data $A$ as given in (2), i.e., $H(A) = H(A \mid p_{\mathcal{Z}})$. Let $K$ denote the random variable representing all the keys generated at round $t$; then it is enough to show that $H(K \mid A, p_{\mathcal{Z}}) = 0$, as detailed in "Appendix 2." Therefore, we need to show that given $A$ as side information, any $z$ workers can decode the random keys $R_{t,1}, \dots, R_{t,z}$. Without loss of generality, assume the workers are ordered from fastest to slowest, i.e., worker $w_1$ is the fastest at the considered round $t$. Since the master sends $z$ random keys to the fastest $z$ workers, $p_{t,i} = R_{t,i}$, $i = 1, \dots, z$. The remaining $n - z$ packets are secure information packets sent to the remaining $n - z$ workers, i.e., $p_{t,i} = s_{t,i} = \nu_{t,i} + f(R_{t,1}, \dots, R_{t,z})$, where $\nu_{t,i}$ is a linear combination of row blocks of $A$ and $f(R_{t,1}, \dots, R_{t,z})$ is a linear combination of the random keys generated at round $t$. Given the data $A$ as side information, any collection of $z$ packets can be expressed as $z$ codewords of the $(n, z)$ MDS code encoding the random keys. Thus, given the matrix $A$, any collection of $z$ packets is enough to decode all the keys and $H(K \mid A, p_{\mathcal{Z}}) = 0$, which concludes the proof.
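For intuition, the privacy condition can be checked exhaustively on a toy instance. The sketch below assumes scalar packets over $\mathbb{F}_5$, $n = 4$ workers, and the generator matrix with rows (1, 0), (0, 1), (1, 1), (1, 2) from Example 3. It verifies that the joint view of any $z = 2$ workers has the same distribution for every choice of data symbols, which is equivalent to the view being independent of the data. All names are hypothetical, and the check is an illustration, not a substitute for the proof.

```python
from itertools import combinations, product
from collections import Counter

Q = 5                                   # field size, q > n = 4
G = [(1, 0), (0, 1), (1, 1), (1, 2)]    # generator matrix of Example 3

def packets(nu, keys):
    """Scalar packets for 4 workers: workers 1, 2 get keys; workers 3, 4 get nu + g_i . keys."""
    info = (0, 0) + tuple(nu)           # the first z = 2 workers carry no data
    return tuple((info[i] + G[i][0] * keys[0] + G[i][1] * keys[1]) % Q for i in range(4))

def view_distribution(nu, subset):
    """Distribution of what the workers in `subset` see, over uniform keys."""
    return Counter(tuple(packets(nu, keys)[i] for i in subset)
                   for keys in product(range(Q), repeat=2))

def is_private(z=2):
    """True if every z-subset's view has the same distribution for all data values."""
    for subset in combinations(range(4), z):
        reference = view_distribution((0, 0), subset)
        for nu in product(range(Q), repeat=2):
            if view_distribution(nu, subset) != reference:
                return False
    return True
```

With only two keys per round, three colluding workers can cancel the keys, so `is_private(3)` fails; this matches the requirement that the number of keys equal the collusion parameter $z$.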
Remark 1 PRAC requires the master to wait for the $(z + 1)$st fastest worker in order to be able to decode $Ax$. We show in Lemma 2 that this limitation is inherent to all private coded computing schemes.
Remark 2 PRAC uses the minimum number of keys required to guarantee the privacy constraints. At each round, PRAC uses exactly $z$ random keys, which is the minimum number of keys required (cf. Equation (12) in "Appendix 2").
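The structure of one PRAC round described in the proof of Theorem 1 can be illustrated with a minimal Python sketch. Several choices below are assumptions made only for illustration and are not taken from the paper: arithmetic is done modulo a prime, the key-mixing coefficients come from a Cauchy matrix (so that the raw keys together with the key mixtures form a systematic MDS code), and the vectors `nus` stand in for the fountain-coded combinations $\nu_{t,i}$ of row blocks. All function names are hypothetical.

```python
import random

P = 2_147_483_647  # prime modulus 2^31 - 1; the field choice is an assumption


def inv(a: int) -> int:
    """Multiplicative inverse modulo the prime P (Fermat's little theorem)."""
    return pow(a, P - 2, P)


def cauchy_coeffs(n: int, z: int) -> list[list[int]]:
    """coeffs[i][j] multiplies key j in the packet of the i-th non-key worker."""
    # Points x_i = z + i and y_j = j are all distinct, so x_i - y_j is never 0.
    return [[inv((z + i) - j) for j in range(z)] for i in range(n - z)]


def make_round(nus, z, length, rng):
    """Build one round: z raw key packets followed by n - z masked packets."""
    n = z + len(nus)
    keys = [[rng.randrange(P) for _ in range(length)] for _ in range(z)]
    coeffs = cauchy_coeffs(n, z)
    masked = []
    for i, nu in enumerate(nus):
        mask = [sum(coeffs[i][j] * keys[j][c] for j in range(z)) % P
                for c in range(length)]
        masked.append([(nu[c] + mask[c]) % P for c in range(length)])
    return keys + masked, coeffs


def dot(u, v):
    """Inner product modulo P: the computation each worker performs with x."""
    return sum(a * b for a, b in zip(u, v)) % P


def unmask(results, coeffs, z):
    """Recover nu_i . x from the workers' results once all z key results arrive."""
    key_results = results[:z]
    out = []
    for i, r in enumerate(results[z:]):
        mask = sum(coeffs[i][j] * key_results[j] for j in range(z)) % P
        out.append((r - mask) % P)
    return out
```

With `n = 5` and `z = 2`, the master would send the two key packets to the two fastest workers, collect `dot(p, x)` from every worker, and call `unmask` to strip the key contribution from the three remaining results. The sketch only shows the masking and unmasking arithmetic; it does not model fountain encoding or worker timing.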

Lemma 2 Any private coded computing scheme for distributed linear computation limits the master to the speed of the (z + 1)st fastest worker.
Proof The proof of Lemma 2 is provided in "Appendix 3."

Task completion delay
In this section, we characterize the task completion delay of PRAC and compare it with Staircase codes [3], which are secure against eavesdropping attacks in a coded computation setup with homogeneous resources. First, we characterize the task completion delay of PRAC.

Theorem 3 Let $\beta_{t,i}$ denote the computation time of packet $p_{t,i}$ at worker $w_i$ and let $\mathrm{RTT}_i$ denote the average round-trip time for sending one packet to worker $w_i$ and receiving one computed packet back, with the workers indexed from fastest to slowest. The task completion time of PRAC is approximated as

$$T_{\mathrm{PRAC}} \approx \max_{i \in \{1,\dots,n\}} \{\mathrm{RTT}_i\} + \frac{b + \epsilon}{\sum_{i=z+1}^{n} 1/E[\beta_{t,i}]} \qquad (4)$$

$$\approx \frac{b + \epsilon}{\sum_{i=z+1}^{n} 1/E[\beta_{t,i}]}. \qquad (5)$$

Proof The proof of Theorem 3 is provided in "Appendix 4."

Now that we have characterized the task completion delay of PRAC, we can compare it with the state of the art. Secure coded computing schemes in the literature usually use static task allocation, where tasks are assigned to workers a priori. The most recent work in the area is Staircase codes, which are shown to outperform all existing schemes that use threshold secret sharing [3]. However, Staircase codes are static; they allocate a fixed amount of tasks to workers a priori. Thus, Staircase codes can neither leverage the heterogeneity of the system nor adapt to a system that changes over time. On the other hand, PRAC adaptively offloads tasks to workers by taking into account the heterogeneity and time-varying nature of resources at the workers. Therefore, we restrict our focus to comparing PRAC with Staircase codes.

Staircase codes assign a task of size $b/(k - z)$ row blocks to each worker. Let $T_i$ be the time spent at worker $i$ to compute the whole assigned task. Denote by $T_{(i)}$ the $i$th order statistic of the $T_i$'s and by $T_{\mathrm{SC}}(n, k, z)$ the task completion time, i.e., the time the master waits until it can decode $Ax$, when using Staircase codes. In order to decode $Ax$, the master needs to receive a fraction equal to $(k - z)/(d - z)$ of the task assigned to each worker from any $d$ workers, where $k \le d \le n$. The task completion time of the master is expressed as [3]

$$T_{\mathrm{SC}}(n, k, z) = \min_{d \in \{k,\dots,n\}} \left\{ \frac{k - z}{d - z}\, T_{(d)} \right\}. \qquad (6)$$

Theorem 4 The gap between the completion time of PRAC and coded computation using staircase codes is lower bounded by
and $d^*$ is the value of $d$ that minimizes Eq. (6).
Proof We provide the proof of Theorem 4 in [72].
Theorem 4 shows that the lower bound on the gap between secure coded computation using Staircase codes and PRAC is on the order of the number of row blocks of $A$. Hence, this gap increases linearly with the number of row blocks of $A$. Note that $\epsilon$, the overhead required by the fountain coding used in PRAC, becomes negligible as $b$ increases.
Thus, PRAC outperforms secure coded computation using Staircase codes in heterogeneous systems. The more heterogeneous the workers are, the larger the improvement obtained by using PRAC. However, Staircase codes can slightly outperform PRAC when the slowest $n - z$ workers are homogeneous, i.e., have similar compute service times $T_i$. In this case both algorithms are restricted to the slowest $n - z$ workers (see Lemma 2), but PRAC incurs an overhead of $\epsilon$ tasks (due to using fountain codes) that is not needed for Staircase codes. In particular, from (5) and (6), when the $n - z$ slowest workers are homogeneous, the task completion times of PRAC and Staircase codes are equal to $\frac{b + \epsilon}{n - z} E[\beta_{t,n}]$ and $\frac{b}{n - z} E[\beta_{t,n}]$, respectively.
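The homogeneous special case can be checked numerically with a short Python sketch. It assumes that the PRAC approximation takes the form $(b + \epsilon) / \sum_{i=z+1}^{n} 1/E[\beta_{t,i}]$ derived in "Appendix 4," and that the Staircase completion time is the minimum over $d \in \{k, \dots, n\}$ of $\frac{k-z}{d-z} T_{(d)}$. The function names and the numbers in the usage note are illustrative, not taken from the paper.

```python
def t_prac(b, eps, mean_beta, z):
    """PRAC approximation: (b + eps) / sum of 1/E[beta_i] over the n - z slowest workers."""
    slowest = sorted(mean_beta)[z:]  # the z fastest workers only receive keys
    return (b + eps) / sum(1.0 / m for m in slowest)


def t_staircase(task_times, k, z):
    """Staircase codes: min over d in {k, ..., n} of (k - z)/(d - z) * T_(d)."""
    ordered = sorted(task_times)  # ordered[d - 1] is the d-th order statistic
    n = len(ordered)
    return min((k - z) / (d - z) * ordered[d - 1] for d in range(k, n + 1))
```

For example, with `b = 60`, `eps = 3`, `z = 1`, one fast worker with mean per-packet time 0.5 and four slow workers with mean 2, `t_prac(60, 3, [0.5, 2, 2, 2, 2], 1)` gives 31.5, which matches $(b + \epsilon) E[\beta_{t,n}] / (n - z)$. Taking `k = 2`, each Staircase task has 60 rows, so the whole-task times are `[30, 120, 120, 120, 120]` and `t_staircase(..., 2, 1)` gives 30, which matches $b\, E[\beta_{t,n}] / (n - z)$. The difference is exactly the fountain-code overhead.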

Simulations
In this section, we present simulations run in MATLAB and compare PRAC with the following baselines: (1) Staircase codes [3], (2) C3P [2] (which is not designed to be secure), and (3) Genie C3P (GC3P), which extends C3P by assuming knowledge of the identity of the eavesdroppers and ignoring them. We note that GC3P serves as a lower bound on the completion time of private coded computing schemes for heterogeneous systems for the following reason: for a given number $z$ of colluding workers, the ideal coded computing scheme knows which workers are eavesdroppers and ignores them, using the remaining workers without any need for randomness. If the identity of the colluding workers is unknown, coded computing schemes require randomness and become limited to the $(z + 1)$st fastest worker (Lemma 2). GC3P and other coded computing schemes have similar performance if the $z$ colluding workers are the fastest workers. If the $z$ colluding workers are the slowest, then GC3P outperforms any coded computing scheme. Note that PRAC considers the scenario of unknown eavesdroppers. Comparing PRAC with GC3P shows how well PRAC performs relative to the best possible solution for heterogeneous systems. Among the solutions designed for the homogeneous setting, we restrict our attention to Staircase codes, a class of secret sharing schemes that offer flexibility in the number of workers needed to decode the matrix-vector multiplication. Staircase codes are shown to outperform any coded computing scheme that requires a threshold on the number of stragglers [3].
In our simulations, we model the computation time of each worker $w_i$ by an independent shifted exponential random variable with rate $\lambda_i$ and shift $c_i$, i.e., $\Pr(T_i \le t) = 1 - e^{-\lambda_i (t - c_i)}$ for $t \ge c_i$. We take $c_i = 1/\lambda_i$ and consider three different scenarios for choosing the values of the $\lambda_i$'s for the workers:
• Scenario 1: we assign $\lambda_i = 3$ to half of the workers, $\lambda_i = 1$ to one quarter of the workers, and $\lambda_i = 9$ to the remaining workers.
• Scenario 2: we assign $\lambda_i = 1$ to one third of the workers, $\lambda_i = 3$ to the second third, and $\lambda_i = 9$ to the remaining workers.
• Scenario 3: we draw the $\lambda_i$'s independently and uniformly at random from the interval [0.5, 9].
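The worker model above can be sketched in a few lines of Python. The paper's simulations were run in MATLAB; this snippet is only an illustration. The function names are hypothetical, and the way fractional group sizes are rounded (integer division, with the remainder going to the last group) is an assumption, since the text does not specify it.

```python
import random


def scenario_rates(n, scenario, rng):
    """Return the list of rates lambda_i for n workers under one of the three scenarios."""
    if scenario == 1:
        half, quarter = n // 2, n // 4  # rounding rule is an assumption
        return [3.0] * half + [1.0] * quarter + [9.0] * (n - half - quarter)
    if scenario == 2:
        third = n // 3
        return [1.0] * third + [3.0] * third + [9.0] * (n - 2 * third)
    if scenario == 3:
        return [rng.uniform(0.5, 9.0) for _ in range(n)]
    raise ValueError("scenario must be 1, 2, or 3")


def sample_time(rate, rng):
    """Draw one shifted exponential computation time with shift c = 1 / rate."""
    shift = 1.0 / rate
    return shift + rng.expovariate(rate)  # expovariate(rate) has mean 1 / rate
```

Under this model a worker with rate $\lambda_i$ has mean computation time $c_i + 1/\lambda_i = 2/\lambda_i$, so a larger rate means a faster worker. With `n = 50`, the rounding rule used here yields 25 workers with rate 3, 12 with rate 1, and 13 with rate 9 in scenario 1.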
When running Staircase codes, we choose the parameter k that minimizes the task completion time for the desired n and z. We do so by simulating Staircase codes for all possible values of k, z ≤ k ≤ n , and choose the one with the minimum completion time.
We take $b = m$, i.e., each row block is simply a row of $A$. The size of each element of $A$ and of the vector $x$ is assumed to be 1 byte (8 bits). Therefore, the size of each transmitted packet $p_{t,i}$ is $8\ell$ bits. For the simulation results, we assume that $A$ is a square matrix, i.e., $\ell = m$. We take $m = 1000$ unless explicitly stated otherwise. $C_i$ denotes the average channel capacity of worker $w_i$ and is selected uniformly from the interval [10, 20] Mbps. The rate of sending a packet to worker $w_i$ is sampled from a Poisson distribution with mean $C_i$.
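To give a sense of scale for these parameters, the following Python sketch computes the packet size and the time needed to push one packet through a link of a given capacity. It assumes 1 Mbps means $10^6$ bits per second and uses the average capacity directly instead of sampling a rate; the function names are hypothetical.

```python
def packet_bits(row_length, bytes_per_element=1):
    """Size in bits of one packet holding a row of row_length elements."""
    return 8 * bytes_per_element * row_length


def tx_delay_seconds(bits, capacity_mbps):
    """Time to send `bits` over a link of capacity_mbps (1 Mbps = 1e6 bit/s assumed)."""
    return bits / (capacity_mbps * 1e6)
```

For a row of length 1000, a packet has 8000 bits, which takes about 0.8 ms at 10 Mbps and about 0.4 ms at 20 Mbps. Both values are small compared with the per-packet computation times in the three scenarios, whose shifts $1/\lambda_i$ alone are at least 1/9.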
In Fig. 1, we show the effect of the number of rows $m$ on the completion time at the master. We fix the number of workers to 50 and the number of colluding workers to 13, and plot the completion time for PRAC, C3P, GC3P and Staircase codes. Notice that PRAC and Staircase codes have close completion times in scenario 1 (Fig. 1a), and this completion time is far from that of C3P. The reason is that in this scenario we pick exactly 13 workers to be fast ($\lambda_i = 9$) and the others to be significantly slower. Since PRAC assigns keys to the fastest $z$ workers, the completion time is dictated by the slow workers. To compare PRAC with Staircase codes, notice that the majority of the remaining workers have $\lambda_i = 3$; therefore, pre-allocating equal tasks to the workers is close to adaptively allocating the tasks.
In terms of the lower bound on PRAC, observe that when the fastest workers are assumed to be adversarial, GC3P and PRAC have very similar task completion times. However, when the slowest workers are assumed to be adversarial, the completion time of GC3P is very close to that of C3P and far from that of PRAC. This observation is in accordance with Lemma 2. In scenarios 2 and 3, we pick the adversarial workers uniformly at random and observe that the completion time of PRAC becomes closer to that of GC3P when the workers are more heterogeneous. For instance, in scenario 3, GC3P and PRAC have closer performance when the workers' rates $\lambda_i$ are chosen uniformly at random from the interval [0.5, 9].
In Fig. 2, we plot the task completion time as a function of the number of workers $n$ for a fixed number of rows $m = 1000$ and $\lambda_i$'s assigned according to scenario 1. In Fig. 2a, we change the number of workers from 10 to 100 and keep the ratio $z/n = 1/4$ fixed. We notice that as $n$ increases, the completion time of PRAC becomes closer to that of GC3P. In Fig. 2b, we change the number of workers from 20 to 100 and keep $z = 13$ fixed. We notice that as $n$ increases, the effect of the eavesdroppers is amortized and the completion time of PRAC becomes closer to that of C3P. In this setting, PRAC always outperforms Staircase codes.
In Fig. 3, we plot the task completion time as a function of the number of colluding workers. In Fig. 3a, we choose the computing time at the workers according to scenario 1. We change $z$ from 1 to 40 and observe that the completion time of PRAC deviates from that of GC3P with the increase of $z$. More importantly, we observe two inflection points of the average completion time of PRAC at $z = 13$ and $z = 37$. Those inflection points are due to the fact that we have 12 fast workers ($\lambda = 9$) and 25 workers with medium speed ($\lambda = 3$) in the system. For $z > 36$, the completion time of Staircase codes becomes less than that of PRAC because the 14 slowest workers are homogeneous. Therefore, pre-allocating the tasks is better than using fountain codes and paying for the overhead of computations. To confirm that Staircase codes always outperform PRAC when the slowest $n - z$ workers are homogeneous, we run a simulation in which we divide the workers into three clusters. The first cluster consists of $\lfloor z/2 \rfloor$ fast workers ($\lambda = 9$), the second consists of $\lfloor z/2 \rfloor + 1$ regular workers ($\lambda = 3$), and the remaining $n - z$ workers are slow ($\lambda = 1$). In Fig. 3b, we fix $n$ to 50 and change $z$ from 1 to 40. We observe that Staircase codes always outperform PRAC in this setting. In contrast to non-secure C3P, Staircase codes and PRAC are always restricted to the slowest $n - z$ workers and cannot leverage the increase of the number of fast workers. For GC3P, we assume that the fastest workers are eavesdroppers. We note that, as expected from Lemma 2, when the fastest workers are assumed to be eavesdroppers the performance of GC3P and PRAC becomes very close.

Fig. 1
a Scenario 1 with the fastest 13 workers as eavesdroppers for GC3P 1 and the slowest workers as eavesdroppers for GC3P 2. b Scenario 2 with 13 workers picked at random to be eavesdroppers. c Scenario 3 with 13 workers picked at random to be eavesdroppers. For each value of $m$ we run 100 experiments and average the results. When the eavesdroppers are chosen to be the fastest workers, PRAC has very similar performance to GC3P. When the eavesdroppers are picked randomly, the performance of PRAC becomes closer to that of GC3P when the non-adversarial workers are more heterogeneous

Fig. 3
Comparison between PRAC and Staircase codes average completion time as a function of the number of colluding workers $z$. We fix the number of rows to $m = 1000$. Both codes are affected by the increase in the number of colluding helpers because their runtime is restricted to the slowest $n - z$ workers. We observe that PRAC outperforms Staircase codes except when the $n - z$ slowest workers are homogeneous

Experiments
Setup The master device is a Nexus 5 Android-based smartphone running Android 6.0.1. The worker devices are Nexus 6Ps running Android 8.1.0. The master device connects to the worker devices via Wi-Fi Direct links, and the master is the group owner of the Wi-Fi Direct group. All the algorithms are implemented in an Android application written in Java. The master and the workers form a star topology with the master device at the center. The devices are placed approximately 10 inches away from each other. The devices communicate via TCP sockets, which are wrapped by data output and input streams to send and receive different data types such as integers, doubles and bytes. A simple communication protocol is built between the master and the workers. The master first sends the size of the data, followed by the actual data for multiplication. The worker receives the data, processes it, and sends the acknowledgement and the results back to the master. The master retrieves the results and estimates the processing delay. To allow parallel processing, the master device uses multi-threading and handles each worker in an independent thread. We denote this type of thread as a "worker thread." A main thread at the master controls all the worker threads. For PRAC, the main thread records the round of each worker. All the results from each worker thread are reported to the main thread as well.
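The size-then-data exchange described above can be sketched as follows. The paper's implementation is an Android application in Java; this Python version only mirrors the idea of a length-prefixed message over a TCP-style stream. The function names, the 4-byte big-endian length prefix, the encoding of values as 8-byte doubles, and the assumption that the worker already holds $x$ are all illustrative choices, not details taken from the paper.

```python
import socket
import struct
import threading
import time


def send_msg(sock, payload: bytes) -> None:
    """Send the payload size (4-byte big-endian integer) followed by the payload."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock) -> bytes:
    """Read one length-prefixed message."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def pack_vector(values) -> bytes:
    return struct.pack(f">{len(values)}d", *values)


def unpack_vector(data: bytes):
    return list(struct.unpack(f">{len(data) // 8}d", data))


def worker(sock, x) -> None:
    """Worker side: receive one packet (a row), multiply by x, send the result back."""
    row = unpack_vector(recv_msg(sock))
    result = sum(a * b for a, b in zip(row, x))
    send_msg(sock, pack_vector([result]))


def master_round(sock, row):
    """Master side: send one packet, wait for the result, and measure the delay."""
    start = time.monotonic()
    send_msg(sock, pack_vector(row))
    (result,) = unpack_vector(recv_msg(sock))
    return result, time.monotonic() - start
```

A quick local check can use `socket.socketpair()` with `worker` running in a `threading.Thread`, one thread per worker as in the setup above. The delay returned by `master_round` covers both transmission and computation, which is the quantity a master would average to estimate when a worker will finish its current task.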
The master device is required to complete one matrix-vector multiplication ($y = Ax$), where $A$ is of dimensions 60 × 10,000 and $x$ is a 10,000 × 1 vector. We also take $m = b$, i.e., each packet is a row of $A$. In our implementation, the workers are dedicated to the master and do not run background applications or other computing tasks. Therefore, we introduce an artificial delay at the workers that follows an exponential distribution. The introduced delays serve to emulate real scenarios in which workers would be running other applications in the background. We manipulate the delays in the experiment to analyze the performance of PRAC and the baseline algorithms in the presence of stragglers and to validate our theoretical findings. A worker device sends the result to the master after it is done calculating and the introduced delay has passed. Furthermore, we assume that $z = 1$, i.e., there is one unknown adversarial worker among all the workers. The experiments are conducted in a lab environment where other Wi-Fi networks operate in the background.
Baselines Our PRAC algorithm is compared to three baseline algorithms: (1) Staircase codes, which preallocate the tasks based on $n$, the number of workers, $k$, the minimum number of workers required to reconstruct the information, and $z$, the number of colluding workers; (2) GC3P, in which we assume the adversarial worker is known and excluded during the task allocation; and (3) non-secure C3P, in which the security problem is ignored and the master device utilizes every resource without randomness. In this setup we run C3P on $n - z$ workers.

Results
Figure 4 presents the task completion time with increasing number of workers for the homogeneous setup, i.e., when all the workers have similar computing times. The computing delay for each packet follows an exponential distribution with mean $\mu = 1/\lambda = 3$ s at all workers. C3P performs best in terms of completion time, but it does not provide any privacy guarantees. PRAC outperforms Staircase codes when the number of workers is 5. The reason is that PRAC performs better than Staircase codes in a heterogeneous setup, and when the number of workers increases, the system becomes slightly more heterogeneous. GC3P significantly outperforms PRAC in terms of completion time. Yet, it requires prior knowledge of which worker is adversarial, which is often not available in real-world scenarios.

Now, we focus on the heterogeneous setup. We divide the workers into two groups: fast workers (the per-task delay follows an exponential distribution with mean 2 s) and slow workers (the per-task delay follows an exponential distribution with mean 5 s). Figure 5 presents the completion time as a function of the number of workers. In this setup, for the $n$-worker scenario, there are $\lceil n/2 \rceil$ fast and $\lfloor n/2 \rfloor$ slow workers. The difference between the setups of Fig. 5a and Fig. 5b is that we remove a fast worker (as adversarial) for GC3P in the former, whereas in the latter we assume that the eavesdropper is a slow worker. As illustrated in Fig. 5, for the 2-worker case, due to the 5% overhead introduced by fountain codes, PRAC performs worse than Staircase codes. However, PRAC outperforms Staircase codes in terms of completion time for the 3-, 4-, and 5-worker cases. This is due to the fact that PRAC can utilize results calculated by slow workers more effectively when the number of workers is large. On the other hand, the results computed by slow workers are often discarded in Staircase codes, which is a waste of computation resources. If a fast worker is removed as adversarial for GC3P, the difference between the performance of GC3P and PRAC becomes smaller. This result is intuitive as, in PRAC, the master has to wait for the $(z + 1)$st fastest worker to decode $Ax$, which is also the case for GC3P in this setting.

In Fig. 6, we consider the same setup with the exception that for the $n$-worker scenario, there are $\lceil n/2 \rceil$ slow and $\lfloor n/2 \rfloor$ fast workers. Staircase codes perform more closely to PRAC in the 3-worker case as compared to Fig. 5, since the setup of Fig. 6 assumes that the $n - z = 2$ slowest workers are homogeneous, whereas in Fig. 5 the $n - z = 2$ slowest workers are heterogeneous. Yet, for the 5-worker case, PRAC outperforms Staircase codes as in Fig. 5, since PRAC is adaptive to time-varying resources while Staircase codes assign tasks a priori in a static manner.

Note that in all experiments in which the $n - z$ slowest workers are homogeneous, Staircase codes outperform GC3P and PRAC. This happens because pre-allocating the tasks to the workers avoids the overhead of sub-tasks required by fountain codes and utilizes all the workers to their fullest capacity.

Fig. 5
a We assume a fast worker is adversarial for GC3P.

Conclusion
The focus of this paper is to develop a secure edge computing mechanism to mitigate the computational bottleneck of IoT devices by allowing these devices to help each other in their computations, with possible help from the cloud if available. Our key tool is the theory of coded computation, which advocates mixing data in computationally intensive tasks by employing erasure codes and offloading these tasks to other devices for computation. Focusing on eavesdropping attacks, we designed a private and rateless adaptive coded computation (PRAC) mechanism considering (1) the privacy requirements of IoT applications and devices, and (2) the heterogeneous and time-varying resources of edge devices. Our proposed PRAC model can provide adequate security and latency guarantees to support real-time computation at the edge. We showed through analysis, MATLAB simulations, and experiments on Android-based smartphones that PRAC outperforms known secure coded computing methods when resources are heterogeneous.

Appendix 3: Proof of Lemma 2
… worker $w_1$ is the fastest and worker $w_n$ is the slowest. The previous assumption implies that the results sent from the first $z$ workers contain information about $Ax$; otherwise the master would have to wait at least for the $(z + 1)$st fastest worker to decode $Ax$. By linearity of the multiplication $Ax$, decoding information about $Ax$ from the results of $z$ workers implies decoding information about $A$ from the packets sent to those $z$ workers. Hence, there exists a set $Z \subset \{1, \dots, n\}$ of $z$ workers for which $H(A \mid P_Z) \neq H(A)$, where $P_Z$ denotes the tasks allocated to a subset of $z$ workers, which violates the privacy constraint. Therefore, any private coded computing scheme for linear computation limits the master to the speed of the $(z + 1)$st fastest worker in order to decode the wanted result.

Appendix 4: Proof of Theorem 3
The total delay for receiving $\tau_i$ computed packets from worker $w_i$ is equal to

$$T_i \approx \mathrm{RTT}_i + \tau_i E[\beta_{t,i}],$$

where $\mathrm{RTT}_i$ is the average transmission delay for sending one packet to worker $w_i$ and receiving one computed packet from the worker, $\beta_{t,i}$ is the computation time spent on multiplying packet $p_{t,i}$ by $x$ at worker $w_i$, and the average $E[\beta_{t,i}]$ is taken over all $\tau_i$ packets. The reason is that PRAC is a dynamic algorithm that sends packets to each worker $w_i$ with an interval of $E[\beta_{t,i}]$ between each two consecutive packets, and it utilizes the resources of the workers fully [76]. The reason behind counting only one round-trip time (RTT) in $T_i$ is that in PRAC, packets are transmitted to the workers while the previously transmitted packets are being computed at the worker. Therefore, in the overall delay only one $\mathrm{RTT}_i$ is required for sending the first packet $p_{1,i}$ to worker $w_i$ and receiving the last computed packet $p_{\tau_i,i}\, x$ at the master. To approximate the total delay, we assume that the transmission delay of one packet is negligible compared to the computing delay of all $\tau_i$ packets, which is a valid assumption in practice for IoT devices at the edge. On the other hand, in PRAC, the master stops sending packets to workers as soon as it collectively receives $b + \epsilon$ computed packets from the $n - z$ slowest workers (note that $b + \epsilon$ is the number of computed packets required for successful decoding, where $\epsilon$ is the overhead due to fountain coding), i.e., $\sum_{i=z+1}^{n} \tau_i = b + \epsilon$. Note that the $z$ fastest workers are assigned to computing the keys as described in the previous sections. Because PRAC uses the resources of the workers efficiently, all $n - z$ workers will finish computing their $\tau_i$ packets at approximately the same time, i.e., $T_{\mathrm{PRAC}} \approx T_i \approx \tau_i E[\beta_{t,i}]$, $i = z + 1, \dots, n$. By replacing $\tau_i$ with $T_{\mathrm{PRAC}} / E[\beta_{t,i}]$ in $\sum_{i=z+1}^{n} \tau_i = b + \epsilon$, we can show that $T_{\mathrm{PRAC}} \approx \frac{b + \epsilon}{\sum_{i=z+1}^{n} 1/E[\beta_{t,i}]}$. Note that the approximated value approaches the exact value as $b$ increases.
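The approximation can be compared against a simple event-driven sketch in Python. The model below is a deliberate simplification and an assumption of this example, not the paper's simulator: each of the $n - z$ non-key workers takes exactly its mean time per packet, round-trip times are ignored, and a worker is handed a new packet the instant it finishes one. The function name and the numbers are illustrative.

```python
import heapq


def simulate_prac(packets_needed, mean_beta):
    """Time at which the master has collected packets_needed results.

    mean_beta[i] is the fixed per-packet time of the i-th non-key worker.
    """
    # Each heap entry is (finish time of the worker's current packet, worker index).
    heap = [(m, i) for i, m in enumerate(mean_beta)]
    heapq.heapify(heap)
    done = 0
    while True:
        t, i = heapq.heappop(heap)
        done += 1
        if done == packets_needed:
            return t
        heapq.heappush(heap, (t + mean_beta[i], i))  # worker i starts its next packet
```

With per-packet times `[2.0, 5.0, 5.0]`, the closed form gives $63 / (1/2 + 1/5 + 1/5) = 70$ for 63 packets, and `simulate_prac(63, [2.0, 5.0, 5.0])` also returns 70.0. For only 10 packets the closed form gives about 11.1 while the sketch returns 12.0, so the approximation is looser when few packets are needed.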
The reason is that the workers' efficiency increases with increasing $b$.

Received: 10 May 2020 Accepted: 15 December 2020