Performance and time improvement of LT code-based cloud storage

Outsourcing data to cloud storage services has attracted great attention due to the prospect of rapid data growth and storage efficiencies for customers. The coding-based cloud storage approach can offer a more reliable and faster solution with less storage space in comparison with replication-based cloud storage. LT codes are a well-known member of the rateless code family that can improve the performance of storage systems when good degree distributions are used. Since the degree distribution plays a key role in LT code performance, the recently introduced Poisson robust soliton distribution (PRSD) and combined Poisson robust soliton distribution (CPRSD) motivate us to investigate an LT code-based cloud storage system. We exploit LT codes with these new degree distributions to provide a lower average degree and higher decoding efficiency, specifically when fewer encoding symbols are received, compared with the popular robust soliton distribution (RSD). In this paper, we show that the proposed cloud storage outperforms traditional ones in terms of storage space and robustness against the unavailability of encoding symbols, because the properties of PRSD and CPRSD are compatible with the nature of cloud storage. Furthermore, a modified decoding process is presented, based on the behavior of the required number of encoding symbols, to reduce data retrieval time. Numerical results confirm the improvement in cloud storage performance.

reliability and availability in distributed storage systems: replication and coding. Replication is the most straightforward method, distributing copies of data over storage nodes. A high storage space requirement is the main drawback of replication-based storage, and it is not suitable for distributed systems with many concurrent access requests. Coding-based storage systems can achieve reliability many orders of magnitude higher than replication-based systems for the same redundancy level [2]. There has been a large body of work on erasure code-based distributed systems [3][4][5][6][7]. High decoding complexity and the communication cost of repairing a corrupted data fragment reduce the popularity of this solution. In order to regenerate a corrupted encoded packet, reconstruction of all the original packets is usually required. Thus, the communication cost of repair is equal to the size of the entire original data. The repair communication cost can be reduced by network coding-based storage, which works by combining encoded packets in healthy nodes. The use of Gaussian elimination decoding in network codes and optimal erasure codes makes them inefficient [8][9][10]. Near-optimal codes such as LT codes, with low-complexity encoding and decoding, have been proposed for reliable cloud storage systems. If near-optimal codes are used, a change in one original symbol alters only a few encoded symbols, whereas in a system based on optimal codes almost all encoded symbols may be affected by a small modification. Although the number of original packets may be smaller in previous methods, the decoding process is much faster than in other storage services [11,12]. In this paper, we investigate the performance of LT code-based cloud storage with two recent degree distributions in addition to the robust soliton distribution. We study the effects of different parameter values on successful data retrieval. We then propose a new algorithm to reduce the retrieval time of user data.
The rest of this paper is organized as follows. In Sect. 2, related works are reviewed briefly. In Sect. 3, we present LT codes and their degree distributions. Our system model is stated in Sect. 4. We discuss the performance analysis, including parameter selection, comparison of simulation results, and time improvement, in Sect. 5. Finally, in Sect. 6, we conclude the paper.

Related works
Coding has been widely used in cloud storage systems to improve their performance in different aspects. Local reconstruction codes (LRC), a subset of erasure codes, are introduced in [5], keeping storage overhead low compared with Reed-Solomon codes in Windows Azure Storage. LRC achieves a significant decrease in bandwidth and I/Os during repair and an improvement in latency for large I/Os. In [6], a distributed storage system provides fast content downloads by encoding contents with maximum distance separable (MDS) codes and applying fork-join queuing to user requests. Results show an essential trade-off between expected download time and storage space, which can be useful in the design of systems with delay constraints.
Exploiting LT codes with speculative access mechanisms for parallel writing and reading in a distributed storage architecture leads to high and robust performance [11,13,14]. Owing to the symmetric data redundancy and the rateless property of LT codes, the proposed system offers high flexibility in data access and an improvement over traditional parallel storage systems. In [12,15], a secure cloud storage service is designed with near-optimal LT codes to solve the reliability issue. The proposed scheme provides efficient data retrieval by exploiting the fast belief propagation decoding algorithm and also uses public integrity verification, which frees the data owner from the burden of being online. Employing exact repair minimizes data repair complexity and reduces cost. Although the performance analysis and experimental results show storage and communication costs equivalent to those of other erasure code-based systems, this secure cloud storage service achieves much faster data retrieval. Compared with network coding-based storage systems, the proposed service reduces storage costs and provides faster data retrieval with comparable communication costs.
In [16], an LT-based architecture was proposed for the back-end of block-level cloud storage (BLCS) that achieves sufficient levels of performance in terms of access and transfer, availability, integrity, and confidentiality. Attractive features of LT codes, such as low complexity and on-the-fly redundancy setting, make them suitable for the BLCS system. Results indicate that with appropriate system parameters a good compromise can be achieved, and the proposed BLCS outperforms traditional ones.
The main trade-off between file retrieval delay and successful decoding probability is investigated for a distributed cloud storage system in [17]. The proposed multi-stage user request scheme plays an efficient role in reducing the average retrieval delay. Solving an optimization problem for the optimal two-stage request scheme determines the proper number of packets to request in the first stage and yields a high decoding probability.

LT codes
LT codes [18] are the first class of universal fountain codes. These codes can potentially generate an unlimited number of encoded symbols, each produced by XORing a subset of the original symbols. For every encoded symbol, a degree d is chosen independently from a given degree distribution. LT codes can recover k original symbols from any $k(1+\epsilon)$ encoded symbols with probability $1-\delta$, where $\epsilon$ is known as the overhead, the number of encoded symbols is $k + O(\sqrt{k}\,\ln^2(k/\delta))$, and $\delta$ is the allowable failure probability of decoding. Belief propagation (BP), which depends on degree-1 encoding symbols, is used as an efficient decoding algorithm for these near-optimal codes [19]. In an LT code-based cloud storage system, a user file is first fragmented into k original symbols, and these original symbols are then encoded into n encoded symbols. We briefly describe the encoding and BP decoding procedures of LT codes for k = 4 and n = 5 in Figs. 1 and 2, respectively.
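The encoding and BP decoding steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this work: symbols are modeled as Python integers, the names `lt_encode`, `bp_decode`, and the `degree_sampler` callback are hypothetical, and the hand-built k = 4, n = 5 example at the end is illustrative only and is not the example of Figs. 1 and 2.

```python
import random


def lt_encode(source, n, degree_sampler, rng):
    """Return n encoded symbols as (neighbor index set, XOR of those source symbols)."""
    k = len(source)
    encoded = []
    for _ in range(n):
        d = degree_sampler(rng)                   # degree drawn from the degree distribution
        neighbors = set(rng.sample(range(k), d))  # d distinct original symbols
        value = 0
        for j in neighbors:
            value ^= source[j]
        encoded.append((neighbors, value))
    return encoded


def bp_decode(encoded, k):
    """Peeling (BP) decoder; entries left as None could not be recovered."""
    symbols = [[set(nb), val] for nb, val in encoded]  # private working copy
    recovered = [None] * k
    ripple = [s for s in symbols if len(s[0]) == 1]    # degree-1 symbols start the process
    while ripple:
        nb, val = ripple.pop()
        (j,) = nb
        if recovered[j] is not None:
            continue                                   # duplicate coverage, nothing new
        recovered[j] = val
        for s in symbols:
            if j in s[0] and len(s[0]) > 1:
                s[0].discard(j)                        # remove the known symbol ...
                s[1] ^= val                            # ... and XOR it out of the value
                if len(s[0]) == 1:
                    ripple.append(s)                   # newly released degree-1 symbol
    return recovered


# Hypothetical example with k = 4 original symbols and n = 5 encoded symbols.
source = [5, 9, 12, 7]
encoded = [({0}, 5), ({0, 1}, 5 ^ 9), ({1, 2}, 9 ^ 12), ({2, 3}, 12 ^ 7), ({0, 3}, 5 ^ 7)]
print(bp_decode(encoded, 4))  # [5, 9, 12, 7]
```

If no degree-1 symbol is available, the ripple is empty from the start and the decoder returns only `None` entries, which is why the share of degree-1 symbols in the degree distribution matters.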
The number of original symbols that are combined into an encoded symbol is known as the code degree, which is drawn from one of two common degree distributions. First, the ideal soliton distribution (ISD) ρ(i) is defined as follows
$$\rho(i) = \begin{cases} 1/k, & i = 1 \\ 1/\big(i(i-1)\big), & i = 2, \dots, k \end{cases} \tag{1}$$
One of the main goals in the design of a good degree distribution is control of the ripple size. The ripple is the set of covered original symbols that have not been processed yet. Retrieval is successful if all the original symbols are covered, while the process fails if the ripple becomes empty while some original symbols are still uncovered. The ripple size should be kept as small as possible to prevent redundant coverage of original symbols. On the other hand, the ripple should be large enough that it does not disappear before the end of the process. With ISD, the expected ripple size is one, which is too small and fragile against any variation. Although ISD therefore performs poorly in practice, it provides valuable insight for the design of new distributions. The main distribution of LT codes is the robust soliton distribution, denoted by µ(i). Let $R = c\,\ln(k/\delta)\sqrt{k}$ be the expected ripple size for some suitable constant c > 0. Define
$$\tau(i) = \begin{cases} R/(ik), & i = 1, \dots, k/R - 1 \\ R\,\ln(R/\delta)/k, & i = k/R \\ 0, & i = k/R + 1, \dots, k \end{cases} \tag{2}$$
Add ρ(i) to τ(i) and normalize to obtain µ(i)
$$\mu(i) = \frac{\rho(i) + \tau(i)}{\beta}, \qquad \beta = \sum_{i=1}^{k} \big(\rho(i) + \tau(i)\big) \tag{3}$$
With a good degree distribution, LT codes can perform well. Although the robust soliton distribution ensures with high probability that the ripple does not disappear during the decoding process and can achieve good performance for LT codes, newly introduced distributions can improve performance in terms of overhead and recovery probability, which is significant because cloud storage follows the "pay-as-you-use" paradigm.
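As a numerical companion to the ISD and RSD definitions, the sketch below builds both distributions, assuming the standard forms from [18]: ρ(1) = 1/k and ρ(i) = 1/(i(i − 1)) for i ≥ 2, τ(i) = R/(ik) below the spike position k/R, and τ(k/R) = R ln(R/δ)/k. Rounding k/R to the nearest integer is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def ideal_soliton(k):
    """ISD as a list indexed by degree; index 0 is unused and kept at 0."""
    rho = [0.0] * (k + 1)
    rho[1] = 1.0 / k
    for i in range(2, k + 1):
        rho[i] = 1.0 / (i * (i - 1))
    return rho


def robust_soliton(k, c, delta):
    """RSD: add tau to the ISD and normalize by beta (index 0 unused)."""
    R = c * math.log(k / delta) * math.sqrt(k)   # expected ripple size
    spike = max(1, min(k, int(round(k / R))))    # assumption: k/R rounded to an integer
    tau = [0.0] * (k + 1)
    for i in range(1, spike):
        tau[i] = R / (i * k)
    tau[spike] = R * math.log(R / delta) / k
    rho = ideal_soliton(k)
    beta = sum(rho[i] + tau[i] for i in range(1, k + 1))
    return [0.0] + [(rho[i] + tau[i]) / beta for i in range(1, k + 1)]


k = 250
rho = ideal_soliton(k)
mu = robust_soliton(k, c=0.08, delta=0.1)
print("ISD degree-1 probability:", rho[1])
print("RSD degree-1 probability:", mu[1])
print("RSD average degree:", sum(i * p for i, p in enumerate(mu)))
```

The extra mass that τ(i) places on degree 1 is what keeps the ripple from emptying early; the parameters c and δ control how much mass is added.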

Poisson robust soliton distribution
By combining the characteristics of the Poisson distribution (PD) and the robust soliton distribution (RSD), the recently introduced PRSD [20] can, with appropriate parameters, generate more degree-1 symbols than RSD. Thus, the successful retrieval probability of PRSD exceeds that of RSD at lower overhead. The improved PD (IPD) θ(i) defined in [20] depends on a positive constant λ. The proposed PRSD is then constructed from θ(i) and τ(i), just as RSD combines ρ(i) and τ(i). PRSD provides a lower average degree and a more limited range of degrees in comparison with RSD. Thus, we can achieve cloud storage with faster retrieval and higher successful decoding probability at lower overheads.

Combined Poisson robust soliton distribution
In order to reduce the overhead and the time consumed by the encoding and decoding processes, CPRSD was proposed in [21] by combining IPD and RSD. First, η(i) is obtained by normalizing θ(i). The CPRSD then combines η(i) with RSD through a parameter a, whose value lies between 0 and 1.

System model
To investigate the performance of LT codes with the new distributions in a cloud storage system, we consider a small-scale cloud with 15 storage nodes. First, we encode 20 user data files of various sizes with LT codes, and then the encoded data are distributed over the cloud storage in regional and multi-regional storage modes [22]. In regional mode, data are distributed over a region with at least two availability zones, and the region is selected based on minimum distance to the data owner's location. In multi-regional mode, at least two regions are selected; we assume that the first region is the nearest one and that the second region is selected at random, which can be equivalent to user options in cloud storage systems. To make our system model resemble real cloud storage systems and to study retrieval time, we consider an M/G/1 queue for every storage node [6] and a further M/G/1 queue for the head server, which directs retrieval requests to the corresponding nodes.
We inspect the successful decoding probability of LT code-based cloud storage with the PRSD and CPRSD degree distributions in two states. The first is a non-removal state of storage nodes, and the second is a removal state used to check the effect of inaccessibility or failure of nodes. The general structure of our system model is shown in Fig. 3.
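For the M/G/1 queues assumed at each storage node and at the head server, the mean queuing delay follows from the Pollaczek-Khinchine formula, W_q = λ_a E[S²] / (2(1 − ρ_q)), where λ_a is the arrival rate, S the service time, and ρ_q = λ_a E[S] the utilization (these symbol names are introduced here only for illustration). The sketch below is a minimal illustration; the function name and the numerical values in the example are hypothetical and are not the simulation settings of this work.

```python
def mg1_mean_wait(arrival_rate, service_mean, service_var):
    """Mean waiting time in queue for an M/G/1 system (Pollaczek-Khinchine).

    arrival_rate is in requests per unit time; service_mean and service_var
    describe the service time in the same time unit (and its square).
    """
    utilization = arrival_rate * service_mean
    if not 0 <= utilization < 1:
        raise ValueError("queue is unstable: utilization must be below 1")
    second_moment = service_var + service_mean ** 2   # E[S^2] = Var(S) + E[S]^2
    return arrival_rate * second_moment / (2 * (1 - utilization))


# Illustrative values: 0.1 requests/ms, mean service time 2 ms, variance 1 ms^2.
print(mg1_mean_wait(0.1, 2.0, 1.0))
```

Because the formula needs only the mean and variance of the service time, it matches a model in which nodes are characterized by exactly those two quantities.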

Parameter selection
In this paper, we study LT codes with k ∈ {100, 250, 500}. All simulations are run in MATLAB. The first step is selecting n, the number of encoded symbols required to recover the original symbols with high probability. As shown in Table 1, the values of n obtained for each k in our system model are close to the values derived from the empirical model for decodability in [17] and from the term $k + O(\sqrt{k}\,\ln^2(k/\delta))$ [18]. In the second step, the two main parameters of RSD and PRSD, c and δ, are determined; they play important roles in shaping the degree distributions. We investigate how c and δ affect the probability of successful retrieval in order to find a pair that suits both RSD and PRSD. As illustrated in Figs. 4 and 5 for k = 250, RSD reaches a higher successful decoding probability for δ in the range [0.1, 0.3] and for c equal to 0.08 or 0.1. In addition, PRSD performs better for c = 0.08 and δ in the range [0.01, 0.1]. Thus, we set c = 0.08 and δ = 0.1 to reach the highest overall performance for both RSD and PRSD; this choice also holds for the other values of k. This value of the decoding failure probability δ is reasonable in practice.
Finally, a, the fundamental parameter of CPRSD, is selected to represent the contributions of IPD and RSD in the new degree distribution. Successful decoding probability against a for different k and versus overhead for different a is displayed in Figs. 6 and 7, respectively. As shown in Fig. 6, higher successful decoding probabilities are achieved for a in the range [0.3, 0.5] for the various numbers of original symbols. Since varying a affects overhead as well as successful decoding probability, there is a trade-off: we seek the highest possible decoding probability while keeping the overhead as small as possible. We set a = 0.4 to reach an acceptable compromise between overhead and successful decoding probability, which means the additional contribution of RSD provides much improvement. Based on the analysis of expectations and the similarity of the mathematical properties of PD and BD when k < 20, we consider λ ≈ 3.04 [20].

Theoretical analysis
As mentioned before, exploiting PRSD and CPRSD leads to faster retrieval and higher successful decoding probability at lower overheads in comparison with RSD. We provide a brief theoretical analysis of several indicators to support these claims.
RSD is the combination of ρ(i) and τ(i), PRSD is constructed from θ(i) and τ(i), and the combination of all these distributions generates CPRSD. Therefore, the average degree, the number of degree-one symbols, the maximum degree, and the number of encoding symbols are studied approximately through the expectations of ρ(i) and θ(i) as follows.
First, one of the parameters that plays an essential role in the retrieval process is the average degree. The average degree of the encoding symbols should be as small as possible.
Based on the analysis in [20], we consider λ ≈ 3.04. As the number of original symbols k increases, the average degree of ρ(i) tends to H(k), which is bounded as ln(k) < H(k) < ln(k) + 1. According to the resulting bounds, which are stated for k ≥ 25, the average degree of PRSD and CPRSD is smaller than that of RSD for k ∈ {100, 250, 500}. A lower average degree means fewer XOR operations in the encoding and decoding processes, which can lead to faster retrieval.
The second parameter that shows the superiority of PRSD and CPRSD over RSD is the number of degree-one encoded symbols. PRSD and CPRSD provide a higher fraction of degree-one encoding symbols due to the nature of the Poisson distribution with appropriate parameter selection. Since the ripple is a set of degree-one encoding symbols in the decoding process, a large expected ripple size at the beginning of the process brings a higher successful decoding probability when the decoder receives fewer encoding symbols. The numbers of degree-one encoding symbols for the three distributions are compared in turn, and the comparison holds for all k > 2.
In addition, the maximum degree indicates that PRSD and CPRSD outperform RSD in terms of better retrieval at lower overheads and a less time-consuming decoding process. The degree distribution tends to zero as the degree approaches the maximum degree that can be generated. To study the maximum degree, we replace zero by a sufficiently small threshold ε′. For RSD, the maximum degree is obtained by taking ε′ ≤ 0.001 and using the fact that the degree is a positive integer. For PRSD, the maximum degree is found by determining the sign of the corresponding term: since the degree i is a positive integer and the right-hand expression is negative for ε′ ≤ 0.001 and λ ≈ 3.04, the left-hand expression must lie between −1 and 0 to guarantee both the maximum degree and negativity. Thus, $i_{\max} = 22$.
Next, for CPRSD, we apply the range of the parameter a to the corresponding term. As i! ≥ i(i − 1), we determine the sign of the denominator; in addition, ε′ − 1/(i(i − 1)) should be negative, and therefore $i_{\max} = 32$. Thus, the two newly defined distributions provide a smaller maximum degree than RSD.
The last parameter to investigate is the number of encoding symbols. For LT codes with RSD, the number of encoding symbols is $k + O(\sqrt{k}\,\ln^2(k/\delta))$ [18]. By neglecting the terms corresponding to the common distribution τ(i), the number of encoding symbols for PRSD and CPRSD can be approximated. The maximum of $\lambda^2 e^{-\lambda}/2$, which occurs at λ = 2, is close to the probability θ(2), and this is used to express the number of encoding symbols for PRSD and, in turn, for CPRSD.
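The average-degree argument in this section can be checked numerically. The sketch below computes the mean of the ISD, which equals the harmonic number H(k) exactly, and compares it with ln(k) and with the mean of a Poisson variable conditioned on being at least 1. The conditioned Poisson is only an assumed stand-in for the IPD θ(i) of [20], whose exact form is not reproduced here, and the function names are hypothetical.

```python
import math


def isd_average_degree(k):
    """Mean of the ISD: 1*(1/k) + sum_{i=2..k} i/(i*(i-1)), which equals H(k)."""
    return 1.0 / k + sum(1.0 / (i - 1) for i in range(2, k + 1))


def truncated_poisson_mean(lam):
    """Mean of Poisson(lam) conditioned on being >= 1 (assumed stand-in for the IPD)."""
    return lam / (1.0 - math.exp(-lam))


lam = 3.04
for k in (100, 250, 500):
    print(k,
          round(math.log(k), 3),                 # lower bound ln(k)
          round(isd_average_degree(k), 3),       # H(k)
          round(truncated_poisson_mean(lam), 3))
```

Under this assumption the Poisson-based mean stays fixed as k grows, while H(k) grows like ln(k), which is consistent with the claim that the Poisson-based distributions have the lower average degree for the values of k considered.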

Degree distributions comparison
In this section, we compare the successful decoding probability against overhead for RSD, PRSD, and CPRSD on a cloud storage system. Overhead is defined as follows
$$\epsilon = \frac{n - k}{k} \tag{35}$$
To study LT code-based cloud storage performance, we run the simulations in two nested phases, both to account for the random nature of LT encoding and decoding and to validate our results. The outer phase is repeated one hundred times, and each outer repeat contains one hundred repeats of the inner phase. The outer phase includes selection of the encoded symbol degrees, metadata generation, encoding by LT codes, and finally distribution over the cloud. The inner phase, run within every outer phase, comprises users' requests for various data from diverse geographical locations at random and the data retrieval process. As shown in Figs. 8, 9 and 10 for k ∈ {100, 250, 500} in the non-removal state of storage nodes, the performance of the considered distributions becomes closer as the number of original symbols increases. RSD needs more overhead to retrieve data successfully, which means longer retrieval time and more storage space. Thus, the successful decoding probability with CPRSD and PRSD exceeds that of RSD, in particular at lower overheads.
We study the behavior of the proposed cloud storage with the three distributions when encoding symbols are unavailable or lost. To this end, one data center is made unreachable at random in every simulation iteration. Figure 11 depicts the successful decoding probability for k = 250 in the removal state. Better robustness and higher successful decoding probability can therefore be achieved by applying CPRSD and PRSD for the various numbers of original symbols.
Fig. 12 Histogram of required number of encoding symbols for successful data retrieval for k = 250
Successful data retrieval at lower overheads, and also in the presence of partial loss of encoding symbols, is a notable attainment in cloud storage systems. Furthermore, it could lead to less storage space, shorter retrieval time, and lower cost for users and providers, and hence more satisfactory services.
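The two-phase simulation structure and the overhead definition ε = (n − k)/k can be sketched as follows. This is a skeleton only: `run_outer` and `run_inner` are hypothetical callbacks standing in for the encoding and distribution steps and for a single user request with retrieval, respectively, and the stubs in the example are placeholders rather than a model of the cloud.

```python
def overhead(n, k):
    """Overhead of n encoded symbols relative to k original symbols."""
    return (n - k) / k


def success_probability(run_outer, run_inner, outer_repeats=100, inner_repeats=100):
    """Estimate successful decoding probability over nested simulation phases."""
    successes = 0
    for _ in range(outer_repeats):
        stored = run_outer()              # degrees, metadata, LT encoding, distribution
        for _ in range(inner_repeats):
            if run_inner(stored):         # one random user request and retrieval attempt
                successes += 1
    return successes / (outer_repeats * inner_repeats)


# Placeholder stubs: every retrieval attempt succeeds.
print(overhead(300, 250))
print(success_probability(lambda: None, lambda stored: True))
```

Separating the two phases lets the randomness of the code construction (outer) and the randomness of user requests (inner) be averaged independently.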

Time improvement
We assume a scenario to compute data retrieval time. Retrieval time is defined as the sum of three terms: the queuing delay $T_w$, the data transmission time, and $T_{decoding}$, the mean decoding time observed in simulations.
As mentioned before, we consider an M/G/1 queue for every storage node and for the head data center. Our simulations run under the following assumptions. In every iteration, the mean service time of the data centers is selected at random from 2, 4, and 6 ms, and its variance from 1, 1.5, and 2 ms. The mean arrival rate is assumed to be 5 ms. Moreover, the arrival rate for the head data center is set to 10, and the mean and variance of its service time are taken as 1 and 3 ms, respectively. The factor r is assumed to be 30.03 Mbps, based on the global average download speed reported for mobile Internet in 2019 [23]. Generally, the LT decoder needs an undetermined number of encoded symbols from the storage system to retrieve data successfully. This uncertainty causes additional delay in LT code-based cloud storage systems. Thus, there is a compromise between successful decoding probability and retrieval delay. According to our observations during the decoding process, the number of encoding symbols required for successful data retrieval follows a normal distribution, as shown in the histogram for k = 250 in Fig. 12. Since retrieval time directly shapes the user experience of a cloud storage service, we design a scheme in which decoding is attempted only for numbers of encoding symbols lying within two standard deviations of the mean, instead of a blind search over a much larger interval. Figure 13 shows successful decoding probability against overhead after applying the proposed decoding process in the removal state.
As can be seen, the reduction in successful decoding probability is negligible, in particular for PRSD and CPRSD, whereas the time reduction is significant according to Table 2. Retrieval time using the proposed decoding process can be decreased by up to 70 percent with PRSD and 67 percent with CPRSD.
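The idea of restricting decoding attempts to symbol counts within two standard deviations of the mean can be sketched as below. This is a minimal illustration, not the decoding algorithm of this work: the function names are hypothetical, the sample counts are made-up values rather than measurements, and the use of the population standard deviation is an assumption.

```python
import math
import statistics


def decoding_window(required_counts, width=2.0):
    """Return (low, high) symbol counts within `width` standard deviations of the mean."""
    mean = statistics.mean(required_counts)
    std = statistics.pstdev(required_counts)      # assumption: population standard deviation
    low = max(0, math.floor(mean - width * std))
    high = math.ceil(mean + width * std)
    return low, high


def candidate_counts(window):
    """Symbol counts at which the decoder would attempt decoding."""
    low, high = window
    return range(low, high + 1)


# Hypothetical observations of the number of symbols needed in earlier retrievals.
observed = [270, 280, 290, 300, 310]
window = decoding_window(observed)
print(window)                         # (261, 319)
print(len(candidate_counts(window)))  # 59
```

For normally distributed counts, a two-standard-deviation window covers about 95 percent of cases, which is consistent with the small loss in successful decoding probability reported above.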

Conclusion
In this paper, we studied LT code-based cloud storage using newly designed degree distributions. Data retrieval is considerably more successful with PRSD and CPRSD than with the conventional solution, RSD, specifically at smaller overheads and in the presence of unavailability or loss of encoding symbols. Furthermore, we proposed a modified decoding algorithm to improve retrieval time. The performance analysis and experimental results show that the proposed LT code-based cloud storage system can provide higher successful decoding probability, less storage space, more robustness, and faster data retrieval.

Fig. 3 General structure of our LT code-based cloud storage

Fig. 4 Successful decoding probability with RSD for different c and δ

Fig. 5 Successful decoding probability with PRSD for different c and δ

Fig. 7 Successful decoding probability with CPRSD for k = 250 and different a

Fig. 13 Successful decoding probability with proposed decoding process for k = 250 in removal state

Table 1 Required number of encoding symbols for successful decoding

Table 2 Retrieval time comparison between main and proposed decoding process