Differentially Encoded LDPC Codes—Part I: Special Case of Product Accumulate Codes

Part I of a two-part series investigates product accumulate codes, a special class of differentially-encoded low density parity check (DE-LDPC) codes with high performance and low complexity, on flat Rayleigh fading channels. In the coherent detection case, Divsalar's simple bounds and iterative thresholds using density evolution are computed to quantify the code performance at finite and infinite lengths, respectively. In the noncoherent detection case, a simple iterative differential detection and decoding (IDDD) receiver is proposed and shown to be robust for different Doppler shifts. Extrinsic information transfer (EXIT) charts reveal that, with pilot symbol assisted differential detection, the widespread practice of inserting pilot symbols to terminate the trellis actually incurs a loss in capacity, and a more efficient way is to separate pilots from the trellis. Through analysis and simulations, it is shown that PA codes perform very well with both coherent and noncoherent detections. The more general case of DE-LDPC codes, where the LDPC part may take arbitrary degree profiles, is studied in Part II (Li, 2008).


INTRODUCTION
The discovery of turbo codes and the rediscovery of low-density parity-check (LDPC) codes have renewed the research frontier of capacity-achieving codes [1,2]. They also revolutionized coding theory by establishing a new soft-iterative paradigm, where long powerful codes are constructed from short simple codes and decoded through iterative message exchange and successive refinement between component decoders. Compared to turbo codes, LDPC codes boast a lower decoding complexity, a richer variety in code construction, and freedom from patents.
One important application of LDPC codes is wireless communications, where sender and receiver communicate through, for example, a non-line-of-sight land-mobile channel that is characterized by the Rayleigh fading model. It is well recognized that LDPC codes perform remarkably well on Rayleigh fading channels, provided that the carrier phase is perfectly synchronized and coherent detection is performed; but what if otherwise?
It should be noted that, due to practical issues like complexity, acquisition time, sensitivity to tracking errors, and phase ambiguity, coherent detection may become expensive or infeasible in some cases. In the context of noncoherent detection, the technique of differential encoding becomes immediately relevant. Differential encoding admits simple noncoherent differential detection, which solves phase ambiguity and requires only frequency synchronization (often more readily available than phase synchronization). Viewed from the coding perspective, performing differential encoding is essentially concatenating the original code with an accumulator, that is, a recursive convolutional code of the form $1/(1 + D)$.
In this two-part series of papers, we investigate the theory and practice of LDPC codes with differential encoding. We start with a special class of differentially encoded LDPC (DE-LDPC) codes, namely, product accumulate (PA) codes (Part I), and then we move to the general case where an arbitrary (random) LDPC code is concatenated with an accumulator (Part II) [3].
Product accumulate codes, proposed in [4] and depicted in Figure 1, are a class of serially concatenated codes, where the inner code is a differential encoder, and the outer code is a parallel concatenation of two branches of single-parity check (SPC) codes or a structured LDPC code comprising degree-1 and degree-2 variable nodes. Since the accumulator can also be described using a sparse bipartite graph, a PA code is, overall, an LDPC code. Alternatively, it may also be regarded as a differentially-encoded LDPC code, to emphasize the impact of the inner differential encoder. The reasons to study PA codes are multifold. First, PA codes exhibit an interesting threshold property and remarkable performance, and are well established as a class of "good" codes with rates ≥ 1/2 and performance within a few tenths of a dB from the Shannon limit [4]. Here, "good" is in the sense defined by MacKay [2]. Second, PA codes are desirable for their simplicity. They are simple to describe, simple to encode and decode, and simple enough to allow rigorous theoretical analysis [4]. Comparatively, a random LDPC code can be expensive to describe and expensive to implement in VLSI (due to the difficulty of routing and wiring). Finally, PA codes are intrinsically differentially encoded, which naturally permits noncoherent differential detection without needing additional components.
The primary interest is the noncoherent detection case, but for completeness of investigation and for comparison, we also include the case of coherent detection. Under the assumption that phase information is known, we compute Divsalar's simple bounds to benchmark the performance of PA codes at finite code lengths [5], and we evaluate iterative thresholds using density evolution (DE) to benchmark the performance of PA codes at infinite code lengths. The asymptotic thresholds reveal that PA codes are about 0.6 to 0.7 dB better than regular LDPC codes, but 0.5 dB worse than optimal irregular LDPC codes (whose maximal left degree is 50) on Rayleigh fading channels with coherent detection. Simulations of fairly long block lengths show a good agreement with the analytical results.
When phase information is unavailable, the decoder/detector will either proceed without phase information (completely blind), or entail some (coarse) estimation and compensation in the decoding process. We regard either case as noncoherent detection. The presence of a differential encoder in the code structure readily lends PA codes to noncoherent differential detection. Conventional differential detection (CDD) operates on two symbol intervals and recovers the information by subtracting the phase of the previous signal sample from that of the current signal sample. It is cheap to implement, but suffers as much as 4 to 5 dB in bit error rate (BER) performance [6]. Closing the gap between CDD and differentially encoded coherent detection generally requires the extension of the observation window beyond two symbol intervals. The result is multisymbol differential detection (MSDD), exemplified by maximum-likelihood (ML) multisymbol detection, trellis-based multisymbol detection with per-survivor processing, and their variations [7,8]. MSDD performs significantly better than CDD, at the cost of a considerably higher complexity which increases exponentially with the window size. To preserve the simplicity of PA codes, here we propose an efficient iterative differential detection and decoding (IDDD) receiver which is robust against various Doppler spreads and can perform, for example, within 1 dB from coherent detection on fast fading channels.
We investigate the impact of pilot spacing and filter lengths, and we show that the proposed PA IDDD receiver requires a very moderate number of pilot symbols, compared to, for example, turbo codes [6]. It is quite expected that the percentage of pilots directly affects the performance, especially on very fast fading channels, but much less expected is that how these pilot symbols are inserted also makes a huge difference. Through extrinsic information transfer (EXIT) analysis [9], we show that the widespread practice of inserting pilot symbols to periodically terminate the trellis of the differential encoder [6,7] inevitably incurs a loss in code capacity. We attribute this to what we call the "trellis segmentation" effect, namely, error events are made much shorter in the periodically terminated trellis than otherwise. We propose that pilot symbols be separated from the trellis structure, and simulation confirms the efficiency of the new method.
From analysis and simulation, it is fair to say that PA codes perform well both with coherent and noncoherent detection. In Part II of this series of papers, we will show that conventional LDPC codes, such as regular LDPC codes with uniform column weight of 3 and optimized irregular ones reported in the literature, actually perform poorly with noncoherent differential detection. We will discuss why, how, and how much we can change the situation.
The rest of the paper is organized as follows. Section 2 introduces PA codes and the channel model. Section 3 analyzes the coherently detected PA codes on fading channels using Divsalar's simple bounds and iterative thresholds. Section 4 discusses noncoherent detection and decoding of PA codes and performs EXIT analysis. Finally, Section 5 summarizes the paper.

Channel model
We consider binary phase shift-keying (BPSK) signaling ($0 \to +1$, $1 \to -1$) over flat Rayleigh fading channels. Assuming proper sampling of the outputs from the matched filter, the received discrete-time baseband signal can be modeled as $r_k = \alpha_k e^{j\theta_k} s_k + n_k$, where $s_k$ is the BPSK-modulated signal and $n_k$ is i.i.d. complex AWGN with zero mean and variance $\sigma^2 = N_0/2$ in each dimension. The fading amplitude $\alpha_k$ is modeled as a normalized Rayleigh random variable with pdf $p_A(\alpha_k) = 2\alpha_k \exp(-\alpha_k^2)$ for $\alpha_k > 0$, and the fading phase $\theta_k$ is uniformly distributed over $[0, 2\pi]$.
For fully interleaved channels, the $\alpha_k$'s and $\theta_k$'s are independent for different time indexes $k$. For insufficiently interleaved channels, they are correlated. We use Jakes' isotropic scattering land mobile Rayleigh channel model to describe the correlated Rayleigh process, which has autocorrelation $R_k = (1/2) J_0(2k\pi f_d T_s)$, where $f_d T_s$ is the normalized Doppler spread, and $J_0(\cdot)$ is the 0th order Bessel function of the first kind.
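To give a feel for how quickly the fading decorrelates, the following minimal sketch evaluates the autocorrelation $R_k = (1/2) J_0(2k\pi f_d T_s)$ for the two normalized Doppler spreads considered later in the paper. The function names `bessel_j0` and `jakes_autocorr` are illustrative choices, and $J_0$ is computed here from its standard integral representation $J_0(x) = (1/\pi)\int_0^{\pi} \cos(x \sin t)\,dt$ with a midpoint rule so that only the Python standard library is needed.

```python
import math

def bessel_j0(x, steps=2000):
    """0th order Bessel function of the first kind, computed from the integral
    (1/pi) * int_0^pi cos(x * sin(t)) dt with the midpoint rule."""
    h = math.pi / steps
    total = sum(math.cos(x * math.sin((i + 0.5) * h)) for i in range(steps))
    return total * h / math.pi

def jakes_autocorr(k, fd_ts):
    """Autocorrelation R_k = (1/2) * J0(2 * pi * k * fd_ts) at symbol lag k."""
    return 0.5 * bessel_j0(2.0 * math.pi * k * fd_ts)

# Compare the two Doppler spreads at a few symbol lags.
for fd_ts in (0.01, 0.001):
    row = [round(jakes_autocorr(k, fd_ts), 4) for k in (0, 1, 10, 50)]
    print(f"fdTs = {fd_ts}: R_k at lags 0, 1, 10, 50 -> {row}")
```

At a fixed lag, the smaller Doppler spread yields the larger correlation, which is why slower fading offers less time diversity within a block.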
Throughout the paper, $\theta_k$ is assumed known perfectly to the receiver/decoder in the coherent detection case, and unknown (and needs to be worked around) in the noncoherent detection case. Further, the receiver is said to have channel state information (CSI) if $\alpha_k$ is known (irrespective of $\theta_k$), and no CSI otherwise.

PA codes and decoding analysis
A product accumulate code, as illustrated in Figure 1(a), consists of an accumulator (or a differential encoder) as the inner code, and a parallel concatenation of 2 branches of single-parity check codes as the outer code. PA codes are decoded through a soft-iterative process where soft extrinsic information is exchanged between component decoders conforming to the turbo principle. The outer code, modeled as a structured LDPC code, is decoded using the message-passing algorithm. The inner code, taking the convolutional form of $1/(1 + D)$, may be decoded using either the trellis-based BCJR algorithm or a graph-based message-passing algorithm. The latter, thanks to the cycle-free code graph of $1/(1 + D)$, performs as optimally as the BCJR algorithm, but at several times lower complexity [4,10]. Thus, the entire code can be efficiently decoded through a unified message-passing algorithm, driven by the initial log-likelihood ratio (LLR) values extracted from the channel [4]. For Rayleigh fading channels with perfect CSI, that is, $\alpha_k$ is known $\forall k$, the initial channel LLRs are computed using $L_{ch}(s_k) = \frac{4\alpha_k}{N_0}\,\mathrm{Re}\{r_k e^{-j\theta_k}\}$ (1), and for Rayleigh fading channels without CSI, using $L_{ch}(s_k) = \frac{4E[\alpha]}{N_0}\,\mathrm{Re}\{r_k e^{-j\theta_k}\}$ (2), where $E[\alpha] = \sqrt{\pi}/2$ is the mean of $\alpha$. Due to space limitations, we omit the details of the overall message-passing algorithm, but refer readers to [4].
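As an illustration of how the initial channel LLRs might be formed, the sketch below assumes unit-energy BPSK, a phase-compensated real observation $r = \alpha s + n$ with noise variance $N_0/2$, and the convention LLR $= \log[p(r \mid s=+1)/p(r \mid s=-1)]$, which gives $4\alpha r/N_0$ with CSI and $4E[\alpha] r/N_0$ without. The helper names `rayleigh_amplitude` and `channel_llr` are hypothetical and are not taken from [4].

```python
import math
import random

MEAN_ALPHA = math.sqrt(math.pi) / 2.0  # E[alpha] of the normalized Rayleigh amplitude

def rayleigh_amplitude(rng):
    """Inverse-CDF sample from the pdf 2*a*exp(-a^2), a > 0, so that E[alpha^2] = 1."""
    return math.sqrt(-math.log(1.0 - rng.random()))

def channel_llr(r, n0, alpha=None):
    """Channel LLR of a phase-compensated real sample r.
    With CSI pass the true alpha; without CSI (alpha=None) the mean E[alpha] is used."""
    gain = MEAN_ALPHA if alpha is None else alpha
    return 4.0 * gain * r / n0

rng = random.Random(1)
n0 = 0.5                       # assumed noise level (Es = 1)
sigma = math.sqrt(n0 / 2.0)    # per-dimension noise standard deviation
alpha = rayleigh_amplitude(rng)
r = alpha * (+1.0) + rng.gauss(0.0, sigma)   # transmit s = +1
print("LLR with CSI   :", channel_llr(r, n0, alpha))
print("LLR without CSI:", channel_llr(r, n0))
```

The only difference between the two cases is the gain applied to the observation: the instantaneous amplitude when it is known, and its mean otherwise.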

COHERENT DETECTION
This section investigates the coherent detection case on Rayleigh fading channels. We employ Divsalar's simple bounds and the iterative threshold to analyze the ensemble average performance of PA codes, and simulate individual PA codes at short and long lengths.

Simple bounds
Union bounds are simple to compute, but are rather loose at low SNRs. Divsalar's simple bound is possibly one of the best closed-form bounds [5]. Like many other tight bounds, the simple bound is based on Gallager's second bounding technique [1]. By using numerical integration instead of a Chernoff bound and by reducing the number of codewords to be included in the bound, Divsalar was able to tighten the bound to overcome the cutoff rate limitation. Since the simple bound requires the knowledge of the distance spectrum, a hard-to-attain property especially for concatenated codes, it has not seen wide application. Here, the simplicity of PA codes permits an accurate computation of the ensemble-average distance spectrum (whose details can be found in [4]), and thus enables the exploitation of the simple bound.
The technique of the simple bound allows for the computation of either a maximum likelihood (ML) threshold in the asymptotic sense [4,5], or a performance upper bound with respect to a given finite length. Divsalar derived the general form of the simple bound on independent Rayleigh fading channels with perfect CSI. Following a similar line of reasoning, below we extend it to the case of non-CSI.

Gallager's second bounding technique
Gallager's second bounding technique sets the base for many tight bounds including the simple bounds [1]. It states that $\Pr(\text{error}) \le \Pr(\text{error}, \mathbf{r} \in \mathcal{R}) + \Pr(\mathbf{r} \notin \mathcal{R})$ (3), where $\mathbf{r} = \gamma\boldsymbol{\alpha}\mathbf{s} + \mathbf{n}$ is the received codeword ($N$-dimensional noise-corrupted vector), $\mathbf{s}$ is the transmitted codeword vector, $\mathbf{n}$ is the noise vector whose components are i.i.d. Gaussian random variables with zero mean and unit variance, $\gamma$ is the known constant (in modulation), $\boldsymbol{\alpha}$ is the $N \times N$ matrix containing fading coefficients ($\boldsymbol{\alpha}$ is an identity matrix for AWGN channels), and $\mathcal{R}$ denotes a region in the observed space around the transmitted codeword. To get a tight bound, optimization and integration are usually needed to determine a meaningful $\mathcal{R}$.

Divsalar's simple bound for independent Rayleigh fading channels with CSI
For Rayleigh fading channels, the decision metric is based on the minimization of the norm $\|\mathbf{r} - \gamma\boldsymbol{\alpha}\mathbf{s}\|$, where $\mathbf{s}$, $\mathbf{r}$, and $\boldsymbol{\alpha}$ are the transmitted signal, received signal, and the fading amplitude in vector form, respectively, and $\gamma$ is the amplitude of the transmitted signal such that $\gamma^2/2 = E_s/N_0$. For a good approximation of the error using (3), and for computational simplicity, the decision region $\mathcal{R}$ was chosen as an $N$-dimensional hypersphere centered at $\eta\gamma\boldsymbol{\alpha}\mathbf{s}$ and with radius $\sqrt{N}R$, where $\eta$ and $R$ are the parameters to be optimized [5].
When perfect CSI is available, the effect of fading can be compensated through a linear transformation on $\gamma\boldsymbol{\alpha}\mathbf{s}$. In particular, a rotation $e^{j\phi}$ and a rescaling $\zeta$ have been shown to yield a good and analytically feasible solution, which leads to the upper bound $P(e)$ on the error probability of an $(N, K, R)$ code [5]. The bound is expressed through the quantities $\delta$ and $\gamma_N(\delta)$ of (8) and (9), with a separate form for the bit error rate.

Extension of the simple bound to the case of no CSI
Another simple and reasonable choice of the decision region is an ellipsoid centered at $\eta\gamma\mathbf{s}$, which can be obtained by rescaling each coordinate of $\mathbf{r}$ so as to compensate for the effect of fading, where $\eta$ and $R$ are optimized. For independent Rayleigh channels without CSI, since accurate information on $\boldsymbol{\alpha}$ is unavailable, we resort to the expectation of the fading coefficient in (10), where $\mathbf{I}$ is an identity matrix. By replicating the computations described in [5], we obtain an upper bound on the bit error rate for independent Rayleigh channels without CSI, in which $\delta$ and $\gamma_N(\delta)$ are the same as in (8) and (9). Please note that this extension to the fading case with no CSI slightly loosens the simple bound, but it preserves the computational simplicity. A more sophisticated transformation may yield tighter bounds, but not necessarily a feasible analytical expression.
Figure 2 plots the simulated BER performance and the simple bound of a (1024, 512) PA code on independent Rayleigh fading channels with and without CSI. Since an optimal ML decoder is assumed and the ensemble-average distance spectrum is used in the computation, the simple bound represents the best ensemble-average performance, and may not accurately reflect the individual PA code being simulated. Nevertheless, we see that the bound is fairly tight. It provides a useful indication of the code performance at SNRs below the cutoff rate, and, at high SNRs, it joins with the union bound to predict the error floor.

Threshold computation via the iterative analysis
The ML performance bound evaluated in the previous subsection factors in the finite length of a PA code ensemble, but the assumption of an ML decoder may be optimistic. Below we account for the iterative nature of the practical decoder and compute an asymptotic iterative threshold using the renowned method of density evolution [11].
A useful tool for analyzing the iterative decoding process of sparse-graph codes, density evolution examines the probability density function (pdf) of the exchanged messages in each step and can, literally speaking, track the entire decoding process. In general, we are more interested in the asymptotic SNR thresholds, $\eta$, which are defined as the critical channel condition that is required for the decoding process to converge unanimously to the correct decision. In this definition, $y = \pm 1$ is the BPSK-modulated signal, and $f^{(l)}_{L_y}$ denotes the pdf of the LLR information on $y$ after the $l$th decoding iteration.
Tracking the density of the messages requires the computation of the initial pdf of the LLR messages from the channel, and the transformation of the message pdf's in each step of the decoding process. Although Gaussian approximation is reported to incur only very little inaccuracy on AWGN channels [12,13], the deviation is larger on fading channels, since the pdf of the initial LLRs from a fading channel looks different from a Gaussian distribution. Hence, exact density evolution is used to preserve accuracy.

Initial LLR pdf from the channel
Hou et al. showed in [14] that the pdf of the LLRs from independent Rayleigh channels with perfect CSI is given by (14) (assuming BPSK signaling and that the all-zero sequence is transmitted). Using integrals from [15], we further simplify (14) to (15). For the case when CSI is not available to the receiver, we assume that the Rayleigh-faded and AWGN-corrupted signals follow a Gaussian distribution in the most probable region. The pdf of the initial messages is then derived as (16).

Evolution of LLR pdf in the decoder
Tracking the evolution of the pdf's along the iterative process can either employ Monte Carlo simulation or, more accurately and more efficiently, proceed analytically through discretized density evolution. The latter is possible due to the simplicity of the code structure and of the decoding algorithm of PA codes. For a self-contained discussion, we summarize the major steps of the discretized density evolution of PA codes in the Appendix; for details, please refer to [4]. Using (15) for the perfect CSI case or (16) for the no CSI case (i.e., substituting them in (A.4) and (A.5) in the Appendix), the thresholds of PA codes on Rayleigh channels can be computed through (A.3) to (A.12) in the Appendix. The computed thresholds are a good indication of the performance limit as the code length and the number of iterations increase without bound.
Figure 3 plots the thresholds as well as the simulation results of PA codes on independent Rayleigh channels with and without CSI. We see that the analytical results are consistent with the simulation results for fairly large block sizes. Here, simulations are evaluated after the 50th iteration. As the block size and the number of iterations continue to increase, we expect the actual performance to converge to the thresholds.
Table 1 compares the thresholds of PA codes with those of LDPC codes for several code rates. The ergodic capacity of the independent Rayleigh fading channel is also listed as reference. We see that the thresholds of PA codes are about 0.6 dB from the channel capacity, and simulations of fairly large block sizes are about 0.3-0.4 dB from the thresholds. Compared to the thresholds of LDPC codes reported in [14], rate 1/2 PA codes are about 0.6-0.7 dB better (asymptotically) than (3, 6)-regular LDPC codes, but are about 0.5 dB worse (asymptotically) than irregular LDPC codes. It should be noted that these irregular LDPC codes are specifically optimized for Rayleigh fading channels and have a maximum variable node degree of 50. It is fair to say that PA codes perform on par with LDPC codes (using coherent detection).

Simulation with coherent detection
To benchmark the performance of coherently detected PA codes, several PA configurations are simulated on correlated and independent Rayleigh fading channels. In each global iteration (i.e., iteration between the inner decoder and the outer decoder), two local iterations of the outer decoding are performed. This scheduling is found to strike the best tradeoff between complexity and performance (with coherent detection).

Coherent BPSK on independent Rayleigh channels
Figure 4 shows the performance of rate 1/2 PA codes on independent Rayleigh fading channels with and without channel state information, respectively. Bit error rates after 20, 30, and 50 (global) iterations are plotted, and data block sizes from short to large (512, 1 K, 4 K, and 64 K) are evaluated to demonstrate the interleaving gain. For comparison purposes, the corresponding channel capacities are also shown. The simulated performance degradation due to the lack of CSI is about 0.9 dB, which is consistent with the gap between the respective channel capacities.
Compared to the (3, 6)-regular LDPC codes reported in [14], the performance of this rate 1/2 PA code with codeword length $N = 128 \times 1024 \approx 1.3 \times 10^5$ is about 0.4 and 0.25 dB better than regular LDPC codes of length $N = 10^5$ and $10^6$, respectively, on independent Rayleigh channels. It is possible that optimized irregular LDPC codes will outperform PA codes (as indicated by their thresholds), but among regular codes, PA codes seem to be one of the best.

Coherent BPSK on correlated Rayleigh channels
Figure 5 shows the performance of PA codes on correlated fading channels. Perfect CSI is assumed available to the receiver, and an interleaver exists between the PA code and the channel (to partially break up the correlation between the neighboring bits). Short PA codes with rates 1/2 and 3/4 are simulated on two common fading scenarios with normalized Doppler spreads $f_d T_s = 0.01$ and $0.001$, respectively. As expected, the performance deteriorates rapidly as $f_d T_s$ decreases, since a slower Doppler rate brings a smaller diversity order. Due to the interleaver between the PA code and the channel, the impact of a slow Doppler rate is less severe for larger block sizes than for smaller ones. Whereas the $K = 1$ K PA code loses about 7 dB at BER $= 10^{-4}$ as $f_d T_s$ changes from 0.01 to 0.001, the loss with the $K = 4$ K PA code is less than 5 dB.
To illuminate how well short PA codes perform on correlated channels, we compare them with turbo codes (which are the best-known codes at short code lengths) in Figure 5. The turbo code used for comparison has 16-state component convolutional codes whose generator polynomial is $(1, 35/23)_{\text{oct}}$ and which are decoded using the log-domain BCJR algorithm. The code rate is 0.75, the data block size is 4 K, and S-random interleavers are used in both codes to lower the possible error floors. Curves plotted are for PA codes at the 10th iteration and turbo codes at the 6th iteration. We observe that turbo codes perform about 0.6 and 0.7 dB better than PA codes for $f_d T_s = 0.001$ and $0.01$, respectively. However, it should be noted that this performance gain comes at the price of a considerably higher complexity. While the message-passing decoding of a rate-0.75 PA code at the 10th iteration requires about 267 operations per data bit [4], the log-domain BCJR decoding of a rate-0.75 turbo code at the 6th iteration requires as many as 9720 operations per data bit, a complexity 35 times larger. Hence, PA codes are still attractive for providing good performance at low cost.

NONCOHERENT DETECTION OF PA CODES
This section considers noncoherent detection. The channel model of interest is a Rayleigh fading channel with correlated fading coefficients.

Iterative differential detection and decoding
PA codes are inherently differentially encoded, which makes them convenient for noncoherent differential detection. Although multiple symbol differential detection is possible, for complexity concerns, we consider a simple iterative differential detection and decoding receiver, whose structure is shown in Figure 6. The IDDD receiver consists of a conventional differential detector with a 2-symbol observation window (the current and the previous), a phase tracking filter, and the original PA decoder (that used in coherent detection [4]). A trellis structure is employed to assist the detection and decoding of the inner differential code $1/(1 + D)$, but unlike the case of multiple symbol detection, the trellis is not expanded and has 2 states only. Soft information is passed back and forth among different parts of the receiver conforming to the turbo principle. Let $x$ denote the input to the inner differential encoder or the output from the outer code, and let $y$ denote the output from the differential encoder or the symbol to be put on the channel (see Figure 6). The differential encoder implements $y_k = x_k y_{k-1}$ for $x_k, y_k \in \{\pm 1\}$ (BPSK signal mapping $0 \to +1$, $1 \to -1$). The channel reception is given by $r_k = \alpha_k e^{j\theta_k} y_k + n_k$, where the channel amplitudes ($\alpha_k$'s) and phases ($\theta_k$'s) are correlated, and the complex white Gaussian noise samples ($n_k$'s) are independent.
In theory, differential decoding does not require pilot symbols. In practice, however, pilot symbols are inserted periodically even with multiple symbol detection, to avoid catastrophic error propagation in differential decoding. This is particularly so for the fast fading case where the phases ($\theta_k$) are changing rapidly (as will be shown later). Hence, some of the $r_k$'s (and $y_k$'s) in the received sequence are pilot symbols.
We use $L$ to denote the LLR information, superscript $(q)$ to denote the $q$th (global) iteration, and subscripts $i$, $o$, $ch$, and $e$ to denote the quantities associated with the inner code, the outer code, the fading channel, and "the extrinsic", respectively.

IDDD receiver
Here is a sketch of how the proposed IDDD receiver operates. In the first iteration, the switch in Figure 6 is flipped up. The samples of the received symbols, $r_k$, are fed into the conventional differential detector, which computes $u_k = \mathrm{Re}(r_k r^*_{k-1})$ and subsequently the soft LLR $L_{ch}(x_k)$ from $u_k$. Here $*$ denotes the complex conjugate. $L_{ch}(x_k)$ is then treated as $L^{(1)}_{e,i}(x_k)$ and fed into the outer decoder, which, in return, generates $L^{(1)}_{e,o}(x_k)$ and passes it to the inner decoder for use in the next detection/decoding iteration. Starting from the second iteration, the switch in Figure 6 is flipped down, and channel estimation for $\alpha_k$ and $\theta_k$ is performed before the "coherent" detection and decoding of the inner and outer code. After $Q$ iterations, a decision is made by combining the extrinsic information from both the inner and the outer decoders: $\hat{x}_k = \mathrm{sign}(L^{(Q)}_{e,i}(x_k) + L^{(Q)}_{e,o}(x_k))$. In the above discussion, we have ignored the existence of the random interleaver, but it is understood that proper interleaving and de-interleaving is performed whenever needed.
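The first-iteration operation can be illustrated with a small noiseless example. The sketch below implements the differential encoder $y_k = x_k y_{k-1}$ and the two-symbol detector $u_k = \mathrm{Re}(r_k r^*_{k-1})$, and checks that the sign of $u_k$ recovers $x_k$ even though the receiver never learns the channel phase. The reference symbol $y_0 = +1$, the constant channel gain, and the absence of noise are simplifying assumptions made only for this illustration; the function names are hypothetical.

```python
import cmath
import math
import random

def differential_encode(x, y_init=1):
    """y_k = x_k * y_{k-1} with x_k, y_k in {+1, -1}; y_init is an assumed reference symbol."""
    y, prev = [], y_init
    for xk in x:
        prev = xk * prev
        y.append(prev)
    return y

def differential_detect(r, r_init):
    """u_k = Re(r_k * conj(r_{k-1})); the sign of u_k is the hard decision on x_k."""
    u, prev = [], r_init
    for rk in r:
        u.append((rk * prev.conjugate()).real)
        prev = rk
    return u

rng = random.Random(7)
x = [rng.choice((+1, -1)) for _ in range(16)]
y = differential_encode(x)

theta = rng.uniform(0.0, 2.0 * math.pi)   # phase unknown to the receiver, constant over the block
gain = 0.8 * cmath.exp(1j * theta)        # assumed fading amplitude 0.8, no noise
r_init = gain * 1                         # received reference symbol
r = [gain * yk for yk in y]

u = differential_detect(r, r_init)
x_hat = [1 if uk >= 0 else -1 for uk in u]
print("recovered all symbols:", x_hat == x)
```

Because $r_k r^*_{k-1} = \alpha^2 y_k y_{k-1} = \alpha^2 x_k$ when the phase is constant over two symbols, the unknown phase cancels, which is the property the conventional differential detector relies on.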

Conventional differential detector for the first decoding iteration
With the assumption that the carrier phases are nearly constant between two neighboring symbols, the conventional differential detector (in the first iteration) computes $u_k = \mathrm{Re}(r_k r^*_{k-1})$. A hard decision on $x_k$ is obtained by simply checking the sign of $u_k$. Computing the soft information $L_{ch}(x_k)$ from $u_k$ requires knowledge of the pdf of $u_k$. The conditional pdf of $u_k$ given $\alpha_k$ and $x_k$, $f_{U|\alpha,X}(u \mid \alpha, x)$, is given in [16] as (17) in terms of the Marcum Q-function $Q(a, b)$. It is then possible to get the true pdf of $u_k$ using (18). Since the computation of the Marcum Q-function is slow and does not always converge at large values, an exact evaluation of (18), and hence the computation of $L_{ch}(x_k)$, can be difficult. We propose a simple approximation which evaluates (17) with $\alpha$ substituted by its mean $E[\alpha]$. This leads to (19), from which the corresponding LLR from the channel can then be computed. An even more convenient compromise is to assume that $u_k$ is Gaussian distributed, as is done in [17] and a few other papers. Under this Gaussian assumption, we get (21). Alternatively, instead of using the conventional differential decoding in the first iteration, a channel estimation followed by the decoding of the inner $1/(1 + D)$ code can be used, which makes the first iteration exactly the same as subsequent iterations. This third option then leads to pilot symbol assisted modulation (PSAM), which has slightly higher complexity than using differential detection in the first iteration.
To see how accurate the above treatments are, we plot in Figure 7 several curves approximating the pdf of $u_k$. From the sharpest and most asymmetric to the least sharp and most symmetric, these curves denote the exact pdf $f_{U|X}(u \mid x = +1)$ from Monte Carlo simulations (a histogram, which can be regarded as the numerical evaluation of (18)), the "mean-$\alpha$ approximated" pdf from (19), and the Gaussian approximated pdf from (21). From the figure, the Gaussian approximation does not reflect the true pdf well, but this inaccuracy turns out not to severely affect the overall IDDD performance. As shown later in Figure 13, all three treatments (Gaussian approximation, mean-$\alpha$ approximation, and PSAM) result in very similar decoding performance. We attribute this to the fact that the inaccuracy affects mostly the first iteration, and subsequent iterations can help mitigate the loss. Thus, Gaussian approximation still presents itself as a simple and viable approach for noncoherent differential decoding.

Channel estimator
The channel estimator in the IDDD receiver (Figure 6) may be implemented in several ways. Here we use a linear filter of $(2L + 1)$ taps to estimate the $\alpha_k$'s and $\theta_k$'s in the $q$th iteration, where $p_l$ denotes the coefficient of the $l$th filter tap, and $\hat{y}^{(q-1)}_k$ denotes the estimate of $y_k$ from the feedback of the previous iteration. For soft feedback, $\hat{y}^{(q-1)}_k = \tanh(L^{(q-1)}_{e,i}(y_k)/2)$, and for hard feedback, $\hat{y}^{(q-1)}_k = \mathrm{sign}(L^{(q-1)}_{e,i}(y_k))$. $L^{(q-1)}_{e,i}(y_k)$ is generated together with $L^{(q-1)}_{e,i}(x_k)$ by the inner decoder in the $(q - 1)$th decoding iteration (please refer to [4] for the step-by-step message-passing decoding algorithm of the $1/(1 + D)$ code). In the first iteration, the $L^{(0)}_{e,i}(y_k)$'s are initialized to zero for coded bits and to a large positive number (i.e., $+\infty$) for pilot symbols.
Regarding the choice of the filter, we take a Wiener filter, since it is known to be optimal for estimating the channel gain in the minimum mean-square-error (MMSE) sense when the correlations of the fading process, the $R_k$'s, are known [18]. The filter coefficients, $p_{-L}, p_{-L+1}, \ldots, p_L$, are obtained from the Wiener-Hopf equation (24), where $R_k = (1/2) J_0(2k\pi f_d T_s)$. Since the computation of the $p_l$'s from (24) involves a matrix inversion (a one-time job), it may not be computable when the matrix becomes (near) singular, which occurs when the channel fades very slowly. In such cases, a low-pass filter, or a simple "moving average", can be used [6].
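The "moving average" fallback can be sketched as follows. This is only an illustration under stated assumptions: the modulation is removed by multiplying each received sample by the fed-back symbol estimate (which is exact for hard $\pm 1$ feedback), all $2L + 1$ taps are given equal weight in place of Wiener coefficients, and the window is simply truncated at the block edges. The function name `moving_average_estimate` and the test channel are hypothetical.

```python
import cmath

def moving_average_estimate(r, y_hat, half_len):
    """Estimate h_k = alpha_k * exp(j * theta_k) by averaging r_i * y_hat_i over the
    (2 * half_len + 1)-sample window centered at k, truncated at the block edges."""
    n = len(r)
    estimates = []
    for k in range(n):
        lo, hi = max(0, k - half_len), min(n, k + half_len + 1)
        acc = sum(r[i] * y_hat[i] for i in range(lo, hi))
        estimates.append(acc / (hi - lo))
    return estimates

# Noiseless check on an assumed constant channel with perfect hard feedback.
h = 0.7 * cmath.exp(1j * 0.3)
y = [1, -1, -1, 1, 1, -1, 1, 1]
r = [h * yk for yk in y]
est = moving_average_estimate(r, y, half_len=2)
print("amplitude:", abs(est[3]), "phase:", cmath.phase(est[3]))
```

With soft feedback the magnitudes of the fed-back estimates are below one, so an equal-weight average of this kind tends to shrink the amplitude estimate; a practical design would account for that, for instance through the filter coefficients.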

EXIT charts
We perform EXIT analysis [9] to generate further insights into PA codes and the proposed noncoherent IDDD receiver. In EXIT charts, the exchange of extrinsic information is visualized as a decoding/detection trajectory, allowing the prediction of the decoding convergence and thresholds [9]. Several quantities, like the bit error rate, the mean of the extrinsic LLR information, and the equivalent SNR value, were previously used to depict the characteristics and relations of the component decoders, but the mutual information is shown to be the most robust among all [9]. The mutual information between the binary bit $y_k$ and its corresponding LLR values is defined in (25), where $L(Y)$ is either the a priori information $L_a(Y)$ or the extrinsic information $L_e(Y)$. The second equality holds when the channel is output symmetric such that $f_L(l \mid y = +1) = f_L(-l \mid y = -1)$, and the third equality holds when the received messages satisfy the consistency condition (also known as the symmetry condition) $f_L(l \mid y = +1) = f_L(-l \mid y = +1)\,e^{l}$ [11]. Note that the consistency condition is an invariant in the message-passing process on a number of channels including the AWGN channel and the independent Rayleigh fading channel with perfect CSI; but it is not preserved on fading channels without CSI or with estimated (thus imperfect) CSI, since the initial density function evaluated in the latter cases is but an approximation of the actual pdf of the LLR messages. Thus, (25) should be used to compute the mutual information in those cases. We use the X-axis to represent the mutual information to the inner code (a priori) or from the outer code (extrinsic), denoted as $I_{a,i}/I_{e,o}$, and the Y-axis to represent the mutual information from the inner code or to the outer code, denoted as $I_{e,i}/I_{a,o}$.
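When the consistency condition cannot be relied upon, the mutual information has to be evaluated from the conditional LLR densities themselves. The sketch below estimates it from LLR samples using histograms and the standard expression for equiprobable binary inputs, $I = \frac{1}{2}\sum_{y=\pm 1}\int f_L(l \mid y)\,\log_2\frac{2 f_L(l \mid y)}{f_L(l \mid +1) + f_L(l \mid -1)}\,dl$. The function name, the bin count, the clipping range, and the Gaussian test LLRs are assumptions made for illustration; a histogram estimate of this kind is slightly biased upward for small sample sizes.

```python
import math
import random

def mutual_information(llr_plus, llr_minus, bins=200, lo=-40.0, hi=40.0):
    """Histogram estimate of I(Y; L) for equiprobable y in {+1, -1}.
    llr_plus / llr_minus hold LLR samples observed when y = +1 / y = -1."""
    def histogram(samples):
        counts = [0] * bins
        width = (hi - lo) / bins
        for v in samples:
            clipped = min(max(v, lo), hi)            # clip outliers into the end bins
            idx = min(int((clipped - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(samples) for c in counts]

    p_plus, p_minus = histogram(llr_plus), histogram(llr_minus)
    info = 0.0
    for a, b in zip(p_plus, p_minus):
        for p in (a, b):
            if p > 0.0:
                info += 0.5 * p * math.log2(2.0 * p / (a + b))
    return info

def gaussian_llrs(mean, count, rng):
    """Assumed test input: consistent Gaussian LLRs with variance equal to twice the mean."""
    std = math.sqrt(2.0 * mean)
    plus = [rng.gauss(+mean, std) for _ in range(count)]
    minus = [rng.gauss(-mean, std) for _ in range(count)]
    return plus, minus

rng = random.Random(11)
i_low = mutual_information(*gaussian_llrs(0.5, 5000, rng))
i_high = mutual_information(*gaussian_llrs(4.0, 5000, rng))
print("weak LLRs:", round(i_low, 3), "strong LLRs:", round(i_high, 3))
```

Sweeping the a priori mutual information at the input of a component decoder and measuring the extrinsic mutual information at its output in this way produces one EXIT curve.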

Pilot symbol insertion
A practical issue in noncoherent detection is pilot insertion. The number of pilot symbols inserted should be sufficient to track the channel reasonably well, but not excessive. Many researchers have reported that excessive pilot symbols not only cause wasteful bandwidth expansion but actually degrade the overall performance, since the energy penalty for the rate loss due to excessive pilots outweighs the gain obtained from finer channel tracking. This trade-off has long been noted in the literature, but little attention has been given to another issue of no less importance, namely, how pilots should be inserted when differential encoding or another trellis-based coding/modulation front-end is used. There exist at least two ways to insert pilot symbols in a differential encoder. The widespread approach is to periodically terminate the trellis [6,7], as shown in Figure 8(a), such that pilot symbols are used to estimate the channel and at the same time participate in the trellis decoding. Although seemingly plausible, this turns out to be a bad strategy, since segmenting the trellis into small chunks significantly increases the number of short error events and consequently incurs a loss in performance.
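The energy side of this trade-off can be quantified with a one-line calculation. The sketch below assumes that pilot symbols carry the same energy as data symbols and that the pilot percentage is measured as bandwidth expansion relative to the no-pilot case; the function name is hypothetical.

```python
import math

def pilot_energy_penalty_db(expansion):
    """E_b/N_0 penalty in dB when pilots add a fraction `expansion` of extra symbols."""
    return 10.0 * math.log10(1.0 + expansion)

if __name__ == "__main__":
    for expansion in (0.04, 0.10, 0.20):
        print(f"{expansion:.0%} pilots: {pilot_energy_penalty_db(expansion):.2f} dB")
```

Under these assumptions, 4% pilot insertion costs about 0.17 dB, whereas 20% costs about 0.79 dB, a penalty that any gain from finer channel tracking would have to exceed.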
The negative effect of trellis segmentation is best illustrated by the EXIT chart in Figure 9. EXIT curves corresponding to the differential decoder with 0%, 4%, 10%, and 20% pilot insertion are plotted for two different SNR values. To eliminate the impact of other factors, the four curves in each SNR set are given the same energy per transmitted symbol, and perfect knowledge of the fading phase and amplitude is provided to all the decoders (irrespective of the number of pilot symbols). Thus the difference between the curves in each family is due only to the difference in pilot spacing. At the left end of the curves (when the input mutual information is small), a larger number of pilot symbols corresponds to better performance (a higher output mutual information). This is because, when little information is provided by the outer code, pilot symbols become the primary contributor of a priori information. However, the situation is completely reversed toward the right end of the EXIT curves. There, more pilot symbols actually degrade the performance: given sufficient information from the outer code, pilot symbols no longer constitute the key source of a priori information; instead, they segment the trellis and shorten error events, an effect opposite to spectrum thinning, and thus deteriorate the performance. The performance loss is more severe when more pilot symbols are inserted and when the code is operating at a relatively low SNR level. It is worth noting, for example, that with 20% pilot insertion (pilot spacing of 5), even when perfect mutual information is provided by the outer code ($I_{a,i} = 1$, but the channel remains noisy), the trellis decoder nevertheless fails to produce sufficient output mutual information $I_{e,i}$. As such, the inner EXIT curve is bound to intersect the outer EXIT curve at a rather early stage of the iterative process, causing the iterative decoder to fail at a high BER level (not to mention that this EXIT curve consumes 20% more energy than the no-pilot case).
The implication of this EXIT analysis is that the widespread approach of inserting pilot symbols as part of the trellis can be detrimental to differential encoding (and to other serially concatenated schemes with inner trellis codes). Specifically, unless the outer code is itself a capacity-achieving code at some SNR, the inner and outer EXIT curves will intersect, resulting in convergence failure and causing error floors. We observe that the more pilot symbols there are, the higher the error floor; and the lower the code rate (lower SNR), the more severe the impact. It is therefore particularly important to keep the number of pilot symbols in such schemes minimal, so that error floors do not occur too early. This analysis also suggests an alternative, and potentially better-performing, way of pilot insertion, namely, separating pilots from the trellis so that error events are not affected; see Figure 8(b).
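The two insertion strategies of Figure 8 can be contrasted with a small sketch. This is one possible reading of the figure rather than the authors' implementation: symbols take values in {+1, -1}, differential encoding multiplies the previous output by the current input, strategy (a) resets the encoder state to the known pilot at every pilot position, and strategy (b) multiplexes pilots into the stream without touching the encoder state. Function names and the `spacing` parameter are hypothetical.

```python
def diff_encode_terminated(data, spacing, pilot=1):
    """Strategy (a) sketch: each pilot is a known trellis state that restarts the encoder."""
    out, state = [], pilot
    for k, d in enumerate(data):
        if k % spacing == 0:
            out.append(pilot)  # known symbol that also terminates the trellis
            state = pilot
        state *= d             # differential encoding: out_k = out_{k-1} * data_k
        out.append(state)
    return out

def diff_encode_separated(data, spacing, pilot=1):
    """Strategy (b) sketch: pilots are multiplexed in; the encoder state carries on."""
    out, state = [], 1
    for k, d in enumerate(data):
        if k % spacing == 0:
            out.append(pilot)  # used for channel estimation only
        state *= d
        out.append(state)
    return out

def diff_decode(symbols, start=1):
    """Hard differential decoding of a pilot-free +/-1 sequence."""
    prev, data = start, []
    for s in symbols:
        data.append(s * prev)
        prev = s
    return data

if __name__ == "__main__":
    data = [1, -1, 1, 1, -1, 1, 1, -1]
    print(diff_encode_terminated(data, 4))
    print(diff_encode_separated(data, 4))
```

In strategy (b), removing the pilot positions leaves one unbroken differentially encoded sequence, so error events are not shortened; in strategy (a), the trellis is cut into independent segments of `spacing` transitions each.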
It should be pointed out that the impact of trellis segmentation may be very different for different outer codes. Many (outer) codes, including single parity check codes, block turbo codes (i.e., turbo product codes), and convolutional codes, will see a large impact, since these codes require sufficient input information in order to produce perfect output information; put another way, these codes alone are not "good" codes (good in the sense defined by MacKay in [2]). However, "good" codes like LDPC codes will likely see a much smaller impact. This is because an ideal LDPC code has an EXIT curve shaped like a box (e.g., see [3, Figure 3]), which can produce perfect output information as long as the input information is above some threshold (without requiring $I_{a,i} = 1$). Alternatively, one may interpret this as follows: ideal LDPC codes have large minimum distances and are capable of correcting short error events, including those caused by the segmentation effect.
To verify the analytical results, we simulate the performance of a rate-1/2 PA code with data block size $K$ = 32 K under different strategies of pilot insertion; see Figure 10. The normalized Doppler spread is $f_d T_s = 0.01$, and error rates are evaluated after 10 decoding iterations. Solid lines represent the cases where perfect channel knowledge is available to the receiver, and the dashed line represents the case where noncoherent detection is used. Comparing the solid curves, we see that a drastic performance gap results from the different strategies of pilot insertion. In this specific case, by segmenting the trellis every 10 symbols, trellis-segmented pilot insertion loses more than 3 dB at a BER of $10^{-4}$ compared with the alternative. The dashed curve corresponds to the same PA code noncoherently detected via the IDDD receiver discussed before, where 10% pilot symbols are inserted using the strategy in Figure 8(b) and an 81-tap Wiener filter is used to estimate the channel. It is interesting to note that if one overlooks the impact of pilot insertion strategies, one might arrive at the paradoxical result that noncoherent detection (dashed line) performs (noticeably) better than coherent detection (rightmost solid line)!

Impact of the pilot symbol spacing and filter length
We now investigate how the number of pilot symbols and the length of the estimation filter affect the performance of noncoherent detection. Figure 11 illustrates the impact of different pilot spacings on the BER performance on fast fading channels where the normalized Doppler spread takes $f_d T_s$ = 0.05, 0.02, or 0.01. We observe the following:
(1) The IDDD receiver is rather robust to different Doppler rates.
(2) Very small pilot spacing, such as fewer than 6 symbols, is undesirable, since the additional energy it consumes outweighs any gain it may bring.
(3) The code performance at high Doppler rates is more sensitive to pilot spacing than that at lower Doppler rates. At the normalized Doppler rate of 0.01 (already fast fading), noncoherently detected PA codes tolerate pilot spacing as small as 6 symbols and as large as 45 to 50 symbols (putting aside the bandwidth issue); but at the very fast Doppler rate of 0.05, pilot spacing beyond 7 to 9 symbols quickly causes drastic performance degradation.
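The contrast between the two Doppler rates in item (3) is in line with a sampling argument that is common in pilot-assisted schemes but is not stated in the text, and is offered here only as an assumption: the pilots sample a fading process whose bandwidth is $f_d$, so the pilot spacing should stay below $1/(2 f_d T_s)$ symbols. The function name below is hypothetical.

```python
def nyquist_pilot_spacing(fd_ts):
    """Upper bound on pilot spacing (in symbols) from sampling the fading at rate 2 f_d."""
    return 1.0 / (2.0 * fd_ts)

if __name__ == "__main__":
    for fd_ts in (0.05, 0.02, 0.01):
        print(f"f_d T_s = {fd_ts}: spacing below {nyquist_pilot_spacing(fd_ts):.0f} symbols")
```

This gives bounds of 10, 25, and 50 symbols for $f_d T_s$ = 0.05, 0.02, and 0.01, respectively, which is roughly where the reported degradation sets in.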
For comparison, we also plot the case where pilot symbols periodically terminate the trellis (dashed line), which, due to trellis segmentation, performs worse when the pilot spacing is small. Compared to differentially encoded turbo codes [6], PA codes appear to require fewer pilot symbols. (We note that in the study of differentially encoded turbo codes in [6], the authors terminated the trellis periodically with pilot symbols, which may have made the tolerable range of pilot spacing, at the small-spacing end, smaller than otherwise.)
The impact of the length of the channel tracking filter is also studied. We observe that while the filter length affects the overall performance, the impact is limited compared to that of pilot spacing. This is consistent with what has been reported in other studies [6] and is not a new discovery. Hence, we omit the plot.

Simulation results of noncoherent detection
The performance of noncoherently detected PA codes on fast Rayleigh fading channels is presented below. Unless otherwise indicated, the BER curves shown are after 10 global iterations, and in each global iteration, 4 to 6 local iterations of the outer code are performed. We have chosen these parameters on the basis of a set of simulations, trading off performance against complexity.

Noncoherent detection of PA codes with different receiver strategies
We compare the BER performance of four IDDD strategies for a $K$ = 1 K, $R = 3/4$ PA code on an $f_d T_s = 0.01$ Rayleigh fading channel in Figure 12. "IDDD-1" uses conventional differential detection with the Gaussian approximation (22) to compute $L_{ch}(x_k)$ in the first iteration, and soft feedback of $y_k$ in all iterations to assist channel estimation; "IDDD-2" uses conventional differential detection with the "mean-α" approximation (20) in the first iteration and soft feedback in all iterations; "IDDD-3" is PSAM with soft feedback; and "IDDD-4" is PSAM with hard feedback. In all cases, 4% pilot symbols are inserted and the curves shown are after 10 iterations. Different decoding strategies in the first iteration do not affect the performance much, and the performance is not very sensitive to hard or soft feedback either. Although not shown, simulations of a long PA code ($K$ = 48 K) of the same (high) rate ($R = 3/4$) reveal a similar phenomenon. It is possible, however, that other codes may be more sensitive to the difference in decoding strategies, especially the difference in the feedback information [6].
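The difference between soft and hard feedback can be made concrete with a short sketch. It assumes binary symbols in {+1, -1} and LLRs defined as the log-ratio of the probabilities of +1 and -1, so that the soft symbol estimate is $\tanh(L/2)$; the helper `raw_channel_samples`, which removes the modulation from the received samples before filtering, is a hypothetical illustration and not the receiver described in the paper.

```python
import math

def soft_feedback(llr):
    """Expected value of a +/-1 symbol given its LLR: tanh(L / 2)."""
    return math.tanh(llr / 2.0)

def hard_feedback(llr):
    """Hard decision on a +/-1 symbol given its LLR."""
    return 1 if llr >= 0 else -1

def raw_channel_samples(received, llrs, soft=True):
    """Multiply each received sample by the fed-back symbol estimate.

    For r_k = h_k * s_k + n_k with s_k in {+1, -1}, r_k * s_k is a noisy
    observation of h_k that a channel tracking filter can then smooth.
    """
    feedback = soft_feedback if soft else hard_feedback
    return [r * feedback(l) for r, l in zip(received, llrs)]

if __name__ == "__main__":
    received = [0.8, -0.5, 0.1]
    llrs = [6.0, -4.0, 0.2]
    print(raw_channel_samples(received, llrs, soft=True))
    print(raw_channel_samples(received, llrs, soft=False))
```

Soft feedback shrinks the contribution of unreliable symbols (small LLR magnitude) toward zero, whereas hard feedback gives every symbol full weight; once the LLRs are large in magnitude the two coincide, which is consistent with the small difference reported between "IDDD-3" and "IDDD-4".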

Comparison of noncoherent detection with coherent detection
Figure 13 shows the performance of rate-3/4 PA codes after 10 iterations on fast Rayleigh fading channels with Doppler rate $f_d T_s = 0.01$. A short block size of 1 K and a large block size of 48 K are evaluated. In each case, a family of 5 BER-versus-$E_b/N_0$ curves, accounting for the rate loss due to pilot insertion, is plotted. The three leftmost curves are the ideal coherent case with knowledge of fading amplitudes and phases provided to the receiver, and the two right curves are the noncoherent case where IDDD is used to track amplitudes and phases. In both the coherent and the noncoherent cases, trellis segmentation incurs a small performance loss, but since the pilot spacing is not very small (every 25 symbols), the effect is not as drastic as in Figure 10. The noncoherent cases are about 1 dB and 0.55 dB away from the ideal coherent case at a BER of $10^{-4}$ for block sizes of 48 K and 1 K, respectively. This satisfying performance is achieved with only 4% pilot insertion and a very low-complexity IDDD receiver.

CONCLUSION
Previous work has established product accumulate codes as a class of provably "good" codes on AWGN channels, with low linear-time complexity and performance close to the Shannon limit. This paper performs a comprehensive study of product accumulate codes on Rayleigh fading channels with both coherent and noncoherent detection. Useful analytical tools, including Divsalar's simple bounds, density evolution, and EXIT charts, are employed, and extensive simulations are conducted. It is shown that PA codes not only perform remarkably well with coherent detection, but that the embedded differential encoder also makes them naturally suitable for noncoherent detection. A simple iterative differential detection and decoding (IDDD) strategy allows PA codes to perform within about 1 dB of the coherent case. Another useful finding is that the widespread practice of inserting pilot symbols to terminate the trellis actually incurs a performance loss compared to inserting pilot symbols separately from the trellis.
We conclude by proposing product accumulate codes as a promising low-cost candidate for wireless applications. The advantages of PA codes are as follows: (i) they perform very well with coherent and noncoherent detection (especially at high rates); (ii) their performance is comparable to that of turbo and LDPC codes, yet PA codes require much less decoding complexity than turbo codes and much less encoding complexity and memory than random LDPC codes; and (iii) the regular structure of PA codes makes low-cost hardware implementation possible.

Figure 3: Thresholds computed using density evolution and simulations (data block size K = 64 K).


Figure 5: Performance of PA codes on correlated Rayleigh fading channels with CSI. Data block length 4 K, normalized Doppler rate $f_d T_s$ = 0.01, 0.001, rate of PA codes 0.5 and 0.75, rate of turbo codes 0.75, component codes of the turbo code $(1, 35/23)_{\text{oct}}$, 10 iterations for PA codes, and 6 iterations for turbo codes.

Figure 6: Structure of iterative differential detection and decoding receiver.

Figure 8: Trellis diagram of binary differential PSK with pilot insertion. (a) Pilot symbols periodically terminate the trellis. (b) Pilot symbols are separated from the trellis structure.

Figure 9: The effect of pilot symbols segmenting the trellis on the performance of the differential decoder. Normalized Doppler rate $f_d T_s = 0.01$, $E_s/N_0$ = 4.75 dB and 0 dB, perfect CSI.

Figure 11: Effect of the number of pilot symbols on the performance of noncoherently detected PA codes on correlated Rayleigh channels with $f_d T_s = 0.01$. Code rate 0.75, data block size 1 K, filter length 65, 10 (global) iterations, 4 (local) iterations within the outer code of PA codes.

Figure 12: Comparison of BER performance for several noncoherent receiver strategies on correlated Rayleigh channels with $f_d T_s = 0.01$. Code rate 0.75, data block size 1 K, 4% bandwidth expansion, filter length 65, 10 (global) iterations each with 4 (local) iterations for the outer decoding.

Figure 13: Comparison of BER performance for several transmission/reception strategies for PA codes of large and small block sizes on correlated Rayleigh channels with $f_d T_s = 0.01$. Code rate 0.75, data block sizes 48 K and 1 K, 4% bandwidth expansion, filter length 65, 10 (global) iterations each with 4 (local) iterations for the outer decoding.
Figure 10: Performance of PA codes with different pilot insertion strategies. Normalized Doppler rate $f_d T_s = 0.01$, code rate 0.5, data block size 32 K, 0% or 10% pilot insertion, 10 iterations.