EURASIP Journal on Wireless Communications and Networking 2004:1, 113–122 c ○ 2004 Hindawi Publishing Corporation A Low-Complexity Blind Multiuser Receiver for Long-Code CDMA

Receivers for long-code systems are for computational reasons usually based on simple matched-ﬁlter techniques, and hence su ﬀ er from multiaccess interference. Decorrelating RAKE and MMSE receivers do not have this problem but have not been widely studied due to the apparent complexity of the inversion of a large code matrix. Tong, van der Veen Dewilde, and Sung (IEEE Tr. Signal Proc., 2003) derived a blind decorrelating RAKE receiver (DRR) and channel estimation algorithm for long-code CDMA systems, and showed how it can be e ﬃ ciently implemented. In this paper, we continue on that work. We propose both single-user and multiuser blind source-channel estimation algorithms by making use of an iterative estimation scheme initialized by the DRR. Simulation results show signiﬁcant improvement, even in heavily loaded systems. Moreover, with an implementation based on time-varying system theory, the proposed algorithm can be implemented e ﬃ ciently at a cost similar to the RAKE.


INTRODUCTION
Long-code (or aperiodic-code) DS-CDMA systems are currently being used in the IS-95 mobile communication network standard and have been adopted in several thirdgeneration standards such as UMTS.Originally, the receivers proposed for such systems were based on the RAKE structure, that is, banks of matched filters which correlate the received data with the desired user's code, followed by a combining of the outputs (RAKE fingers).Since multiuser interference is not completely cancelled, the performance is degraded, especially when the network is heavily loaded and power-control imperfect.It is therefore interesting to look at multiuser receivers.
Channel estimation and multiuser detection for longcode wideband CDMA have not seen the same levels of attention as their short-code equivalents, yet have been considered by a number of authors and are receiving renewed interest.A first classification of the available literature can be made according to the assumptions posed on the scenario.
Here we consider wideband channels, for which equalization is needed.
(ii) Uplink versus downlink scenarios.We will consider only the uplink.The downlink case is different because users are perfectly synchronized, orthogonal, and with the same propagation channel, and only a single user needs to be decoded.(iii) Synchronous and asynchronous transmissions.We consider the asynchronous case.(iv) Training-based channel estimation algorithms versus blind algorithms.We consider the blind case.
The complexity of the problem greatly depends on these assumptions.For example, in the case of synchronous transmissions and delay spreads of at most a few chips, the receiver can drop the samples that have intersymbol interference (ISI) [1,2,3,4].This decouples the problem and allows symbolby-symbol processing.
For asynchronous systems, Buzzi and Poor [5,6] consider nonblind channel estimation using training symbols for all users; they also consider sequential interference cancellation (SIC) techniques with a complexity quadratic in the code length/processing gain (the algorithm proposed in this paper has linear complexity).With known or iteratively estimated symbols, the channel estimation step in [5] and also [7,8] is comparable to our scheme.In these papers, a large matrix inversion with a complexity cubic in the number of users and processing gain is avoided by iterative techniques (gradient descent), leading to a quadratic complexity.
Blind techniques based on second-order moment matching (i.e., stochastic techniques) have appeared in [9,10,11,12,13,14,15].These rely on the convergence of time averages, which often requires hundreds of symbols.Other approaches are based on iterative optimization of a likelihood function [16,17], which tends to have a very high complexity.Several other approaches are valid only for the downlink, for example [1]; see also [18] which contains an extensive reference list.
The algorithms in this paper continue on the blind multiuser joint symbol-channel estimation techniques in [19,20] and can be called deterministic, since no statistical model of the sources is assumed.In these papers, Tong et al. considered an uplink decorrelating RAKE receiver (DRR) algorithm where the base station knows all codes.By constructing and inverting a code matrix, a blind decorrelating RAKE and MMSE receiver was derived to estimate the channel and desired user symbols, based on all samples in a frame.After the decorrelating step, the users are treated independently, which is computationally advantageous but gives suboptimal performance when compared to an informed multiuser MMSE receiver.There are two reasons for this.Firstly, due to the code inversion, the noise becomes correlated among symbols and users.This reduces the performance of the subsequent single-user estimation and detection step.A second and more important reason is that code inversion followed by channel inversion is suboptimal, and gives more noise enhancement than the inversion of the product of the code and channel matrices.In this paper, we take these effects into account.
We propose to use the single-user channel estimates from the DRR as an initial point for an iterative symbol/channel estimation algorithm which also considers the noise correlations.This can be done on a per-user basis, or, with better performance, jointly in a multiuser fashion.In heavily loaded systems, this algorithm shows a significant improvement over the current decorrelating RAKE receiver and the conventional RAKE receiver.
The proposed multiuser algorithm is by itself not a very surprising result.Similar iterative receivers are known for short-code (periodic-code) CDMA systems, for example, the (parallel interference cancellation PIC) receivers, and for long-code CDMA an iterative blind receiver that appears to be related to ours was proposed in [8].Such receivers usually act on symbol-by-symbol data, whereas the proposed algorithm acts on a slot of data (M symbols).What is new here is the observation that the blind DRR (or the related blind RAKE receiver) provides a very good initial point for the iteration, and the observation that an efficient implementation for the algorithm is possible.A direct implementation has a complexity that grows with M 3 , and would soon be prohibitive.However, the matrices to be inverted are sparse and structured (they are related to a band matrix after permutations).As in [20], we consider the use of time-varying state space theory developed by Dewilde and van der Veen [21] to implement matrix multiplications, QR-factorizations, and matrix inversions. 1 We will demonstrate that the resulting complexity of the iteration is similar to that of the DRR, that is, linear in the number of transmitted symbols M and linear in the code length (coding gain) G.For large M, the complexity is of order GK per estimated symbol per user, where K is the number of users.The conventional RAKE receiver has complexity GL per estimated symbol per user, where L is the channel length in chips.Hence, the proposed algorithm is not much more complex, and certainly feasible. 2 preliminary version of this paper was presented at SPAWC '03 [23].The present version offers additional simulations and a detailed complexity count.The outline of the paper is as follows.Section 2 gives the data model and describes the blind receiver algorithm from [20].Section 3 derives the proposed algorithms, in both multiuser and singleuser fashions.Section 4 derives the complexity of the algorithms, and Section 5 shows the performance by simulations.Finally, Section 6 gives the conclusions.

Data model
We consider the same data model as in [20].The context is the uplink of a slotted system with K asynchronous users.In a slot, the ith user transmits a vector s i consisting of M i symbols s ik .Each symbol s ik is spread by an aperiodic code (vector) c ik of length G i .After multipath propagation over a channel with length L i chips and relative delay D i (asynchronism), pulse-shaped matched filtering and chip-rate sampling, the receiver stacks the received samples in a slot in a vector y.The contribution of s ik to y is a linear combination of the transmitted signal c ik s ik , plus delays of it, properly scaled by the L i channel coefficients collected in a vector h i , or which is illustrated in Figure 1a.T ik is a Toeplitz matrix whose L i columns consist of shifts of the code vector c ik .Including all K users and the noise, we have where the ih user's code matrix is T i := [T i1 , . . ., T i,Mi ], the channel matrix H is block diagonal with I ⊗ h i as the ith block, vector s is a stacking of all symbol vectors of all users, as illustrated in Figure 1b, and w is a vector representing the additive Gaussian noise.
For convenience, we will usually consider the case of users with equal parameters, but the general case is certainly not ruled out.
In the derivations of the algorithms, we will make the following assumptions.
(A1) The code matrix T is known.This implies that the receiver knows the codes, the user delay offsets D i , and the number of paths L i of all users.(A2) TH is tall and full column rank, which (for users with equal parameters) implies K < G, that is, the number of users is less than the processing gain.We will also require another matrix to be tall (TS in ( 18)), which will imply KL < MG.For initialization using the DRR, we need to require moreover that T is tall and full column rank, which implies KL < G (for users with equal parameters).(A3) The noise w is white Gaussian, with unknown variance The problem we consider is to find, given the code matrix T and the received data vector y, good estimates of all users' source symbols s and all channel coefficients h, where is the stacking of all users' channels h i .

Decorrelating RAKE Receiver (DRR) algorithm
As introduced in [20], the decorrelating RAKE receiver (DRR) algorithm first applies a decorrelating matched filter, or T † = (T H T) −1 T H , to the vector of received data y.This removes all multiuser interference.The output of the decorrelating matched filter is given by where n = T † w is a colored noise vector.The new noise covariance matrix is Since H is block diagonal, the filter output can be separated into individual user contributions.Split u into K segments u i , one for each user, then By unstacking the vector u i into a matrix U i , we obtain the model The channel estimation proceeds by taking a rank-1 decomposition of U i , via a singular value decomposition.The dominant left singular vector is an estimate of h i , and the corresponding right singular vector determines the symbols s i up to an unknown scaling.Since the noise N i is not white, a prewhitening can improve the decomposition [20]; unfortunately, it is not possible to prewhiten each column of U i separately because it would destroy the rank-1 property.
A blind RAKE receiver is obtained in a similar way, but by setting u = T H y in (6).
With an initial channel estimate h (0) obtained in this way, it was also briefly mentioned in [20] that further refinements can be obtained in a two-step iterative fashion, that is, an alternating least squares algorithm similar to the ILSP algorithm [24].Based on (9), (1) given Subsequently round the entries of s (k)   i to the nearest elements of the alphabet; (2) keeping s (k)  i fixed, solve Although this algorithm was proposed in [20], its performance was not shown.

Discussion
To simplify the initial estimation of the channel, the preceding derivation from [20] ignored most of the information on the noise covariance matrix R n , namely the noise correlations among the users, and the symbol-by-symbol temporal correlations.Also the iterative refinement did not take any noise correlation properties into account.Our aim will be to improve the estimation by taking the complete noise model into account.As it turns out, the elegant rank-1 channel estimation property is hard to generalize.However, using the DRR or the blind RAKE to obtain an initial channel estimate, we can improve the estimates by straightforward multiuser twostep iterations, discussed in the next section.

JOINT SOURCE-CHANNEL ESTIMATION
Our derivations will use the following lemma.

Single-user estimation with noise whitening
Consider the single-user model (8).The covariance of the noise n i is denoted by (R n ) i , and is known: where ñi is white noise.Using the lemma, we can now introduce a similar alternating LS algorithm to estimate s i and h i in turns, for each user i separately.
(1) Given h (k−1) i , solve Subsequently, round the entries of s (k)   i to the nearest elements of the alphabet.(2) Keeping s (k)  i fixed, solve In comparison to the original single-user iterative algorithm, the performance is expected to be better, since the noise correlations of the data vector are taken into account.On the other hand, correlations among users are still ignored.Also, the noise enhancement due to the preprocessing with T † is not avoided.

Iterative multiuser estimation
Compared to the single-user estimation algorithms, it is known that joint detection algorithms can achieve significant performance gains, at the expense of increased complexity.We will derive such an algorithm in this section, then verify its complexity in the next section.
Consider the original data model in (2).We can formulate the channel/data estimation problem as a typical least squares problem: find h and s to minimize y−THs 2 , where H = diag(I ⊗ h 1 , . . ., I ⊗ h K ).In the presence of white Gaussian noise, this LS cost function is also optimal in a maximum likelihood sense.
Before we show the iteration, we use the lemma to rewrite the cost function also as a function of h, that is, y − TSh 2 , where The structure of S is shown in Figure 2b.With a good initial channel estimate, say h (0) , we can use the following iteration to improve the estimate.For iteration index k = 1, 2, . . .until convergence, (1) keeping the channel h (k−1) fixed, solve Subsequently, round the entries of s (k)   i to the nearest elements of the alphabet; (2) keeping the source symbols s (k) fixed, solve After the iterations, step 1 is repeated once more to get the final estimate of the source symbols.Assuming the decisions are correct, the algorithm will approach the multiuser Linear MMSE solution with the channel estimated from completely known symbols.
Although written differently, the second estimation step is similar to other batch training-based techniques proposed for long-code CDMA (cf.[5,8]).
As an alternating projection algorithm, it is known that it will converge monotonically to a local optimum.Generally, the algorithm only completely converges after a number of iterations.However, with an initial estimate of the channel provided by the DRR or the blind RAKE discussed in Section 2.2, the algorithm rapidly converges with only one iteration.Because in this formulation the noise is not colored, the final estimates can be much better than those of the initial single-user algorithms that have to work with incomplete noise models.
Apart from this, a second reason why this algorithm is expected to have better performance is that it uses inverses (TH) † and (TS) † of taller matrices, whereas the previous algorithm implicitly worked with H † T † for computing the symbol estimates.While H † T † is a valid left inverse of TH, it is not the minimum-norm left inverse, hence it can give unnecessary noise enhancement.
Another advantage is that the algorithm's performance can still be stable even when T is not tall, that is, in heavily loaded cases.In that case, the algorithm needs to be initialized by the blind RAKE channel estimation algorithm (i.e., use T H rather than T † in ( 6)).

Multiple receive antennas
In the near future, many base stations will be equipped with multiple antennas.We indicate how the two-step iteration have to be modified to take this into account.The multiantenna version for DRR was shown in [20].
Consider a case where d receive antennas are used.No structure is imposed on this antenna array.Let y j , H j , and w j be the received vector, the channel matrix, and the noise vector for the jth antenna, respectively.Applying the identity TH j s = TSh j , we have the two versions of the data model where h j is the stacking of all channel vectors for the jth antenna.
In the first step of the iterative algorithm, where source symbols are estimated from known channel vectors using (19), we need to apply the inverse of [(TH 1 ) T (TH 2 ) T • • • (TH d ) T ] T to the data vector.Since this matrix is d times taller than before, its conditioning is expected to be much better so that the estimation of s is significantly improved.In the second step, estimating the channels from known source symbols using (19), the matrix to be inverted, I d ⊗ (TS), has the same conditioning as the matrix (TS) in the single-antenna case.Actually, each channel is estimated independently from the source symbols, which means that no gain is obtained in this step.However, since the symbols are estimated at higher accuracy, the overall performance improvement over the single antenna case is significant, even after only one iteration.

COMPUTATIONAL COMPLEXITY
In this section, the computational complexity of the two-step iterative algorithm is discussed.In summary, one iteration of the algorithm consists of the following steps: (1) given the channel coefficients h, estimate the source symbols s by solving y = THs + w, (2) with known source symbols s, estimate the channel coefficients h by solving y = TSh + w.
For simplicity of expressions, all users are assumed to have equal parameters.We compute the complexity of three implementations: a direct one, one that exploits the sparse structure of T (many zero entries), and one that uses this sparse structure and the fact that the nonzero entries occur in bands.

Direct computation
T has size GM × MKL, whereas H : MKL × MK and S : MKL×KL.Therefore, computation of and similarly computation of T := TS (size The computation of ŝ := (T ) † y can be implemented in two ways.
(1) Via (T H T ) −1 •T H y. The computation of T H •T costs order-GM(MK) 2 operations, inversion of this matrix costs (MK) 3 operations, computation of T H y costs GM • MK operations, application of (T H T ) −1 to this vector costs another (MK) 2 .The total cost is of order Computation of the QR-factorization costs order GM(MK) 2 , computation of v costs order GM • MK, backsubstitution costs order (MK) 2 .The total cost is of order Similarly, the complexity of ĥ = (T ) † y is, (1) via (T H T ) −1 • T H y, order GM(KL) 2 + (KL)3 , (2) via QR-factorization of T = QR, order GM(KL) 2 .

Computation using sparse structure of T, H, and S
In the direct computation, we did not recognize the fact that many entries of T, H, and S are zero.Each row of T has only KL nonzero entries, whereas H and S are block diagonal and a permutation of a block-diagonal matrix, respectively.Exploiting this, the computation of T := T • H costs order-GMKL operations, and also the computation of T := TS costs order GMKL.In the latter case, we can also recognize the fact that these are integer operations (the entries of T and S are typically ±1 or some other finite alphabet).
In the computation of ŝ := (T ) † y using the sparse structure of T , we cannot use the technique via QR-factorization because it destroys the structure.Each row of T has only K nonzero entries, each column has G nonzero entries.Via (T H T ) −1 • T H y, the computation of T H • T costs order-G(MK) 2 operations, inversion of this matrix still costs (MK) 3 operations, computation of T H y costs GMK operations, and the application of (T H T ) −1 to this vector costs (MK) 2 .The total cost is of order G(MK) 2 + (MK) 3 .
Unfortunately, this direct computation cannot use backsubstitution, hence the complete matrix (T H • T ) −1 is formed even if it is applied only to a single vector.There are iterative techniques (e.g., conjugate gradient, cf. the channel estimation techniques reported in [5,7]) that compute an approximation to the result; they have complexity of order (MK) 2 .The total complexity would then be G(MK) 2 + (MK) 2 , or of order G(MK) 2 .
In the computation of ĥ = (T ) † y, no advantage is obtained because T is a full matrix.We can recognize, however, that T has integer entries, hence computation of (T H T ) −1 costs order α(KL) 2 , where α is the complexity of adding GM integer numbers.If approximate iterative techniques are used for applying the inverse, then the total complexity becomes order (KL) 2 .This is similar to the complexity of the channel estimation step in [5,7]. 3

Computation via time-varying state space representations
A matrix-vector multiplication y = Tu can be regarded as a time-varying system T, which has input signal u and produces y as the output.Such a system can be realized using time-varying state space equations, where x n is a state-vector that carries information from one stage to the next.This representation shows in some more detail how the entries of y = Tu are computed one-by-one.
A complete theory based on this can be found in [21].In [20], this theory was applied to the efficient inversion of the code matrix T in the current application.Essentially, efficient computations are possible because T has many zero entries and they occur in bands, a result of the FIR channel assumption.Therefore, the channel inversion can have a lower complexity: the QR-factorization, application of Q H , and application of R −1 via backsubstitution can all be done using the state space realization. 4It is also shown that the realization of T has GM stages, and in the nth stage, [C n , D n ] are directly specified in terms of the nonzero entries of the nth row of T, whereas [A n , B n ] are shift matrices (similar to identity matrices).
Without repeating the derivations of [20], we mention the resulting complexities.Computation of a state space realization of T = T•H costs order-GMKL operations, and the result is a realization with GM stages, each with K nonzero entries.Computation of the QR-factorization of T costs GMK 2 operations, and applying Q H or R −1 to a vector via backsubstitution costs GMK operations.In total, the complexity is of order-GMKL + GMK 2 operations.This is a factor M less than in the preceding section, even if here the exact solution is computed.
In the computation of ĥ = (T ) † y, no specific advantage of using state-space realizations is obtained because T is not sparse.In this case, the complexity of the preceding section will be assumed.

Summary
The preceding complexities are summarized in Table 1.For K > L, the dominant term in the complexity is of order GMK 2 , contributed by the symbol estimation step.Per estimated symbol per user, the complexity is GK.This can be compared to the complexity of a RAKE receiver (computing u = T H y), which is GMKL, or GL per estimated symbol per user.This suggests that the two-step algorithm does not cost much more, hence is feasible to implement in practice.If K < L, the dominant complexity is GMKL, of the same order as for the RAKE.
To put this in further perspective, we mention the complexity of a few other proposed algorithms.The Bayesian approach in [17] has a complexity of GL 2 per symbol per user per iteration (about 50-100 iterations are needed).The Kalman filter receiver structure in [25] requires GKL 2 per symbol per user, a known channel is assumed.The reported complexity of the approach in [13] is G 2 L 2 per user, for the channel estimation step only.

SIMULATION RESULTS
Simulations are used to compare the proposed algorithms to the blind RAKE receiver and the DRR.We simulate a Only a single iteration of the two-step algorithm is used.The well-known phase ambiguity problem in blind estimation is easily solved by using a single training pilot symbol or by differential encoding.

Channel estimation mean square error comparison
The channel mean square errors (MSEs) of the various algorithms are compared for varying signal-to-noise ratio (SNR), that is, the energy per bit divided by the noise power (E b /N 0 ).The reference curve is the linear MMSE receiver with known source symbols.Figure 3a shows the results.It is seen that the proposed iterative algorithms (multiuser estimation, either initialized by DRR or RAKE) have significant gains over the DRR and especially over the conventional RAKE receiver.When the SNR is sufficiently high (SNR > 9 dB), their performance is almost the same as the ideal linear MMSE receiver (computed from known symbols) with gain of about 7 dB over the DRR.
When the noise is strong, the proposed algorithm initialized by RAKE seems to be a better candidate than the one with DRR as the initial estimate.This is attributed to the noise enhancement of T † , since T is not very tall.Consequently, as the SNR increases, the gap between the two curves reduces quickly to zero.
In addition, the iterative single-user estimation version of the proposed algorithm also has a good performance with gain of about 2 dB over the DRR.However, separate simulations showed that the noise whitening did not give any improvement in MSE over the unwhitened iterative DRR (its curve is not shown for clarity).
Figure 3b shows how the algorithms' performance changes with respect to the number of users (K) while the SNR is kept fixed at a moderate level, 10 dB.When K is small, the proposed curves are nearly identical to the MMSE receiver.Since DRR requires T to be tall, the maximal number of users for DRR is given by K 0 = G/L .When approach-ing this limit (K ≈ 7 to 8 so that T is barely tall), the performance of DRR starts to deteriorate: the conditioning of T becomes poor and T † will significantly amplify the noise.The two-step algorithm initialized by DRR still has a good performance.However, when K ≥ K 0 = 10, its performance degrades drastically while the algorithm initialized by RAKE still maintains a good performance.Its curve gradually detaches from the MMSE curve as K increases.
It can be interpreted from the preceding results that our proposed multiuser algorithm converges rapidly, and even a single iteration can have significant improvement in channel estimation, which is comparable to the linear MMSE receiver.Moreover, the proposed algorithm is rather independent of the initial estimate when the system is not heavily loaded.When the number of users K becomes critical, initialization by the blind RAKE is the preferred choice because it does not suffer from sudden noise enhancement.

Bit error rate (BER) comparison
We next study the BER performance of the various algorithms.The reference curve indicates the performance of the linear MMSE receiver based on true channel coefficients.Figure 4a corresponds to Figure 3a and shows that the multiuser version of the proposed multiuser algorithm has significant improvement over the DRR.The gain is approximately 4 dB at BER= 10 −2 , and slightly increases when the BER decreases.The single-user noise-whitened iterative version, despite its rather good performance in channel estimation, is only slightly better than its corresponding DRR (the gain is about 1 dB).Without noise whitening, however, the BER results of the original iterative algorithm in Section 2.2 were slightly worse than the noniterative DRR (curves not shown for clarity), therefore, the whitening step is advisable.
The proposed multiuser algorithm seems to have the same BER when the SNR is high enough, independent of its initialization by the DRR or by the blind RAKE.However, when the noise is strong, the iterations initialized by RAKE have a slightly better performance because they do not suffer from noise enhancement in case T is not tall.
Finally, Figure 4b shows the performance of the multiple antenna versions of each of the proposed algorithms.Compared with the corresponding MMSE receiver, the performance gap is wider than in the single-antenna case.This is in accordance with our discussion in Section 3.3.

CONCLUSION
We have derived a multiuser joint source-channel estimation for long-code CDMA, which is the combination of the blind (decorrelating) RAKE receiver with an iterative symbol/channel estimation algorithm.The algorithm shows a significant improvement over the decorrelating RAKE receiver and the conventional RAKE receiver.The gain is especially impressive in heavily loaded systems, even if the noise is strong.Using time-varying state space realizations, we showed that the proposed algorithm can be efficiently implemented, especially if the number of symbols in a slot is relatively large.Per estimated symbol per user, the complexity is of order GK, whereas the complexity of a RAKE receiver is GL, where G is the code length, K the number of users, and L the channel length in chips (assuming K > L and the number of symbols in a slot sufficiently large).Thus, the proposed scheme has a complexity that is similar to that of the RAKE receiver.

Figure 1 :
Figure 1: (a) Effect of a single transmitted symbol on the received data vector y, (b) structure of the code matrix T, channel matrix H, and symbol vector s.

Lemma 1 .
Let h and s be vectors of length L and M, respectively.Then (I M ⊗ h)s = (s ⊗ I L )h.Proof.Using the multiplicative property of Kronecker products, (A ⊗ B)(C ⊗ D) = (AC ⊗ BD), we immediately obtain

Figure 2 :
Figure 2: Structure of (a) matrix H and (b) matrix S.

Table 1 :
Computational complexity of the two-step iterative algorithm.