Blind Synchronization in Asynchronous UWB Networks Based on the Transmit-Reference Scheme

Ultra-wideband (UWB) wireless communication systems are based on the transmission of extremely narrow pulses, with a duration of less than a nanosecond. Applying transmit reference (TR) to UWB systems sidesteps channel estimation at the receiver, at the cost of a reduced effective transmission bandwidth, because part of the transmission is spent on a reference pulse. Similar to CDMA systems, different users can share the same available bandwidth by means of different spreading codes. This allows the receiver to separate users and to recover the timing information of the transmitted data packets. The nature of UWB transmissions (short, burst-like packets) requires a fast synchronization algorithm that can accommodate several asynchronous users. Exploiting the fact that a shift in time corresponds to a phase rotation in the frequency domain, this paper proposes a blind and computationally efficient synchronization algorithm that takes advantage of the shift-invariance structure in the frequency domain. Integer and fractional delay estimation are considered, along with a subsequent symbol estimation step. This results in a collision-avoiding multiuser algorithm, readily applicable to a fast acquisition procedure in a UWB ad hoc network.


INTRODUCTION
Impulse radio (IR) ultra-wideband (UWB), hereafter simply called UWB, has recently been proposed as a system that can provide high data rate communications (up to 100 Mbit/s) over short distances (on the order of 10 m). Exploiting a bandwidth of at least 500 MHz raises a great number of issues in transceiver design and signal processing (see [1] for a recent overview of UWB signal processing and communications challenges).
Classical transceiver schemes use the data signal to modulate a carrier, that is, the spectrum of the data sequence is shifted from baseband to a higher carrier frequency. UWB systems, by contrast, employ a carrier-less approach. The information is conveyed by modulating temporal pulses of extremely short duration (less than a nanosecond). As a consequence, the spectrum of the UWB signal covers an extremely large frequency band. To allow for coexistence with already deployed narrowband communication systems such as GSM, GPS, and WLAN, the energy of the emitted UWB pulses needs to be lowered to the noise level. In addition, generation of the pulses is an extremely low-complexity and low-power operation [2] and therefore facilitates low-cost transmitter devices. All these features make impulse radio attractive for high data rate, short distance, multiuser personal area networks (PAN).
Propagation of a temporally narrow pulse (on the order of a nanosecond), also known as a monocycle, results in a channel impulse response that is much longer than the pulse itself and that has a large number of delay taps [3]. As the channel resolution is inversely proportional to the bandwidth of the signal, differences in path delays of 1 ns, or equivalently in path lengths of 30 cm, can be resolved [4,5]. The resulting low probability of multipath fading permits a larger amount of transmitted energy to be collected at the receiver.
The large bandwidth of UWB signals makes it possible to accommodate multiple simultaneous users. The most common modulation scheme that facilitates coexistence of multiple users in UWB systems is time-hopping pulse position modulation (TH-PPM) [6]. A monocycle is considered to be part of a longer time interval defined as a frame. To avoid collisions due to multiple access, each user is assigned a random time-hopping code and shifts its monocycles within frames accordingly.
The correlation receiver is considered to be the optimal receiver if the TH-PPM modulation scheme is used. Initially, the channel impulse response has to be estimated and convolved with the known user code to obtain the template that is correlated with the received signal. Performing an exhaustive search over different delays, averaging over several data symbols, and finally searching for the maximum of the collected energy function provides the estimate of the packet offset. Note that this kind of receiver requires knowledge of the channel. Some authors propose a RAKE receiver to obtain the channel estimate, but given the current state of technology we consider this approach unsuitable for a low-cost UWB transceiver because of its high computational complexity and high sampling rates.
A way to avoid channel estimation is to use a transmit-reference transmission scheme, introduced in [7] and revived by Hoctor and Tomlinson [8,9]. The idea is to transmit two pulses (a doublet) one after another, where the first pulse serves as the reference for the second pulse, which is modulated with data (the polarity of the pulse corresponds to the data symbols $\{-1, +1\}$). Both pulses undergo the same multipath channel. At the receiver, the reference pulse is delayed and correlated with the data pulse, which recollects the energy that was spread by the channel. The effective data rate is thus reduced by 50%, but the receiver sampling rate and complexity are greatly reduced because the correlation and integration steps are done in the analog part of the receiver.
Since the UWB signal is transmitted at a low power level (comparable to the noise level), further performance improvements by suppressing the noise are possible, as proposed in [10]. In [11], an advanced noiseless data model for a specific TR-UWB receiver was derived, taking into account that the channel has a long impulse response. The extension of this data model to the noisy case is presented in [12].
The work in [11,12] is the starting point for the present paper. In contrast to [11,12], we consider finite data packets with an unknown time offset. The particular structure of the TR-UWB scheme only requires synchronization at the chip level, which is an easier problem than synchronization for more traditional pulse-based UWB schemes, where the starting point of a very narrow pulse has to be found. Hence, the blind synchronization problem considered in this paper is to find the known user code at an unknown offset. This extends a similar problem considered in CDMA to a more complicated data model. In particular, we propose an extension of the blind channel estimation algorithm for CDMA proposed by Torlak and Xu [13]. The received data samples are stacked in a matrix such that a shifted version of the user specific block code is in its column span. Subsequently, we exploit the fact that a shift in time corresponds to a phase rotation in the frequency domain. Finally, a MUSIC-like search for a shift-invariant vector in the signal subspace provides a high resolution delay estimate.

Notation
$^T$ denotes the matrix transpose, $^H$ the matrix complex conjugate transpose, and $^\dagger$ the matrix pseudoinverse (Moore-Penrose inverse). $I$ (or $I_p$) is the $(p \times p)$ identity matrix. $0$ (or $0_{p \times q}$) and $1$ (or $1_{p \times q}$) are $(p \times q)$ matrices for which all entries are equal to 0 and 1, respectively. For a vector $v$, $\mathrm{diag}(v)$ is a diagonal matrix with the entries of $v$ on the diagonal. $\otimes$ is the Kronecker product. $\mathrm{vec}(A)$ is a stacking of the columns of matrix $A$ into a vector.

Single doublet
In the TR-UWB scheme presented in [11,14], pulses $g(t)$ are transmitted in pairs (doublets) which are mutually separated by a delay $D_i$, $i = 1, 2, \ldots, M$, where $M$ represents the total number of delays used. Besides, we assume that The first pulse is fixed and represents the reference, whereas the second one is modulated with the data. In the sequel, we first describe the data model for the synchronous single doublet transmission. In addition, we outline the parameters that arise as a result of the deployment of the specific correlation receiver derived in [11,12,15].
Consider the transmission of a single doublet $d(t) = g(t) + c \cdot s \cdot g(t - D_i)$, where $g(t)$ represents a reference pulse while $c \cdot s \cdot g(t - D_i)$ is a data modulated pulse, with scalars $c \in \{\pm 1\}$ and $s \in \{\pm 1\}$ representing a polarity code and a data symbol, respectively. Accordingly, the sign of the data modulated pulse is defined by the value of $c \cdot s \in \{\pm 1\}$. We assume that a doublet is placed within a frame of length $T_d$ and that a constraint holds, where $T_h$ stands for the duration of the channel impulse response. This condition implies that the pulses of a doublet affected by the channel fade out completely within a single frame after correlation (see Figure 1). In this manner, interframe interference is prevented when multiple doublets are transmitted (see Figure 4). After propagation through a long convolutive channel, the signal at the output of the receiver antenna can be written as $r(t) = h(t) + c \cdot s \cdot h(t - D_i)$, where $h(t) = g(t) * h_p(t)$ represents the overall channel impulse response obtained as the convolution of the transmitted pulse $g(t)$ and the channel impulse response $h_p(t)$. Note that the latter comprises the effects of the transmit and receive antennas together with the wireless propagation channel.
Since both pulses of a doublet undergo the same channel, one is used as a "matched filter" for the other one at the receiver. This is the principle behind the autocorrelation receiver depicted in Figure 1 [11,14]. The received data $r(t)$ is delayed over all possible delays $D_m$, $m = 1, 2, \ldots, M$, and correlated with the original nondelayed signal. Finally, integration with a sliding window of width $W$ at the $m$th receiver branch yields
$$x_m(t) = \int_{t-W}^{t} r(\tau)\, r(\tau - D_m)\, d\tau. \qquad (4)$$
Let us now introduce the channel correlation function
$$\psi(t, \Delta) = \int_{t-W}^{t} h(\tau)\, h(\tau - \Delta)\, d\tau.$$
Assuming $W \ge T_d > T_h$, we can write $\psi(t, \Delta) = b(t)\rho(\Delta)$.
Here, $\rho(\Delta) = \int_0^{\infty} h(\tau) h(\tau - \Delta)\, d\tau$ describes the energy recollected in the correlation process. It is maximized for $\Delta = 0$ and is, in general, nonzero for $\Delta \neq 0$ (see [15]). Further, $b(t)$ has a brick shape and can be written as Note that in the regions $0 \le t < T_h$ and $W < t \le W + T_h$, $b(t)$ depends on the particular channel realization, but it is approximated by a linearly rising and a linearly decaying slope, respectively.
If we now assume that $W \gg D_M$, the output of the $m$th integrator (4), in case delay $D_i$ is used at the transmitter and $D_m$ at the receiver, becomes [15]
$$x_m(t) = \left(c \cdot s \cdot \alpha_{m,i} + \beta_m\right) b(t),$$
where $\alpha_{m,i} = \rho(D_m - D_i) + \rho(D_m + D_i)$ and $\beta_m = 2\rho(D_m)$ represent the unknown channel correlation coefficients, which are real numbers corresponding to a gain and a DC offset, respectively [15]. While the gain depends on both the transmitter delay $D_i$ and the receiver delay $D_m$, the DC offset only depends on the receiver delay $D_m$. Moreover, both $\alpha_{m,i}$ and $\beta_m$ depend on the particular correlation properties of the channel, as indicated in [11]. Note that although $\alpha_{m,i}$ is generally maximal if $m = i$, some residual information remains when $m \neq i$, as an effect of the channel correlation. Figure 2(a) depicts the response of the system to a single transmitted doublet for the channel impulse response of Figure 2(b). The latter is obtained from a measurement campaign performed in a typical university building [16]. The spacing between the pulses at the transmit side is chosen to be $D_3 = 2.1$ ns, the data symbol is $s = +1$, and the polarity code is $c = +1$. At the receiver side, three delay branches $m = 1, 2, 3$ are deployed with $D_m \in \{0.7\ \text{ns}, 1.4\ \text{ns}, 2.1\ \text{ns}\}$. A sliding window integration that is three times wider than the frame width, that is, $W = 3T_d = 180$ ns, produces the signal $x_m(t)$ with a nonzero support in the range $[0, W + T_h]$. In our case, the length of the channel is $T_h = 50$ ns. The solid line depicts the signal at the output of the third receiver branch for the matched transmit and receive delays, $\Delta = D_m - D_i = 0$. Signals for the nonmatching delays $D_m \neq D_i$ are depicted by dashed and dash-dotted lines.
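The gain and DC offset structure of the integrator output can be checked numerically. The following is a minimal discrete-time sketch, assuming a synthetic, randomly generated channel `h` (not the measured channels of [16]), integer sample delays, and an integration window long enough to capture the whole response, so that the brick function equals one. The function names are illustrative and not taken from the paper.

```python
import numpy as np

def rho(h, lag):
    """Discrete channel autocorrelation: sum_n h[n] * h[n - lag]."""
    lag = abs(lag)
    if lag >= len(h):
        return 0.0
    return float(np.dot(h[lag:], h[:len(h) - lag]))

def received_doublet(h, d_tx, sign):
    """r[n] = h[n] + sign * h[n - d_tx], with sign = c * s in {-1, +1}."""
    r = np.zeros(len(h) + d_tx)
    r[:len(h)] += h
    r[d_tx:] += sign * h
    return r

def correlator_output(r, d_rx):
    """Correlate r with its copy delayed by d_rx and integrate over all samples."""
    return float(np.dot(r[d_rx:], r[:len(r) - d_rx]))

# Synthetic decaying channel (assumption, for illustration only).
rng = np.random.default_rng(0)
h = rng.standard_normal(300) * np.exp(-np.arange(300) / 80.0)

d_tx = 21                      # transmit delay D_i in samples
for d_rx in (7, 14, 21):       # receiver branch delays D_m in samples
    alpha = rho(h, d_rx - d_tx) + rho(h, d_rx + d_tx)   # gain
    beta = 2.0 * rho(h, d_rx)                           # DC offset
    for sign in (+1, -1):
        x = correlator_output(received_doublet(h, d_tx, sign), d_rx)
        print(d_rx, sign, x, sign * alpha + beta)
```

In each printed row the last two numbers should agree, which illustrates why the data symbol can only be read from the sign of the output after the DC offset $\beta_m$ has been accounted for.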

Single chip transmission
According to the spectral regulations, the UWB signal needs to be transmitted at a very low power level. In order to extract the useful information at the receiver side, some kind of spreading gain needs to be introduced. The simplest approach is to repeat several, say $N_d$, frames of total duration $T_c = N_d T_d$. Define such a sequence of frames as a chip, in which the spacing between pulses ($D_i$) and the polarity of the information pulses ($c \cdot s$) remain unchanged. In such a case the transmitted signal $t_x(t)$ is given by
$$t_x(t) = \sum_{k=0}^{N_d - 1} d(t - kT_d).$$
The signal at the output of the $m$th receiver branch is computed as the superposition of the contributions of $N_d$ doublets,
$$x_m(t) = \left(c \cdot s \cdot \alpha_{m,i} + \beta_m\right) p(t).$$
The function $p(t)$ represents a typical response of the sliding window integration process for a case in which a single chip is considered. In general, $p(t)$ has a staircase tent shape and is modeled as
$$p(t) = \sum_{k=0}^{N_d - 1} b(t - kT_d),$$
where $b(t)$ is the brick shape function defined in (6). Note that since $b(t)$ depends on the particular channel realization, so does $p(t)$.
Figure 3 shows the signal $x_m(t)$ at the integrator outputs for the transmission of a single chip containing three doublets, $T_c = 3T_d$. The strongest signal is obtained for matching transmit and receive delays (solid line). Dashed and dash-dotted lines depict the cases in which a delay mismatch occurs ($D_m \neq D_i$). In these cases, the signal is nonzero due to the effect of the channel correlation coefficients $\alpha_{m,i}$ and $\beta_m$. Note that even though the transmitted chip is $T_c$ wide, the deployment of a sliding window integration of width $W = T_c$ expands the nonzero support of the signal at the receiver side to the region $[0, 2T_c]$.

Transceiver design for asynchronous multiple symbol transmission
In this section, we build a data model for the asynchronous transmission of multiple data symbols. As UWB systems cover a large frequency band, and in order to avoid catastrophic collisions in multiuser scenarios, the broadcast signal is spread by means of the polarity and time-hopping codes. As described in Section 2.1, the basic information unit is a frame of duration $T_d$. Further, $N_d$ frames represent a chip of duration $T_c = N_d T_d$, and $N_c$ chips represent a data symbol of duration $T_s = N_c T_c$. The $j$th chip of the $k$th data symbol is modulated by $s_k c_j$, where $s_k \in \{\pm 1\}$ represents the data symbol sequence and $c_j \in \{\pm 1\}$, $j = 0, 1, \ldots, N_c - 1$, represents the polarity code. The value of the delay $D_i$ is constant within the $j$th chip but changes from chip to chip according to the so-called time-hopping code $J_{i,j}$, $i = 1, 2, \ldots, M$, $j = 0, 1, \ldots, N_c - 1$, which is 1 if the delay $D_i$ is used for the $j$th chip and 0 otherwise. To summarize, the transmitted sequence can be written as An example of a transmitted pulse sequence for a single symbol is presented in Figure 4. Hence, we can write the received sequence as Note that we consider no additive noise throughout this work, in order to simplify the presentation. However, all the simulations will be carried out in the presence of noise.
The output of the $m$th receiver branch in an asynchronous single user scenario is then modeled as where $\tau$ represents an unknown delay of the received data signal with respect to the beginning of the analysis window, which we try to estimate in this work. Since short polarity and time-hopping codes ($c_j$ and $J_{i,j}$) are considered and symbols are assumed unknown in a first stage, we may restrict $\tau$ to the interval $\tau \in [0, T_s)$. An example of the expected behavior of the signals at the output of the integrators is presented in Figure 5. Solid lines represent the integrator output for matched transmit and receive delays ($D_m = D_i$), while dashed lines depict the residual information for nonmatching delays ($D_m \neq D_i$). The overall signal at each receiver branch is obtained as the sum of the matched and nonmatched delay contributions (sum of solid and dashed lines).
Figure 4: The structure of a transmitted UWB signal. The data symbol is set to $s_1 = +1$. The polarity (CDMA) code and the time-hopping code each span three chips; the transmit delays $D_2$, $D_1$, and $D_3$ are used for the 1st, 2nd, and 3rd chips, respectively.
The bandwidth of $x_m(t)$ is of the same order of magnitude as the chip rate, which is significantly smaller than the transmission bandwidth. Hence, at this point, it is realistic to introduce sampling and switch to the digital domain. Let us sample $x_m(t)$ at rate $P/T_c$, where $P$ is the oversampling factor.
The sampled signal can then be written as where $p_{n,j} = p(nT_c/P - jT_c - \tau)$. The crucial observation now is that if we sample once per frame, that is, if we choose $P = N_d$, $p_{n,j}$ may be regarded as a sequence of samples of a perfectly known triangular pulse shape (see the dash-asterisked line in Figure 3). As a result, $p_{n,j}$ is completely known if $\tau$ is an integer multiple of $T_c/P$. This fact is exploited in the process of estimating an arbitrary offset $\tau$, as presented in Section 3.
We generally stack the $N_cP$ samples $x_{m,n}$, $n = kN_cP, kN_cP + 1, \ldots, (k + 1)N_cP - 1$, together in the $N_cP \times 1$ vector
$$x_{m,k} = [x_{m,kN_cP}, x_{m,kN_cP+1}, \ldots, x_{m,(k+1)N_cP-1}]^T$$
and stack the $M$ vectors $x_{m,k}$, $m = 1, 2, \ldots, M$, together in the $N_cP \times M$ matrix
$$X_k = [x_{1,k}, x_{2,k}, \ldots, x_{M,k}].$$
We now first introduce a matrix model for a single transmitted data symbol, and then generalize this to multiple transmitted data symbols.

Single transmitted data symbol-matrix model
Suppose only the $k$th symbol is transmitted. If we then stack $X_k$ and $X_{k+1}$ vertically, we obtain as in [17] the following matrix model for a single transmitted data symbol: where $A$ and $b$ are the $M \times M$ matrix and $M \times 1$ vector defined as $[A]_{m,i} = \alpha_{m,i}$ and $[b]_m = \beta_m$, respectively. As mentioned before, they depend on the correlation properties of the channel. It can be shown that $A$ is symmetric, approximately Toeplitz, and diagonally dominant with positive entries on its main diagonal. The $N_c \times 1$ vector $c = [c_0, \ldots, c_{N_c-1}]^T$ is the known polarity code vector. The matrix $J$ of size $M \times N_c$ is a known selector matrix which has a single unit element per column (chip), which determines the transmitter delay for that column (chip), or more specifically, $[J]_{i,j+1} = J_{i,j}$. Finally, $\mathcal{P}$ is the $2N_cP \times N_c$ block-Toeplitz matrix whose columns are shifts of $p_{n,j}$, or more specifically, $[\mathcal{P}]_{n+1,j+1} = p_{n,j}$.
Let us now split $\tau$ into an integer delay $\kappa$ and a fractional delay $\epsilon$ as $\tau = \kappa T_c/P + \epsilon + T_c/(2P)$, where $\kappa \in \{0, \ldots, N_cP - 1\}$ and $\epsilon \in [-T_c/(2P), T_c/(2P))$ (the additional offset $T_c/(2P)$ is included to make the interval for $\epsilon$ symmetric around 0). This allows us to write the $2N_cP \times N_c$ block-Toeplitz matrix $\mathcal{P}$ as
$$\mathcal{P} = \left[0_{\kappa \times N_c}^T,\ P^T,\ 0_{(K-\kappa) \times N_c}^T\right]^T,$$
where $K = (N_c - 1)P$ and $P$ is the $(N_c + 1)P \times N_c$ block-Toeplitz matrix with entries given by $[P]_{n+1,j+1} = p(nT_c/P - jT_c + T_c/(2P) - \epsilon)$, that is, it only depends on $\epsilon$ (see Figure 6). In other words, if we only focus on coarse synchronization, we may assume that $\epsilon = 0$ and thus that $P$ is known.
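The split of the offset into an integer and a fractional part can be written out as a short helper. This is a minimal sketch; the function name `split_delay` and the argument names are illustrative, and it follows the definition given in the text, with $\epsilon$ measured from the centre of a sampling interval.

```python
import math

def split_delay(tau, t_c, p):
    """Split tau into (kappa, eps) with tau = kappa*t_c/p + eps + t_c/(2p).

    kappa is the integer delay in units of the sampling period t_c/p and
    eps lies in [-t_c/(2p), t_c/(2p)).
    """
    step = t_c / p                      # sampling period T_c / P
    kappa = math.floor(tau / step)      # integer part of the delay
    eps = tau - kappa * step - step / 2.0
    return kappa, eps

# Example with T_c = 180 ns and P = 3, so the sampling period is 60 ns.
print(split_delay(100.0, 180.0, 3))
```

A delay that falls exactly in the middle of a sampling interval gives $\epsilon = 0$, which is the case in which the pulse matrix is completely known.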
As a result, we can rewrite (17) as where $Z := P\,\mathrm{diag}(c)J^T$ is an $(N_c + 1)P \times M$ code matrix, which is known if the fractional delay $\epsilon = 0$, and $q := P_0 1_{N_c} \approx 1_{(N_c+1)P}$ is a known $(N_c + 1)P \times 1$ offset vector. The approximation $q \approx 1_{(N_c+1)P}$ follows from the structure of the $P$ matrix. The channel parameters $A$ and $b$ as well as the data symbol $s_k$ are unknown. The block matrix structure for a single symbol is depicted in Figure 7.
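To make the construction of the code matrix concrete, the sketch below builds $Z = P\,\mathrm{diag}(c)J^T$ for a toy example. It assumes an ideal triangular chip response with unit peak and support $[0, 2T_c]$ (the true $p(t)$ depends on the channel), zero fractional delay by default, and a hypothetical three-chip code in which delays $D_2$, $D_1$, $D_3$ are used in turn. All function names are illustrative.

```python
import numpy as np

def tent(t, t_c):
    """Ideal triangular chip response: support [0, 2*t_c], unit peak at t_c (assumed shape)."""
    return np.clip(1.0 - np.abs(t - t_c) / t_c, 0.0, None)

def pulse_matrix(n_c, p, t_c=1.0, eps=0.0):
    """(n_c+1)*p x n_c matrix with entries p(n*t_c/p - j*t_c + t_c/(2p) - eps)."""
    n = np.arange((n_c + 1) * p)[:, None]
    j = np.arange(n_c)[None, :]
    return tent(n * t_c / p - j * t_c + t_c / (2 * p) - eps, t_c)

def code_matrix(c, hop, m, p):
    """Z = P diag(c) J^T, where hop[j] is the 0-based index of the delay used in chip j."""
    n_c = len(c)
    J = np.zeros((m, n_c))
    J[hop, np.arange(n_c)] = 1.0        # one unit entry per column (chip)
    return pulse_matrix(n_c, p) @ np.diag(c) @ J.T

c = np.array([1, -1, 1])                # polarity code (hypothetical values)
hop = np.array([1, 0, 2])               # delays D_2, D_1, D_3 for chips 1, 2, 3
Z = code_matrix(c, hop, m=3, p=3)
q = pulse_matrix(3, 3) @ np.ones(3)     # offset vector, close to all-ones
print(Z.shape)
print(np.round(q, 3))
```

Under this idealized pulse shape the interior entries of `q` equal one and only the first and last `p` entries ramp up and down, which is one way to see where the approximation $q \approx 1$ comes from.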
Observe that the presented data model resembles a conventional data model for DS-CDMA, up to the channel gain ($A$) and DC offset ($b$) terms of the channel correlation. This will allow us to use synchronization methods similar in spirit to the DS-CDMA synchronization methods. But before we introduce these synchronization methods, we first generalize the above model to a data model for multiple transmitted data symbols.

Multiple transmitted data symbol-matrix model
When transmission of multiple data symbols is considered, intersymbol interference (ISI) arises due to the sliding window integration. Generally, two data symbols affect a single block of received data $X_k$. Therefore, stacking $X_k$ and $X_{k+1}$ vertically, we can modify (19) to the following matrix model: The block columns $Z'_\tau$, $Z_\tau$, and $Z''_\tau$, all of size $2N_cP \times M$, comprise the effects of the polarity and time-hopping codes as well as the effect of the pulse shape $p(t)$. We begin with defining the second block column $Z_\tau$, which is similar to the first block column of (19): $Z_\tau = [0_{\kappa \times M}^T, Z^T, 0_{(K-\kappa) \times M}^T]^T$. This block column comprises the complete version of the user specific code matrix $Z = P\,\mathrm{diag}(c)J^T$, which is known if the fractional delay $\epsilon = 0$, shifted by an integer delay $\kappa$. The other two block columns $Z'_\tau$ and $Z''_\tau$ can be defined as They contain only part of the user block code $Z$. $Z''$, with size $(N_cP - K + \kappa) \times M$, and $Z'$, with size $(N_cP - \kappa) \times M$, depict the effect of a "previous" and a "subsequent" data symbol, respectively. It is thereby crucial to observe that $Z = [Z'^T\ Z''^T]^T$. Writing (20) in a more compact form, we obtain where Let us now define a received data matrix $X$ as where $n$ is the length of the analysis window over which data is collected. Using (21), we can write this matrix as where $S = [S_0, \ldots, S_{n-1}]$ (see also Figure 9). The structure of the received data blocks for multiple transmitted symbols is depicted in Figure 8.
[Figure caption: The structure of the analysis window for an asynchronous TR-UWB scheme.]
In the case where the analysis window is not within the transmitted packet, we can use the same model but allow some of the symbols $s_k$ to be zero (note that $1_n^T \otimes b^T$ will also change in that case).

Optimization problem
We now describe the synchronization algorithm. Figure 8 presents the relation between the received data at the integrator outputs $X_k$ and the transmitted symbols. We describe a block algorithm that provides an estimate of the delay $\tau$, and also allows us to estimate the data symbols $s_k$.
The algorithm is an extension of the algorithm of Torlak and Xu [13], who considered blind channel estimation for DS-CDMA using subspace techniques. The idea is to use the property that the matrix $G$ is orthogonal to the left nullspace ($U_0$) of the matrix $X$, that is, $U_0^H G = 0$. We can use this relationship in order to find an estimate of $\tau$. More specifically, we solve where $u_1^{(i)}$ and $u_2^{(i)}$ are both of size $N_cP \times 1$ and denote the first and the second halves of the $i$th column of $U_0$, respectively. $Z_1$ and $Z_2$ are of size $N_cP \times M$ and represent the upper and lower halves of the middle block column of $G$. We now aim to transform (24) without changing the criterion, in order to bring out the block column $Z_\tau$, containing the user specific code matrix $Z$, which is known if the fractional delay $\epsilon = 0$, shifted by an integer delay $\kappa$. Restacking (24) as in [13] yields arg min Here, $i$ sweeps all the vectors from the left null space of $X$. By stacking horizontally $U^{(i)}$ for all possible $i$ we get the matrix $\mathcal{U}_0$. Now (25) can be written as At this point, let us make a distinction between integer delay estimation and noninteger delay estimation.

Integer delay estimation
We first assume that the fractional delay $\epsilon = 0$, and focus on the estimation of the integer delay $\kappa$. As already mentioned, if $\epsilon = 0$, the matrix $P$ and thus the matrix $Z$ are completely known. As a result, $Z_\tau$, which can then be written as $Z_{\kappa T_c/P}$, only depends on $\kappa$ and we can rewrite (26) as This can be solved by performing an exhaustive search over $\kappa \in \{0, \ldots, N_cP - 1\}$, since we know $Z_{\kappa T_c/P}$ up to the integer delay $\kappa$.
Since the above time-domain approach is rather computationally intensive, we switch to a much simpler frequency-domain approach, recognizing that an integer shift in the time domain corresponds to a phase shift in the frequency domain. More specifically, we can write $Z_{\kappa T_c/P}$ as
$$Z_{\kappa T_c/P} = F^H D_{\kappa T_c/P} F Z_0, \qquad (28)$$
where $F$ stands for the $2N_cP \times 2N_cP$ normalized discrete Fourier transform matrix, $D_\tau$ represents the $2N_cP \times 2N_cP$ diagonal matrix given by
$$D_\tau = \mathrm{diag}\left(1, e^{-j2\pi\tau/(2T_s)}, \ldots, e^{-j2\pi(2N_cP-1)\tau/(2T_s)}\right),$$
and $Z_0$ is a completely known $2N_cP \times M$ matrix. If we now denote $z_0^{(l)H}$ as the $l$th row of $Z_0^H$, and define $\tilde{z}_0^{(l)} := Fz_0^{(l)}$, $\tilde{\mathcal{U}}_0 := F\mathcal{U}_0$ for the stacked matrix $\mathcal{U}_0$ built from the left null space, and $\phi_\tau = \mathrm{diag}(D_\tau)$, we can rewrite (27) as Due to the structure of $\phi_{\kappa T_c/P}$, which corresponds to the $(\kappa+1)$th column of the FFT matrix $F$, searching for the $\phi_{\kappa T_c/P}$ that minimizes the last expression is equivalent to performing an inverse FFT (IFFT) on the matrix $\mathcal{K}$ and searching for the row of the resulting matrix that has the lowest norm. The index of the row with the lowest norm determines the integer delay $\kappa$. Note that through the use of the (I)FFT this frequency-domain approach is much simpler than the time-domain approach developed earlier. However, since we have assumed that the fractional delay $\epsilon = 0$, the resolution of this algorithm is limited by the sampling period $T_c/P$. This problem will be treated in the next section.
In order to compare the computational complexity of the integer delay estimation carried out in the time and frequency domains, we compute the number of multiplications needed in both cases. The time-domain search requires $O(2M^2(N_cP)^4)$ multiplications, in contrast to $O(M^2(2N_cP)^2 \log_2(2N_cP))$ multiplications when the proposed frequency-domain search is employed.
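The equivalence between the exhaustive time-domain search and the FFT-based search can be illustrated on a toy problem. The sketch below is a simplification and not the full algorithm of this section: it assumes a noiseless model $X = G S$ in which only a circularly shifted, zero-padded code matrix is present (the ISI block columns and the DC offset term are left out), and it uses a forward FFT, which matches the IFFT of the text up to the sign convention of the transform. All names are illustrative.

```python
import numpy as np

def shift_cost_bruteforce(U0, Z0):
    """Cost ||U0^H Z_k||_F^2 for every circular shift k of the code matrix Z0."""
    N = Z0.shape[0]
    return np.array([
        np.linalg.norm(U0.conj().T @ np.roll(Z0, k, axis=0)) ** 2
        for k in range(N)
    ])

def shift_cost_fft(U0, Z0):
    """Same cost for all shifts at once, using the shift theorem."""
    N = Z0.shape[0]
    Uf = np.fft.fft(U0, axis=0)                      # N x r
    Zf = np.fft.fft(Z0, axis=0)                      # N x M
    prod = np.conj(Uf)[:, :, None] * Zf[:, None, :]  # N x r x M
    # corr[k, i, l] is the inner product of null-space vector i with
    # column l of Z0 circularly shifted by k samples.
    corr = np.fft.fft(prod, axis=0) / N
    return np.sum(np.abs(corr) ** 2, axis=(1, 2))

# Toy data (assumed sizes): window of N samples, M delays, code of 12 rows.
rng = np.random.default_rng(1)
N, M, true_shift = 32, 3, 9
Z0 = np.zeros((N, M))
Z0[:12] = rng.standard_normal((12, M))
G = np.roll(Z0, true_shift, axis=0)
X = G @ rng.standard_normal((M, 20))       # noiseless received data matrix
U, _, _ = np.linalg.svd(X)
U0 = U[:, M:]                              # left null space of X
cost = shift_cost_fft(U0, Z0)
print(int(np.argmin(cost)))
```

The brute-force version needs one matrix product per candidate shift, whereas the FFT version obtains all shifts from a single transform per pair of columns, which is the source of the complexity reduction quoted above.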

Noninteger delay estimation
Let us now consider the more general case, where the fractional delay $\epsilon \neq 0$. We can then proceed as in the previous section, by observing that if the sampling rate is close to the Nyquist rate, we can also express a noninteger shift in the time domain by a phase shift in the frequency domain. In other words, we can extend (28) for the noninteger delay case to Following similar steps as in the previous section, we can then transform (26) to arg min As before, we can first look for an integer delay $\hat{\kappa}$ by computing the IFFT of $\mathcal{K}$ and searching for the row of the resulting matrix that has the lowest norm. The fractional delay $\hat{\epsilon}$ is then obtained by performing an additional fine-grid MUSIC-like search around $\hat{\kappa}T_c/P$: The overall delay estimate is finally given by $\hat{\tau} = \hat{\kappa}T_c/P + \hat{\epsilon}$.
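The observation that a noninteger shift can also be expressed as a phase shift can be illustrated with a short sketch. It assumes a periodic (circular) signal model and a band-limited test signal; `fractional_shift` is an illustrative helper, not a routine from the paper.

```python
import numpy as np

def fractional_shift(x, delay):
    """Circularly delay x by `delay` samples (may be non-integer) via a phase ramp."""
    f = np.fft.fftfreq(len(x))          # normalized frequencies in cycles/sample
    return np.fft.ifft(np.fft.fft(x) * np.exp(-2j * np.pi * f * delay))

N = 64
n = np.arange(N)
x = np.cos(2 * np.pi * 3 * n / N)       # band-limited test signal
y = fractional_shift(x, 0.25).real      # delayed by a quarter of a sample
print(np.max(np.abs(y - np.cos(2 * np.pi * 3 * (n - 0.25) / N))))
```

For an integer delay the phase ramp reduces to a column of the DFT matrix and the result coincides with a plain circular shift. For a fractional delay the result is exact only for signals without energy near half the sampling rate, which is consistent with the requirement that the sampling rate be close to the Nyquist rate.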

Symbol estimation
After estimating the delay $\tau$, we can reconstruct the complete $G$ matrix. Estimation of the transmitted data symbols is now possible by performing a deconvolution of the matrix $X$ using the known user code, that is, we compute where $^\dagger$ denotes the pseudoinverse. We subsequently limit our attention to the middle block row of $S$, which we denote $S'$, as the part that carries most of the energy (see also Figure 9). The data symbols at this point can be estimated from $S'$ in two different ways [17]: (i) by computing the trace of the $M \times M$ data blocks $As_k$, or (ii) by vectorizing the $M \times M$ data blocks $As_k$ and stacking the results column-wise into a matrix, such that we get a rank-one matrix whose row span corresponds to a scaled version of the data symbols. In both cases, the estimates can be further refined by iterations [12].
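Both symbol estimators can be sketched in a few lines once estimates of the $M \times M$ blocks $As_k$ are available. The example below assumes a hypothetical symmetric, diagonally dominant $A$ with positive diagonal and lightly perturbed blocks; the function names are illustrative. The rank-one estimate is only defined up to a global sign.

```python
import numpy as np

def symbols_from_trace(blocks):
    """blocks: array (n, M, M) holding estimates of A*s_k; returns sign(trace)."""
    return np.sign(np.trace(blocks, axis1=1, axis2=2))

def symbols_from_rank_one(blocks):
    """Vectorize each block, stack column-wise, keep the dominant right singular vector."""
    # The ordering used to vectorize a block does not affect the row span.
    V = blocks.reshape(blocks.shape[0], -1).T      # M^2 x n, ideally rank one
    _, _, vh = np.linalg.svd(V, full_matrices=False)
    return np.sign(vh[0])                          # symbols up to a global sign

rng = np.random.default_rng(2)
A = np.array([[2.0, 0.5, 0.2],
              [0.5, 1.8, 0.4],
              [0.2, 0.4, 2.2]])                    # hypothetical channel matrix
s = rng.choice([-1, 1], size=16)
blocks = s[:, None, None] * A + 0.01 * rng.standard_normal((16, 3, 3))

print(symbols_from_trace(blocks))
print(symbols_from_rank_one(blocks))
```

The trace estimator relies on the positive diagonal of $A$ to fix the sign directly, while the rank-one estimator uses all $M^2$ entries of each block and therefore averages over more observations.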

EXTENSION TO THE MULTIUSER CASE
In this section, we extend the previous ideas developed for a single user to multiple users. Let us start by extending the data model of Section 2.5 to multiple users. This is not trivial, since next to the autocorrelation terms of the different users, there are also crosscorrelation terms, due to the use of the autocorrelation receiver. However, since different users employ distinct time-hopping and polarity codes, propagate through different channels, and arrive at the receiver at random time instants, we can treat these cross terms as additive white noise, and add them to the other noise terms that might be present. As before, we do not take the additive noise terms into account in our derivations, but we do include them in our simulations. As a result, indicating the user index by means of a superscript $q$ ($q = 1, 2, \ldots, Q$), we can write the received data block $X$ as
where $S^{(q)} = [S_0^{(q)}, \ldots, S_{n-1}^{(q)}]$. Note that if some users are not active for the duration of the whole analysis window, several $S_k^{(q)}$ matrices will be zero and some small changes in the structure of $1_n^T \otimes \sum_{q=1}^{Q} b^{(q)T}$ will occur. Consequently, a few additional vectors with low energy may emerge in the left signal space.

Identifiability for multiuser case
In the multiuser (MU) case as presented in (35), the code matrix $G_{MU}$ is of size $2N_cP \times (3MQ + 1)$, whereas the matrix comprising all data blocks and offset effects, $S_{MU}$, is of size $(3MQ + 1) \times Mn$. In order to determine the column space of $G_{MU}$ from $X$ (and hence its left nullspace), $G_{MU}$ should be tall and of full column rank, that is, $2N_cP > 3MQ + 1$, and $S_{MU}$ should be fat and of full row rank, that is, $3MQ + 1 < Mn$. Note that a full column rank $G_{MU}$ is also required to subsequently detect the data symbols. From the first condition, a limit on the code size is obtained: $N_c > 3MQ/(2P)$, which for typical values of $P = 2$ and $M = 4$ yields $N_c > 3Q$. The condition on the size of $S_{MU}$ gives the relation between the number of users $Q$ and the lowest number of transmitted symbols $n$, that is, $Q < (Mn - 1)/(3M)$.
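The two dimension conditions are easy to evaluate for a given configuration. The helper below is a small sketch with illustrative names; it only checks the size conditions stated above and says nothing about whether the matrices actually have full rank.

```python
def is_identifiable(n_c, p, m, q, n):
    """Check the size conditions 2*N_c*P > 3*M*Q + 1 and 3*M*Q + 1 < M*n."""
    tall = 2 * n_c * p > 3 * m * q + 1      # G_MU has more rows than columns
    fat = 3 * m * q + 1 < m * n             # S_MU has more columns than rows
    return tall and fat

def min_code_length(p, m, q):
    """Smallest integer N_c satisfying 2*N_c*P > 3*M*Q + 1."""
    return (3 * m * q + 1) // (2 * p) + 1

# Typical values from the text: P = 2, M = 4.
for users in (1, 2, 3):
    print(users, min_code_length(2, 4, users))
```

For $P = 2$ and $M = 4$ this reproduces the rule of thumb $N_c > 3Q$, so each additional user costs three more chips in the code.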

APPLICATION IN UWB NETWORKING
The ability to achieve high resolution packet offset estimation in a multiuser environment in a fast and computationally simple way is of crucial importance for the subsequent data symbol estimation step. Imagine the scenario of a UWB ad hoc network where users need to exchange their codes at the moment they join the network. The simplest way to solve this problem is to use a common code known to all the users in the initialization phase. In existing wireless network protocols, a data packet is considered to be lost if several users simultaneously use the same code, which is known as the packet collision problem. Nevertheless, the structure and the design of the considered TR-UWB scheme allow us to avoid the collision problem. In TR-UWB systems, different users propagate through different channels, creating distinct correlation matrices $A^{(i)}$. This can be viewed as an additional coding introduced by the channel itself and can be used to solve the collision problem, as illustrated next.
Consider a two-user system where both users adopt the same spreading code. The data model (35) then becomes The synchronization of both users to the common code and the subsequent data detection are in general only possible if $\tau_i \neq \tau_j$ for $i \neq j$ and if the common code has a low autocorrelation property.
But even if the two users completely overlap in time, it is still possible to separate them and detect their data sequences. In that case the linear dependency between $G_{\tau_1}$ and $G_{\tau_2}$ reduces the rank of the code matrix. As a consequence, the data blocks $S^{(1)}$ and $S^{(2)}$ merge into a single block $S = S^{(1)} + S^{(2)}$, changing (36) to Estimating the packet offset delay $\tau_1 = \tau_2$, we can reconstruct $G_{\tau_1 = \tau_2}$ and subsequently, as in (34), obtain an estimate of the data matrix $S = S^{(1)} + S^{(2)}$. Considering only the middle block row of $S = S^{(1)} + S^{(2)}$ (see Figure 9), we get $S' = S'^{(1)} + S'^{(2)}$, which can be modeled as Performing the vectorization of each $M \times M$ block of $S'$ yields
$$\left[\mathrm{vec}\left(A^{(1)}s_1^{(1)} + A^{(2)}s_1^{(2)}\right), \ldots, \mathrm{vec}\left(A^{(1)}s_n^{(1)} + A^{(2)}s_n^{(2)}\right)\right] = \left[a^{(1)}\ a^{(2)}\right] \begin{bmatrix} s_1^{(1)} & \cdots & s_n^{(1)} \\ s_1^{(2)} & \cdots & s_n^{(2)} \end{bmatrix},$$
where $a^{(i)} = \mathrm{vec}(A^{(i)})$. A singular value decomposition of this matrix produces a rank-two decomposition, which indicates the existence of two overlapping users. Now, the column vectors $(a^{(1)}, a^{(2)})$ and the data symbols $(\{s_k^{(1)}\}, \{s_k^{(2)}\})$ can be estimated from its column and row span. This approach fails only in the case when $A^{(1)} = \gamma A^{(2)}$, where $\gamma$ is a scalar, but this has an extremely low probability of occurrence.
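The rank argument for two fully overlapping users can be checked with a small numerical sketch. It assumes noiseless blocks and hypothetical random channel matrices; `count_overlapping_users` is an illustrative helper that counts significant singular values and is not a detector proposed in the paper.

```python
import numpy as np

def count_overlapping_users(blocks, rel_tol=1e-8):
    """Numerical rank of the matrix whose columns are the vectorized M x M blocks."""
    V = blocks.reshape(blocks.shape[0], -1).T       # M^2 x n
    sv = np.linalg.svd(V, compute_uv=False)
    return int(np.sum(sv > rel_tol * sv[0]))

rng = np.random.default_rng(4)
M, n = 3, 30
A1 = rng.standard_normal((M, M)); A1 = A1 + A1.T    # hypothetical symmetric channel matrices
A2 = rng.standard_normal((M, M)); A2 = A2 + A2.T
s1 = rng.choice([-1, 1], size=n)
s2 = s1.copy()
s2[::2] *= -1                                       # second sequence differs from +/- s1

overlap = s1[:, None, None] * A1 + s2[:, None, None] * A2
degenerate = s1[:, None, None] * A1 + s2[:, None, None] * (3.0 * A1)

print(count_overlapping_users(overlap))
print(count_overlapping_users(degenerate))
```

The second case mimics the failure condition $A^{(1)} = \gamma A^{(2)}$: the two contributions then share the same column direction and the users can no longer be told apart from the rank alone.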

Single user case
The performance of the proposed algorithm is first tested for a single user in noise. Signals are generated in accordance with the description provided in Section 2. Two hundred and fifty Monte Carlo runs are performed for fixed polarity and time-hopping codes. Data symbols and noise are varied in each run. Both line-of-sight and non-line-of-sight channel impulse responses are covered in this fashion (see [16]). A sampling period of 10 ps is used in the channel measurements. We limit the measured channel impulse responses to the interval $[0, 50]$ ns, as the contributions of the channel components that fall outside this interval are insignificant. The duration of the frame is chosen to be $T_d = 60$ ns. The energy of a single transmitted data symbol (bit) is defined as $E_b = 2N_cN_d \int h^2(t)\, dt$, where $h(t)$ represents the total channel impulse response, including the monocycle and the transmit and receive antenna effects. $N_0$ is the power spectral density of the white Gaussian noise which is added after the receive antenna. The $E_b/N_0$ is changed in steps of 1 dB. After the white Gaussian noise is added, bandpass filtering is performed to limit the bandwidth of the observed signal to the interval 4-10 GHz. This filtering step reduces the impact of the noise and the low-frequency interference. Note, though, that the $E_b/N_0$ is computed before the filtering.
At the output of the autocorrelation receiver with three receiver branches, oversampling is performed at a rate equal to the number of frames per chip, that is, $P = N_d = 3$. Note that the oversampling factor $P = 3$ leads to a sampling rate that is still lower than the Nyquist rate for the expected pulse shapes at the integrator outputs, but it is sufficient, as the estimation errors for $\tau$ are negligible compared to sampling at the Nyquist rate [18].
An example of a received single-user signal and noise (after bandpass filtering) is presented in Figure 10. In this figure, $E_b/N_0 = 34$ dB. At this $E_b/N_0$, the useful signal clearly drowns in the noise, yet the proposed method can synchronize, as will be illustrated next.
Figure 11 shows the percentage of cases in which the packet offset is estimated incorrectly. An estimate $\hat{\tau}$ is considered incorrect if it does not fall into the interval $\tau - T_c/2 \le \hat{\tau} < \tau + T_c/2$. The solid line depicts the performance of the delay estimation based on the subspace-based frequency-domain search (as presented in Section 3.3), while the dashed line shows the performance of the correlation-based scheme, which can be solved in a similar fashion as the subspace-based scheme but does not require a subspace decomposition. Figure 12 shows the standard deviation of the "good" estimates of $\tau$, expressed as a fraction of the chip duration $T_c$, for the subspace-based (solid line) and correlation-based (dashed line) schemes. In Figure 13, the BER of the symbol estimates is shown for all estimates of $\tau$. We deployed a decorrelating receiver and the vectorization approach described in Section 3.4.
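The two synchronization metrics reported in Figures 11 and 12 can be sketched as follows. This is an illustration under assumptions: the function name `sync_metrics` is hypothetical, and the standard deviation is computed on the estimation errors of the "good" estimates around their mean, which may differ from the exact definition used for the figures.

```python
import math

def sync_metrics(tau_true, tau_hat, t_c):
    """Return (failure rate, std of the good errors as a fraction of t_c)."""
    errors = [est - true for true, est in zip(tau_true, tau_hat)]
    # An estimate is "good" if tau - t_c/2 <= tau_hat < tau + t_c/2.
    good = [e for e in errors if -t_c / 2 <= e < t_c / 2]
    failure_rate = 1 - len(good) / len(errors)
    if not good:
        return failure_rate, float("nan")
    mean = sum(good) / len(good)
    std = math.sqrt(sum((e - mean) ** 2 for e in good) / len(good))
    return failure_rate, std / t_c

# Toy example in nanoseconds with T_c = 3 * T_d = 180 ns.
rate, std_frac = sync_metrics([0, 0, 0, 0], [10, -10, 200, 0], t_c=180)
```

In this toy call, one of the four estimates lies outside the half-chip window, so it counts as a failure and is excluded from the standard deviation.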

Multiple user case
In this section, we test the robustness of the proposed synchronization scheme to multiuser interference. To clearly see the effect of the multiuser interference, we consider a noiseless scenario. We define the signal to interference ratio (SIR) as $\mathrm{SIR} = 10 \log_{10}(P_1/P_I)$, where $P_i$ represents the energy of a single data symbol (bit) of user $i$ at the receiver before the bandpass filtering step is carried out. Note that it corresponds to the $E_b$ for user $i$. As each data symbol (bit) is spread over $N_c$ chips and further over $N_d$ frames containing two pulses (a doublet), the signal energy can be expressed as

$$P_i = 2 N_c N_d \int \left(h^{(i)}(t)\right)^2 dt,$$

where $h^{(i)}(t)$ is the channel corresponding to user $i$. $P_I = \sum_{i=2}^{N} P_i$ collects the energy $P_i$ of all interfering sources $i = 2, \ldots, N$. In Figure 14, the received signal of a single user (a) and three users (b) can be observed (after bandpass filtering). For the three-user case, we take SIR = −10 dB, that is, the two interfering users together are 10 dB stronger than the user of interest. Note that the x-axis represents the number of samples, where sampling is performed with a period of 10 ps.
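The SIR definition translates directly into a scaling rule for the interfering users in a simulation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that all interferers are scaled by one common amplitude gain, which the text does not specify.

```python
import math

def sir_db(p_user, p_interferers):
    """SIR = 10 log10(P_1 / P_I), with P_I the summed interferer energy."""
    return 10 * math.log10(p_user / sum(p_interferers))

def interferer_gain(p_user, p_interferers, target_sir_db):
    """Common amplitude gain for all interferers that yields the target SIR."""
    current = sir_db(p_user, p_interferers)
    # An amplitude gain g scales energy by g**2, i.e. lowers the SIR by 20 log10(g).
    return 10 ** ((current - target_sir_db) / 20)

# Three-user example: two equal-energy interferers, target SIR of -10 dB.
p1, p_int = 1.0, [1.0, 1.0]
g = interferer_gain(p1, p_int, target_sir_db=-10.0)
scaled = [g * g * p for p in p_int]
achieved = sir_db(p1, scaled)
```

With these toy energies the two interferers start at a combined SIR of about −3 dB, so the gain raises their total energy until they are together 10 dB stronger than the user of interest.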
In Figure 15, we present the recovery failure rate versus the signal to interference ratio. Any packet offset estimate $\hat{\tau}$ that does not fall into the range $\tau - T_c/2 < \hat{\tau} < \tau + T_c/2$ is considered to be a failure. The chosen interval is considered to provide sufficient recovery of the energy after the deployment of a decorrelating receiver. We start by choosing a time-hopping code and a polarity code, the latter being a Gold code sequence of length $N_c$. Both codes remain unchanged in all Monte Carlo runs. In each run, a new set of channels and packet offsets is assigned to the users and is kept the same for all values of the SIR, which is changed in steps of 5 dB. $N_s = 30$ is the number of data symbols within a transmitted packet. The oversampling rate $P = 3$ equals the number of frames (doublets) per chip. Delays used in the time-hopping scheme are chosen to be $D_1 = 1$ ns, $D_2 = 2$ ns, and $D_3 = 3$ ns. In Figure 15, the solid, dashed, and dash-asterisked lines correspond to the two-, three-, and four-user cases, respectively. The performance of the algorithm drops as the total number of users increases. This can be explained by the growing influence of the cross-correlation terms as the number of users increases. However, in the four-user case, the algorithm exhibits a low failure rate even for SIR = 0 dB, that is, when the energy of the signal of interest equals the energy of all interfering sources. This could be improved by selecting user codes with lower cross-correlation for any code offset.
Figure 16 shows the standard deviation of the "good" estimates of $\tau$, expressed as a fraction of the chip duration $T_c$. Due to the low number of Monte Carlo iterations and to the unresolvable ambiguity related to the initial sampling point, the three-user scenario has a slightly degraded performance compared to the other scenarios.

CONCLUDING REMARK
In this paper, we have presented an algorithm that provides fast, low-complexity, blind packet synchronization in multiuser TR-UWB systems. Its foremost application could be the fast initial code exchange in multiuser asynchronous UWB ad hoc networks.

Figure 1: The structure of the autocorrelation receiver.

Figure 2: (a) The signal at each of the $M = 3$ integrator outputs. (b) A measured UWB channel impulse response in a typical university building used to generate (a).

Figure 3: The signal at the output of the 1st, 2nd, and 3rd receiver branches. A single chip transmission is considered that comprises $N_d = 3$ doublets. The width of the sliding window integrator is $W = T_c = 3T_d$.

Figure 5: The appearance of the signals at the integrator outputs for the single transmitted data symbol presented in Figure 4.

Figure 6: The structure of the P matrix. Each darkened block (a vector) collects the samples of $p(t)$ and the shifts thereof; $p(t) \neq 0$ for $t \in (0, 2T_c)$.

Figure 9: Block data model $X = [\,G \;\; \mathbf{1}\,]\,[\,S^T \;\; (\mathbf{1}_n \otimes b)\,]^T$ for the asynchronous single user case using a TR-UWB scheme.

Figure 12: Standard deviation of the correctly estimated packet offset delays.

Figure 16: Standard deviation of the correctly estimated packet offset delays.

Relja Djapic was born in Novi Sad, Serbia, in 1975. He received the Electrical Engineering degree from the University of Novi Sad, Serbia, in 2000, and the Ph.D. degree from TU Delft, The Netherlands, in 2006. His research interests include signal processing for communication systems, blind source separation, and synchronization schemes in wireless ad hoc networks and ultra-wideband systems. He is currently with TNO-ICT, Delft, where he is working on broadband communication techniques over cable and coax.

Leus was born in Leuven, Belgium, in 1973. He received the Electrical Engineering degree and the Ph.D. degree in applied sciences from the Katholieke Universiteit Leuven, Leuven, Belgium, in June 1996 and May 2000, respectively. He was a Research Assistant and a Postdoctoral Fellow of the Fund for Scientific Research, Flanders, Belgium, from October 1996 to September 2003. During that period, he was affiliated with the Electrical Engineering Department, Katholieke Universiteit Leuven, Leuven, Belgium. Currently, he is an Assistant Professor at the Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands. During the summer of 1998, he visited Stanford University, and from March 2001 to May 2002, he was a Visiting Researcher and Lecturer at the University of Minnesota, Minneapolis. His research interests are in the area of signal processing for communications. He received the 2002 and 2005 IEEE Signal Processing Society Best Paper Awards. He is a Member of the IEEE Signal Processing for Communications Technical Committee, and an Associate Editor for IEEE Transactions on Signal Processing, IEEE Transactions on Wireless Communications, and the EURASIP Journal on Applied Signal Processing.