A low-complexity channel training method for efficient SVD beamforming over MIMO channels

Singular value decomposition (SVD) beamforming is an attractive tool for reducing the energy consumption of data transmissions in wireless sensor networks whose nodes are equipped with multiple antennas. However, this method is often not practical due to two important shortcomings: it requires channel state information at the transmitter and the computation of the SVD of the channel matrix is generally too complex. To deal with these issues, we propose a method for establishing an SVD beamforming link without requiring feedback of actual channel or SVD coefficients to the transmitter. Concretely, our method takes advantage of channel reciprocity and a power iteration algorithm (PIA) for determining the precoding and decoding singular vectors from received preamble sequences. A low-complexity version that performs no iterations is proposed and shown to have a signal-to-noise-ratio (SNR) loss within 1 dB of the bit error rate of SVD beamforming with least squares channel estimates. The low-complexity method significantly outperforms maximum ratio combining diversity and Alamouti coding. We also show that the computational cost of the proposed PIA-based method is less than the one of using the Golub–Reinsch algorithm for obtaining the SVD. The number of computations of the low-complexity version is an order of magnitude smaller than with Golub–Reinsch. This difference grows further with antenna array size.

reliability of the wireless link, reducing the outage probabilities and boosting the overall energy budget of WSN nodes [2][3][4]. These works show that if channel state information (CSI) is available at both ends of the link, an optimal symbol error rate (SER) is attained by employing the SVD-beamforming method [5]. SVD beamforming consists of using the strongest singular value decomposition (SVD) eigenmode of the MIMO channel [6]. This is implemented by employing the principal right and principal left singular vectors of the MIMO channel matrix as beamforming weights at the transmitter and receiver, respectively.
A key limitation of SVD beamforming is that channel state information (CSI) is required at the transmitter (CSIT). In frequency division duplexing (FDD) systems, CSI has to be computed in the receiver and then sent back to the transmitter, introducing an additional burden to the data traffic. To address this issue, limited feedback techniques have been proposed, whereby the receiver selects the beamforming vectors from a predefined finite and indexed set [7][8][9][10]. Thereafter, only the index of the precoding vector that best matches the channel in effect must be signaled back to the transmitter. An important drawback of this technique is that the data feedback must be performed prior to having the beamforming signal-to-noise-ratio (SNR) gain available across the link, making this approach impractical for low SNR scenarios.
In time division duplexing (TDD) systems, channel coefficients are estimated at both sides of the link using training signals in both directions and by exploiting the reciprocity of the wireless channel [11]. Respective SVDs may then be calculated by both devices from the channel estimates.
Another difficulty of the SVD-beamforming scheme is that obtaining the SVD of the channel matrix is computationally costly. For general applications, the Golub-Reinsch algorithm (GRA) [12] is the most utilized method for calculating the SVD because of its numerical stability, reduced computational cost and acceptable convergence speed [13]. While much research has been done trying to find ways to reduce the complexity of the SVD computation [14,15], existent solutions are still inadequate for implementation in systems with a restricted energy budget and fixed-point computation constraints.
A family of TDD algorithms that require neither channel estimates nor SVD calculations have been explored in [16][17][18][19][20] and provide a way around the above-mentioned difficulties. These methods are based on the power iteration algorithm (PIA) [21] and require several back-and-forth transmissions before achieving a channel estimate good enough for reliable communication. One of the first of these algorithms is proposed in [17], in which an arbitrary symbol precoded with a unit vector is sent from the source. Then, only through normalization, conjugation and retransmissions of the received signals, the SVD-based beamforming link is established. A blind iterative MIMO algorithm (BIMA) is proposed in [18], which unlike [17] does not require a training stage. The precoding and decoding vectors are determined using payload data and are continuously updated while used at the same time for communication. The drawback of the algorithms of [17,18] is their slow convergence (i.e., higher error rate at the beginning of packet transmission) and their poor performance in low SNR scenarios [20]. To improve performance at low SNRs, [20] extends BIMA with an adaptive algorithm, which estimates the principal singular vectors at both sides using a weighted sum of previous estimates and the current received signal. This reduces the detrimental effect of the thermal noise significantly, but the convergence of the algorithm is still slow. Hence, the cited methods are still inadequate for packet-based transmissions and energy-constrained devices.
Our contribution in this work is a method for establishing SVD beamforming by means of the power iteration principle. In contrast to the prior art, however, the proposed method does not realize its power iterations by repeated transmissions over the air, and it uses instead single transmissions of preambles followed by local iterative computations at the receiver. In addition to the energy and time savings obtained this way, an additional trade-off between energy consumption of computations versus quality of the resulting beamforming weights and, consequently, versus bit error rate (BER) performance can be exploited by varying the number of computational iterations. Improving the quality of the beamforming vectors does not require more transmissions over the air, just more computation at each transceiver. For the special case in which only one iteration of the PIA is performed, a reduced-complexity formulation of the method is devised.
After describing the proposed method and modeling it mathematically, we assess its computational complexity and its BER performance. The computational costs of the proposed method and of the popular Golub-Reinsch algorithm (GRA) for performing SVD are determined and compared for different antenna array sizes in terms of number of arithmetic operations. It is shown that the computational cost of the proposed method is less than for GRA in all cases of practical interest. The cost of the reduced-complexity version is an order of magnitude smaller than for GRA.
The BER performance is compared to well-known multiple-antenna diversity techniques, including maximum ratio combining and Alamouti coding. It is shown that both are outperformed by a significant SNR margin, even by the proposed reduced-complexity version. For antenna array sizes up to 64, the reduced-complexity version is shown to attain a BER with an SNR loss smaller than 1 dB with respect to SVD beamforming based on least squares channel estimates and perfect SVD computation by the GRA, while requiring an order of magnitude fewer computations.
The rest of the paper is organized as follows: In Sect. 2, we briefly present the MIMO signal model and SVD-BF in order to establish the nomenclature used. Section 3 presents the structure of the transmissions over the air used by the method and explain the calculations that the devices at each end of the link have to conduct. The performance of the proposed technique is quantified by Monte Carlo simulations in Sect. 4. Finally, Sect. 5 summarizes the main conclusions.

System model
This section introduces the MIMO signal model and the SVD-based beamforming scheme for a system with N t transmit antennas and N r receive antennas. The signal at the receiver can be modeled as where y ∈ C N r is the column vector of received symbols at the N r antennas of destination device 2 , H ∈ C N r ×N t is the MIMO channel matrix of coefficients h ij that represent the complex fading gains from transmit antenna j to receive antenna i, column vector (1) y = Hx + n, x ∈ C N t represents the baseband-equivalent complex training or data symbols transmitted by the N t antennas of source device 1 , and n ∈ C N r is a column vector of complex additive white Gaussian noise (AWGN) with i.i.d. elements with zero mean and ν 2 variance. Throughout this work, the elements h ij are assumed to be i.i.d. circularly symmetric complex Gaussian random variables with zero mean and unit variance. In order to normalize the radiated power, the restriction ||x|| = 1 is imposed, where || · || denotes the Euclidean norm.
The SVD theorem [22] states that any matrix H can be factored as where (·) † denotes the conjugate transpose operator. The matrices for H . The matrix is an N r × N t diagonal matrix of nonnegative real numbers σ k , known as the singular values. These terms can be ordered such that σ 1 ≥ σ 2 · · · ≥ σ min(N t ,N r ) , where rank(H) ≤ min(N t , N r ) of these singular values are nonzero [6]. If CSI is available at both transmitter and receiver, then by using an SVD precoding, the MIMO channel can be decomposed into rank(H) parallel data streams commonly known as eigenchannels. The various eigenchannels have different statistical properties: the strong ones are useful when diversity is needed, while the weak ones can be used for increasing throughput [5]. The highest diversity gain is obtained by transmitting data only over the strongest eigenchannel, which is known in the literature as SVD-BF. The corresponding transmission scheme consists of using the first right singular vector v 1 to precode a scalar payload data symbol d ∈ C , which is then decoded at the receiver with the conjugate transpose of the first left singular vector u † 1 . The resulting communication can be modeled as where ñ is a scalar of complex AWGN with zero mean and variance ν 2 , ỹ is the received symbol d under equivalent thermal noise ñ and channel gain σ 1 . It is to be noted that the statistics of σ 1 can be well approximated using the Nakagami-m channel fading model [5].
The main difficulty of this technique is to obtain CSI at both sides of the link, particularly at the source device, and determine the first singular vectors from the channel matrix H.

Proposed method
In this section, we present a detailed method for establishing an SVD-BF link between two nodes 1 (source) and 2 (destination) in an environment where channel reciprocity between forward (source to destination) and backward (destination to source) transmissions can be assumed. Hence, if the signals in both directions use the same frequency carrier and bandwidth, as in TDD systems, then the channel response is the same [23]. Formally, for MIMO systems, a channel H in the forward direction has a reciprocal channel H T in the backward direction, where (·) T denotes the transpose operator.
Even though non-symmetric characteristics of the RF electronic circuitry break the channel reciprocity, various solutions to that problem are available, either hardwarebased or based on calibration algorithms [24,25]. As addressing this aspect is beyond the scope of this work, we assume that devices 1 and 2 are properly calibrated so that channel reciprocity can be assumed. We also assume perfect packet detection and timing acquisition for all the transmissions. It has been shown that these tasks can be performed with the same preamble structure used here for channel estimation [26].
In the sequel, we first describe the method and its various steps, followed by a detailed description of each one. Then, an algorithm for obtaining the first singular vector from the channel matrix estimate is provided. And finally, we present a computational cost analysis of Golub-Reinsch algorithm, the most common technique for obtaining SVD, for comparing it with the simplified method that we propose.

Conceptual description of the method
The technique for establishing an SVD-BF link entails two types of transmissions: Ping and Pong (Fig. 1). The Ping consists of transmitting a known time-orthogonal preamble from 1 to 2 , which allows for estimating the first left singular vector u 1 at 2 . This type of transmission does not contain payload data. After the Ping, an arbitrary number of Pongs containing preamble and payload may be sent alternatingly in both directions. The first Pong is a transmission from 2 to 1 composed of a preamble and payload data that are precoded at 2 with the left singular vector. The preamble thus received by 1 enables it to estimate the first right singular vector v 1 . 1 then replies to 2 with a next (second) Pong, which has the same structure as the first Pong (preamble followed by payload data), but is precoded with v 1 . The method is described with mathematical formality in Sect. 3.2.
The method might be used for two-way communications, because Pongs may carry payload data in both directions. However, for simplicity of description we present only a one-way communication scheme because the bidirectional case is a straightforward extension. In particular, we present the case when the communication is initiated by a node 1 that has information that it wishes to communicate to a neighboring node 2 . This communication situation requires at least three transmissions: Ping-Pong-Pong ( Fig. 1). It is to be noted that if the communication is initiated by a node that queries a neighboring node to find out if it has information to communicate, then the communication could be achieved with only two transmissions: Ping-Pong.
We focus on the case when the mobility of the environment is slow enough so that the coherence time of the channel is longer than the time required for a Ping-Pong-Pong transmission. In general, two-way SVD-BF communications could be maintained for longer than the coherence time of the channel if new Pong transmissions are made between both nodes more frequently than the coherence time. Furthermore, the re-estimations of singular vectors could be weighed with previous estimates as proposed in [20].
While the proposed method allows for calculating the first singular vectors on both sides of the link, it does not provide the first singular value σ 1 . However, as can be seen in (3), the knowledge of σ 1 is only necessary for decoding the data if the communication system uses amplitude modulation, such as quadrature amplitude modulation (QAM) or amplitude-shift keying (ASK). σ 1 may be estimated in several ways, such as by embedding further pilot symbols in the transmissions. Alternatively, in order not to increase the complexity or transmission overhead of the scheme, only phase modulations, such as quadrature phase-shift keying (QPSK), may be used. This is of particular interest for long-distance transmissions using SVD-BF, because it is more energy efficient to use small modulation sizes for these cases [27].

Mathematical formulation of the method
In the sequel, we describe the Ping and Pong transmissions in detail.

Ping
The Ping consists of sending a known preamble of complex symbols from node 1 to node 2 . The preamble is represented by an N t × L 1 matrix X 1 , whose rows contain the symbol sequences for each transmit antenna, and its columns index symbol time. Thus, L 1 is the duration of the Ping preamble in terms of symbol times. Even though the matrix can be composed by arbitrary sequences of symbols, for computation efficiency at the receiver it is best composed in a staggered form with L 1 /N t training symbols for each antenna [28]. We assume that they are taken from a column vector c 1 of L 1 known training symbols.
The received Ping is therefore where N 1 ∈ C N r ×L 1 is the complex matrix of AWGN at receiver 2 during the Ping reception. Upon reception, channel estimation is performed at the destination node 2 using Y 1 . We present our work based on the least square (LS) channel estimator due to its simplicity and limited computational complexity [28], but any other suitable estimator may be used. The LS estimate of H at 2 is given by It is to be noted that this pseudo-inverse matrix can be precomputed and stored permanently at 2 , so that only the matrix multiplication between Y 1 and the stored pseudo-inverse of X 1 is required with each Ping. An estimate û 1 of the first left singular column vector u 1 can be extracted from Ĥ using a power iteration algorithm. This step is explained later in Sect. 3.3. We assume that the estimation error in û 1 is an additive term r u ∈ C N t such that û 1 = u 1 + r u .

Pong in the backward direction
Using the estimate û 1 , 2 transmits the matrix X 2 =û * 1 c T 2 d T 2 to 1 , where (·) * denotes the complex conjugation, c 2 ∈ C L 2 is a column vector whose elements are a known preamble sequence of length L 2 symbols, d 2 ∈ C D 2 is payload data column vector of length D 2 symbols ( D 2 ≥ 0 ), and c T 2 d T 2 is the concatenation of row vectors c T 2 and d T 2 . Considering that the reverse channel is H T [23], this reverse-channel transmission can be modeled as where N 2 ∈ C N t ×(L 2 +D 2 ) is the AWGN matrix at the receiver 1 during the Pong reception.
An estimate of the first right singular vector v 1 can be obtained at the source 1 using LS estimation from preamble c 2 as follows: where Y 2c is the portion of the received signal Y 2 that corresponds to preamble c 2 , and column vector c 2 c † 2 c 2 −1 of size L 2 is the pseudo-inverse of c † 2 . As before, this vector can be precomputed and stored on each device beforehand. Hence, the calculation of (7) takes one matrix multiplication and one vector normalization.
In case that the backward Pong carries payload data, node 1 can decode it now using v 1 as follows: where Y 2d and N 2d correspond to the parts of the received signal Y 2 and thermal noise N 2 , respectively, that are associated with payload data d 2 . Vector r v is the estimation error of the first right singular vector v 1 and n 2d is the respective AWGN vector with i.i.d. zero mean and ν 2 variance elements. It is to be noted that if estimation errors r u and r v tend to zero, then (8) tends to σ 1 d T 2 + n T 2d , which corresponds to the vector form of (3) when several symbols are transmitted.

Pong in the forward direction
In case node 1 has payload data for node 2 , it transmits X 3 =v 1 c T 3 d T 3 , where c 3 ∈ C L 3 is a column vector of a known preamble of length L 3 symbols and d 3 ∈ C D 3 is the payload data column vector of length D 3 symbols. The received signal at 2 is where N 3 ∈ C N r ×(L 3 +D 3 ) is the AWGN matrix at the receiver 2 during the Pong reception.
It is to be noted that transmitting preamble c 3 at this stage is not strictly necessary for where Y 3c is the portion of the received signal Y 3 that corresponds to preamble c 3 . Again, the pseudoinverse c * 3 c T 3 c * 3 −1 can be precomputed and stored at 2 . The payload data is then decoded as where Y 3d and N 3d correspond to the parts of the received signal Y 3 and thermal noise N 3 , respectively, that are associated with payload data d 3 . n 3d is the corresponding AWGN vector with i.i.d. zero mean and ν 2 variance elements.
A summary of all the steps that were described and that make up a complete Ping-Pong-Pong sequence is presented in Fig. 2.

Computation of the first singular vector
In the sequel, we describe how to estimate the first left singular vector u 1 using a power iteration algorithm (PIA) on channel matrix estimate Ĥ obtained from (5) after a Ping transmission.
The most popular algorithm for computing singular vectors, the Golub-Reinsch algorithm (GRA), as most of the SVD algorithms, calculates all left and right singular vectors together as a set. But we are only interested in calculating u 1 at node 2 after the Ping. The PIA [21] offers a suitable approach to this. We first summarize the general PIA and then provide a simplified one.

General PIA
The PIA allows for computing the first left singular vector u 1 of a matrix H by exploiting the following property [21]: where W = HH † is a Wishart matrix, q 0 ∈ C N r is a random vector with unit norm and exponent m is a positive integer. It is to be noted that an estimate of u 1 can be defined as where Ŵ =ĤĤ † , with Ĥ given by (5) or any other suitable estimate.
Having a random initial vector q 0 instead of a fixed vector gives no benefit when u 1 is unknown, as is our case. Therefore, without loss of generality, we use q 0 [10 · · · 0] T .
We thus utilize the following version of the PIA for obtaining estimate û 1 .
The number of basic mathematical operations needed for each computational step of the algorithm is shown in Table 1.

Reduced-complexity power algorithm
For a lower-complexity algorithm, we can observe that in the special case when m = 1 , the result of matrix multiplication of step 4, with i = m = 1 , is where Ĥ 1,1:N t denotes the first row of Ĥ . We can hence use the following reduced-complexity power algorithm (RCPA) for obtaining û 1 .
The computational cost of the RCPA is smaller by roughly a factor mN r compared to the general PIA, as shown in Table 2:

Computational cost using the Golub-Reinsch algorithm
A comparison of the reduction in complexity that PIA and RCPA provide over the GRA during the Ping stage is provided next.
In "Appendix", we present a study of the computational cost of the GRA. We find that the total cost of performing the SVD for an N × N matrix takes Using the parameters of [29], we calculate the number of cycles that an arithmetic logic unit (ALU) requires for performing the decomposition of Ĥ to obtain û 1 using the GRA, PIA and RCPA. Results show that RCPA provides clear reductions on the complexity with respect to the PIA and GRA (cf. Fig. 3). It will be shown in Sect. 4 that this complexity reduction does not significantly sacrifice bit error rate performance.
It is to be noted that when comparing the computational complexity in terms of ALU cycles per calculated singular vector element, the GRA does require fewer operations than the PIA. But the GRA does not allow for computing only the first singular vector alone and forces to compute the entire SVD each time, resulting in a larger net computational cost than for the PIA, as shown in Fig. 3. The RCPA, on the other hand, requires an order of magnitude fewer operations than the GRA in either case (per vector element and total).

Results and discussion
In this section, we provide simulative valuations of the Ping-Pong-Pong (PPP) method using the PIA and RCPA algorithms.
We performed simulations in which the elements of each realization of the channel matrix H were generated randomly for each run as i.i.d. circularly symmetric complex Gaussian random variables with zero mean and unit variance. Thermal noise samples (15) were generated randomly as i.i.d. circularly symmetric complex Gaussian random variables with zero mean and variance ν 2 .
Ping and Pong packets were assembled considering L 1 = L 2 = 32 , L 3 = 0 and 500 symbols of payload data with an uncoded QPSK modulation. It is to be noted that it does not matter if the payload data is sent in the backward ( D 2 = 500 QPSK symbols, D 3 = 0 ) or forward ( D 2 = 0 , D 3 = 500 QPSK symbols) Pong transmissions. Both cases are equivalent in terms of the BER of the payload data as long as there is no re-estimation of the respective singular vector, i.e., as long as L 3 = 0 , which was always the case. Each PPP composed this way was transmitted over one million channel realizations.
The bit error rate (BER) performance of the proposed technique in 2 × 2 and 4 × 4 MIMO configurations is presented in Figs. 4 and 5, respectively. Both graphs also show the BER performance of a single-input single-output (SISO) channel with flat Rayleigh fading, of maximum ratio combining (MRC) receive diversity [6], of the iterative method presented by Tang [17] and of SVD beamforming with ideal channel knowledge and ideal SVD computation. In the case of 2 × 2 MIMO links, the BER performance of Alamouti coding is also presented [30]. To make a fair comparison, for all cases we considered the same total number of symbols used for channel training (considering both link directions) and the same total sum of signal power transmitted among all antenna branches. This means that in all cases the total energy spent for training transmissions is the same. All schemes used LS channel estimation. We observe for 2 × 2 that the SNR loss with respect to ideal SVD beamforming (curve SVD-BF-Ideal) is approximately 1 dB for the RCPA version (PIA with m = 1 iteration) and approximately 0.1 dB with m = 4 iterations (Fig. 4). In 4 × 4 the respective losses are approximately 2 dB and 0.7 dB (Fig. 5). It is also apparent that the proposed PPP method outperforms the BER of receive MRC, Alamouti and Tang in both MIMO configurations, even when the RCPA is used. The above results suggest that the SNR loss grows with MIMO channel size. In order to explore this aspect, we performed additional BER simulations of MIMO constellations of sizes 8 × 8 , 16 × 16 , 32 × 32 and 64 × 64 . The BER performance of SVD beamforming  with least squares channel estimation and ideal SVD computation was also assessed by simulation. This performance provides a best-case performance reference for the proposed iterative method. Preambles with 64 symbols were used in all cases. The corresponding SNR losses with respect to ideal SVD beamforming (ideal channel knowledge and ideal SVD computation) at BER = 10 −3 are shown in Fig. 6. While the SNR loss of the proposed method clearly grows with antenna array size, even the worst-case performance of RCPA stays within 1 dB of the best-case performance given by SVD beamforming with LS channel estimation. With respect to this latter case, the performance loss of the proposed method with m = 8 iterations is negligible. The overall BER performance of the proposed method at 64 antennas ranges between 5 dB and 6 dB of SNR loss with respect to ideal SVD beamforming. This is smaller than the loss observed for MRC diversity even at 2 × 2 and 4 × 4 configurations (cf. Figs. 4, 5).
The impact of using the re-estimation of vector u 1 at the forward (second) Pong stage, as given by (10), rather than using the û 1 estimated during the initial Ping, as presented in Sect. 3.2, is similar to performing an extra iteration of the PIA in the case without reestimation (Fig. 7). These curves were generated using the same parameters as for Fig. 5.
In the case when m = 1 (RCPA), the SNR improvement gained by the re-estimation can be as large as 1 dB.
The difference in BER between the PPP with m = 4 and the theoretical SVD-BF (with perfect CSI) is due to the channel estimation error. This aspect is evaluated in Fig. 8, where simulations with preambles of length L 1 = L 2 = 4 , L 1 = L 2 = 32 and L 1 = L 2 = 128 symbols are compared for the case of 4 × 4 channels estimated according to (5). We used L 3 = 0 in all cases. As intuition suggests, as the preamble grows in   Tang and superior to that of MRC diversity (compare with Fig 5). While extending the preamble length has diminishing returns in terms of BER performance, it does not spoil the performance gained by varying the number of iterations of the PIA.

Conclusions
In this article, we propose a low-complexity method for establishing a communication link over MIMO channels using SVD-based beamforming. The method takes advantage of the channel reciprocity property in order to acquire estimates of the precoding and decoding first singular vectors at both ends of the wireless link. This is attained with two types of transmissions: an initial Ping, consisting of a space and time orthogonal preamble transmitted once, and Pong, a beamformed preamble followed by beamformed payload data. Pong can be transmitted an arbitrary number of times in both directions, thus allowing for one-way or two-way communications. After an initial beamforming vector estimation at the receiver of the Ping, the receiver of a Pong preamble estimates or reestimates the singular vector that corresponds to that end of the link. This is performed with a power iteration algorithm.
Simulations show that four iterations suffice for attaining a BER within 1 dB of ideal SVD beamforming performance for MIMO array configurations of up to 4 × 4 . With 4 antennas and only one iteration (reduced-complexity algorithm), the SNR loss is within 2 dB of the ideal singular vector computation, but the complexity of the algorithm requires an order of magnitude fewer computations. It is also shown that the proposed method outperforms the BER of maximum ratio combining and of Alamouti coding.
For arrays with 64 antennas, the method is shown to achieve a BER performance within 1 dB of that of SVD beamforming with least squares channel estimates and perfect SVD computation.
The use of the PIA for this task is also computationally more efficient than the Golub-Reinsch algorithm for the SVD, whose main limitation is that it does not allow for computing only the first singular vector alone and forces to compute the entire SVD each time.
The BER degradation due to imperfect channel estimation was shown to be within 4 dB of ideal performance for a worst-case configuration (shortest training preamble, reduced-complexity algorithm). Further simulations show that re-estimating the vector at the Pong has an effect similar to performing an extra iteration of the PIA. The SNR improvement gained by the re-estimation can be as large as 1 dB.

Appendix: SVD computation cost
The Golub-Reinsch algorithm (GRA) [12] is popular for performing the SVD decomposition because of its numerical stability, efficiency and good convergence velocity [31]. Following [13], this "Appendix" analyzes the computational cost of the GRA on a matrix H of size N × N . The algorithm is composed of two phases: a bidiagonalization and a superdiagonal reduction.
In the following, we denote A j:n,k:n as the submatrix of A that contains rows from j to n and columns from k to n of A . Further, blank entries in a matrix represent zeros, while × or + terms represent nonzero coefficients.

Phase I: Bidiagonalization
The bidiagonalization is the process of turning an arbitrary complex matrix H into a bidiagonal real matrix B (i.e., a matrix with zeros in all entries except the diagonal and superdiagonal terms). This is achieved by a series of unitary transformations, which are described in the following.

Description
A Householder reflector is an unitary transformation P 0 that takes the first column of H , h 1 , into the direction of the first canonical axis ê 1 = (1, 0, . . . , 0) T , while rotating the other columns arbitrarily as where || · || represents the Euclidean norm.
A second Householder reflector Q 1 can be applied from the right, while preserving the first column intact, resulting in where g is the first row of the matrix (P 0 H) 1:N ,2:N .
By repeating this procedure with the lower submatrices, we can obtain where B is a bidiagonal matrix of real coefficients, and each P j and Q j is Householder reflector that operates in subspaces of dimension N − j.

Calculation cost
It can be seen that each P j acts non-trivially only over a N − j subspace. Hence, the computation of the non-trivial effect over the (N − j) × (N − j) matrix A can be computed as where P j corresponds to the (N − j) × (N − j) lower submatrix of P j and v ∈ C N −j is a vector calculated as where a 1 is the first column of A and a 11 is the first element of a 1 [31]. The calculation of v costs 2(N − j) real sums, an equal number of products, 1 square root and 1 sign operation. Recalling that one complex product consists of 4 real products and 2 real sums and that 1 complex sum takes 2 real sums, the cost of the application of P j is (16 v = sign(a 11 )||a 1 ||ê 1 + a 1 , 8(N − j) 2 + 2(N − j) − 1 real sums, 8(N − j) 2 + 5(N − j) real products and 1 division. The total cost of the transformation P j is thus given in Table 5.
The application of Q j is done repeating the same procedure to the hermitian of the lower (N − j + 1) × (N − j) submatrix. Therefore, (19) and (20) are valid but with A being an (N − j) × (N − j + 1) matrix. The cost C(Q j ) can be seen in Table 5.
Finally, the total cost of the phase I (cf. Table 3) can be calculated using (18) as

Phase II: Superdiagonal reduction
The second phase of the GRA reduces the upper diagonal terms into zeros, such that the real bidiagonal matrix B is diagonalized. It can be shown that it is not possible to build an algorithm that performs this in a finite number of steps [31]. Hence, this phase consists of reducing the size of the upperdiagonal terms until they are smaller than a given threshold.

Description
This phase entails a series of Givens rotations, which are unitary operations on the 2-dimensional subspace spanned by canonical vectors ê i and ê j . If G i,j (θ) is a Givens rotation on dimensions i and j with an angle θ , its effect on the canonical base {ê k } N k=1 is The first step of the second phase is to apply a Givens rotation T 1 = G 1,2 (θ 1 ) from the right, where the angle θ 1 is chosen such that T T 1 z = ||z||ê 1 for a given z . The effect of the application of T 1 is that a nonzero element is introduced: The rest of the second phase is to perform a series of Givens rotations to displace this nonzero element out of the matrix. It starts with a Givens rotation Q 1 = G 1,2 (θ 2 ) , which makes Q 1 y = ||y||ê 1 , where y is the first column of the matrix BT 1 . The result will have a zero in the desired position, but will introduce a new nonzero entry: in other case.
This procedure can be repeated until the nonzero position reaches the bottom of the matrix: At this stage, a last Givens rotation Q N −1 = G N −1,N (θ 2N −2 ) will act on the lower 2 × 2 submatrix and turn the desired element into a zero entry without introducing new nonzero entries, giving a new bidiagonal matrix This step can be repeated for generating a sequence of bidiagonal matrices B n . It can be shown that B n converges to a diagonal matrix D that has the singular values of the original matrix H.

Calculation cost
First we calculate the number of operations needed in one step of the algorithm, C k , which turns a k-dimensional bidiagonal matrix B n into a new bidiagonal matrix B n+1 . This cost has two sources: the cost of calculating the Givens rotations C (k) calc and the application of the Givens rotations C (k) app . The Givens rotation is used for rotating a two-dimensional vector (α 1 , α 2 ) onto its first axis: Therefore, the generation of a Givens rotation is equivalent to the calculation of cos θ and sin θ as function of (α 1 , α 2 ) . A stable algorithm for doing this is [13]: The average cost of calculating a Givens rotation is 1 sum, 1 product, 2 divisions and 1 square root. As each iteration of the algorithm consists of 2(k − 1) Givens rotations, the total calculation cost is given by C (k) calc (cf. Table 5). We still need to calculate C   (28) if α 2 = 0 : cos θ = 1, sin θ = 0; (29) if |α 2 | ≥ |α 1 | = 0 : v = α 1 /α 2 , w = 1 + v 2 , (30) sin θ = 1/w, cos θ = v/w; (31) else : v = α 2 /α 1 , w = 1 + v 2 , (32) cos θ = 1/w, sin θ = v/w.
The computational cost of the application of T 1 , denoted as C(T 1 ) , is 6 products and 2 sums. By considering (23), the application of the second rotation can be seen as Counting the operations, and recalling that the entry of the second row and first column is zero by construction, one finds that the cost C(Q 1 ) is 8 products and 3 sums. Comparing (23) with (24), one can conclude that all the rotations, excepting the very last one, have the same structure and therefore share the same costs.
The last rotation has the form Q k−1 (Q k−2 . . . T k−1 ) and costs 6 products and 3 sums. Adding all together, we obtain the total operations of the application of one step of the algorithm in a k-dimensional matrix Hence, the total cost of one k-dimensional iteration of the algorithm is It has been reported that the algorithm usually ends with no more than 2N iterations [12]. If we consider an average case where 2 iterations are needed per matrix size from 2 to N, then the total cost of the second phase (cf.  (35) (36) C k = C (k) calc + C (k) app .