FPGA-embedded Linearized Bregman Iteration algorithm for trend break detection

Detection of level shifts in a noisy signal, or trend break detection, is a problem that appears in several research fields, from biophysics to optics and economics. Although many algorithms have been developed to deal with this problem, accurate and low-complexity trend break detection is still an active research topic. The Linearized Bregman Iterations have recently been presented as a low-complexity and computationally efficient algorithm for this problem, with a structure that lends itself naturally to hardware implementation. In this work, a hardware architecture for the Linearized Bregman Iteration algorithm is presented and tested on a Field Programmable Gate Array (FPGA). The hardware is synthesized for different-sized FPGAs, and the fraction of hardware used, as well as the maximum frequency enabled by the design, indicate that a gain factor of approximately 100 in processing time with respect to the software implementation can be achieved. This represents a tremendous advantage of using a dedicated unit for trend break detection applications. The proposed architecture is compared with a state-of-the-art hardware structure for sparse estimation, and the results indicate that its trend break detection performance is substantially better while, at the same time, being the indicated solution for long datasets.


I. INTRODUCTION
TREND BREAK detection in the presence of noise is a broad problem that can be found across different research fields [1]-[5]. For that reason, several different methodologies have been proposed in the literature [6]-[8], with the ones that make use of ℓ1 regularization to counter the problem's inherent high-dimensionality arguably figuring as the most successful ones [9], [10]. Such an approach is required for highly reliable estimation results [8]. Even though such regularization allows the problem to be solved in a computationally efficient manner (usually associated with a complexity that is proportional to a polynomial function of the number of inputs), the fact that a computer can solve the problem does not necessarily mean that the result is achieved quickly, practically speaking. In certain contexts, achieving elapsed algorithm times on the order of seconds as opposed to minutes may have a substantial impact on the application [11], [12].
It is a widespread notion that certain problems, despite their complexity, may be accelerated depending on the implementation; parallel programming, in which several parts of the same procedure are processed independently and simultaneously, is one of the most celebrated examples [13]. Field Programmable Gate Arrays (FPGAs) are extremely versatile hardware structures that offer [14]-[16]: great flexibility to design high-speed, high-density digital hardware; ease of programmability and reconfiguration; energy efficiency; high resource utilization; low cost; and the possibility of combining parallel processing structures with serial control units. FPGAs have been used as a versatile computing platform, accelerating algorithms through dedicated and carefully designed architectures in a wide range of fields such as cryptography [17], image processing [18], and machine learning [19]. Recently, Linearized Bregman Iterations (LBI), a class of implementation-efficient and low-complexity algorithms, have been presented as an extremely attractive solution for trend break detection [1]. In particular, the structures of both the trend break detection problem and the LBI algorithm allow simple hardware units, relying mainly on adders and efficient memory management, to conduct the core procedure, thereby avoiding hardware-complex multiplication and division operations [20].
In this work, the hardware implementation of the LBI algorithm is studied in depth and is simulated and synthesized for different FPGAs. A novel hardware architecture is presented and its main processing units are discussed. VHDL simulation environments enable a step-by-step comparison and validation of the processing stages against the computer implementation of the algorithm [1]. Hardware synthesis results allow determination of both device usage for different FPGA sizes and maximum clock frequency; the former, combined with the average number of clock cycles per iteration loop, makes total processing time calculation possible for different problem instance sizes. A reduction factor of approximately 100 in the elapsed algorithm time is achieved, which represents a substantial upgrade and warrants the usage of dedicated hardware for trend break detection.
Due to the limited memory resources available in an embedded processing unit as opposed to a standard personal computer, a different data format was employed in the hardware implementation. This not only allows for increased data length manipulation inside the embedded unit (an estimated ≥ 60000 data points for a mid-sized FPGA), but also avoids the usage of complex and often slower arithmetic units to handle floating-point representation [21]. A comparison between results using the 20-bit fixed-point (SFIXED) format [22] and the standard 64-bit double representation has been conducted in this work, and negligible discrepancies were observed.
The paper is organized as follows. In Section II, a brief review of the LBI algorithm for trend break detection is presented, including the structure of the candidate matrix and the pseudocode on which the hardware architecture is based. Section III presents the digital hardware architectural concept as well as focused descriptions of its main units; the estimated number of clock cycles until the algorithm elapses is derived based on this architecture. In Section IV, comparative results between the simulated hardware implementation and the Julia code of the LBI are discussed. Synthesis parameters for two target FPGAs (ALTERA CYCLONE V and ALTERA STRATIX V) are also reported. Case studies and implementation results are discussed in Section V, and Section VI concludes the paper.

II. THE LINEARIZED BREGMAN ITERATIONS ALGORITHM FOR TREND BREAK DETECTION
Under the assumption that the trend break detection problem is a sparse one, i.e., that the number of candidate vectors that describe the signal of interest is much smaller than the number of observations, it can be cast into the combined ℓ1/ℓ2 problem of the form [1]:

min_β λ ||β||_1 + (1/2) ||Aβ − y||_2^2,

where A is the dictionary, with each candidate vector stored in a column, β is the vector containing the coefficients of the weighted linear combination of dictionary vectors that approximates the signal of interest represented by the data series y, and λ is a parameter that adjusts the weight of the ℓ1 versus the ℓ2 norm. Adaptation of the Linearized Bregman Iterations algorithm to trend break detection was presented in [1] in a context where a linear trend is also expected in the signal of interest. In order to simplify and generalize the implementation, this linear trend is not considered in the current implementation.
Incorporating the linear trend in the proposed architecture can, however, easily be done. Throughout the manuscript, the length, in data points, of the signal of interest y will be denoted N, i.e., y and β are N-dimensional vectors and A is an N × N matrix. The Linearized Bregman Iterations algorithm has a periodic structure, involving, in a single iteration, an approximate gradient descent (AGD) followed by a non-linear shrink function of the form shrink(v, λ) = max(|v| − λ, 0) · sign(v) [23]. Due to the special structure of the candidate dictionary matrix A for the trend break detection problem, its storage is not necessary for the AGD calculation, as the latter can be rewritten as

v^(i+1) = v^(i) − μ_k (⟨a_k, β^(i)⟩ − y_k) a_k,

where the a_k represent rows of the candidate matrix, the superscript i represents the iteration index, and the index k ∈ [1 : N] controls the cyclic re-use of rows of A as the iteration index evolves, i.e., k = mod((i − 1), N) + 1.
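In software, the shrink function above amounts to scalar soft-thresholding; a minimal Python sketch is given below for illustration (the reference implementation of [1] is in Julia):

```python
def shrink(v, lam):
    """Soft-thresholding: max(|v| - lam, 0) * sign(v).
    Values with magnitude below lam are mapped to zero; larger
    values are pulled towards zero by lam (firmly non-expansive)."""
    return max(abs(v) - lam, 0.0) * (1.0 if v > 0 else -1.0 if v < 0 else 0.0)
```

The dead zone [−λ, λ] is what produces exact zeros in β and, hence, the sparsity of the estimate.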
The a_k, in turn, have an interesting structure that allows the AGD to be further optimized and the calculation to be performed only for those indices j where a_{k,j} ≠ 0. In other words (and also considering the fact that a_{k,j} = 0 for j > k), the update can be restricted to the first k entries, which, considering the computational implementation, translates into accessing and manipulating only those values of vector v^(k) up to index k. A final observation on the structure of matrix A (namely, the fact that it is a square matrix) reveals that a single index k is sufficient to control an iteration of the algorithm. The resulting procedure, presented as pseudocode in Algorithm 1, efficiently solves the trend break detection problem with low memory usage.
Algorithm 1 Linearized Bregman Iteration for Trend Break Detection
Input: Measurement vector y, λ, β_start, v_start, L
Output: Estimated β
1: β ← β_start
2: v ← v_start
3: for i = 1 to L do
4:   k ← mod(i − 1, N) + 1
5:   μ_k ← 1/k
6:   read y_k
7:   e ← Σ_{s=1}^{k+1} β_s − y_k   ▹ instantaneous error with inner product
8:   for j = 1 to k do
9:     v_j ← v_j − μ_k · e
10:    β_j ← shrink(v_j, λ)
11:  end for
12: end for

It is important to note that only the computation-heavy part of the algorithm is depicted in the pseudocode of Algorithm 1. Its purpose is to identify the relevant non-zero values of the β vector that compose the output or, in other words, to reduce the dimension of the detection space by focusing on the subspace spanned by the relevant candidate vectors. After this procedure, it is usual to perform an Ordinary Least Squares (OLS) estimation in this reduced subspace in order to remove any biasing introduced by the algorithm; operating on the reduced subspace found by the LBI drastically reduces the complexity of the OLS. This step, which involves matrix transposition and inversion, can be efficiently conducted on a standard personal computer and, even though it could also be implemented in the same hardware structure that contains the core algorithm [1], the goal of this work is to present the latter, and the OLS step is left as post-processing to be performed in a different processing unit.
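The loop of Algorithm 1 can be sketched in software as follows (Python used here for illustration; the reference code of [1] is in Julia, and names such as lbi_core are ours; the inner-product bound is written as k here, with the exact bound following the dictionary indexing of Algorithm 1):

```python
def shrink(v, lam):
    # Soft-thresholding: max(|v| - lam, 0) * sign(v)
    return max(abs(v) - lam, 0.0) * (1.0 if v > 0 else -1.0 if v < 0 else 0.0)

def lbi_core(y, lam, L):
    """Core LBI loop for trend break detection (OLS debiasing omitted).
    The inner product with the k-th dictionary row reduces to a running
    sum of beta entries, so the candidate matrix is never stored."""
    N = len(y)
    beta = [0.0] * N
    v = [0.0] * N
    for i in range(1, L + 1):
        k = (i - 1) % N + 1              # cyclic row index
        mu = 1.0 / k                     # step size 1/k
        e = sum(beta[:k]) - y[k - 1]     # instantaneous error
        for j in range(k):               # only indices with a_k,j != 0
            v[j] -= mu * e
            beta[j] = shrink(v[j], lam)
    return beta
```

On a normalized step signal, the procedure concentrates the estimate on the break position; an OLS refit on the detected support would then remove the shrinkage bias.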
Also left as a pre-processing step is the scaling of the data vector y, which is necessary to ensure the correct behaviour of step 7 of Algorithm 1 when using the 20-bit fixed-point format; namely, that no overflow of the arithmetic dynamic range occurs when performing the summation of β values. The scaling is intimately connected to the available arithmetic dynamic range, which, in turn, is connected to the memory resources of the FPGA board, thereby constituting a design compromise: in case of over-scaling, the arithmetic dynamic range is hindered; to overcome this, a higher number of bits can be assigned to the data points, which then increases the resources necessary on the FPGA board; on the other hand, in case of under-scaling, the results may overflow, creating errors that can jeopardize the algorithm's convergence. A scaling factor consistent with the algorithm's convergence can be determined according to the following considerations.
The major source of overflows is the sum calculation of the β values in line 7 of Algorithm 1. Empirical tests conducted with the testbench developed in [1] indicated that scaling the data vector y by dividing it by its maximum value yields the results shown in Sect. IV without harmful overflow effects. This is due to the firmly non-expansive property [24] of the shrink function as well as the negative feedback of the error between Σ_{s=1}^{k+1} β_s^(i) and y_k (line 7 of Algorithm 1). To clarify the negative feedback effect, one can multiply both sides of line 7 by −1, i.e., −e = y_k − Σ_{s=1}^{k+1} β_s^(i). This procedure would require the step μ_k in the update of v to also be multiplied by −1.
The algorithm can then be interpreted as a stabilizing loop on the values of v_j^(i), with the mentioned negative feedback acting on the deviation between the sum of β values (functions of v_j^(i)) and the corresponding y_k, which causes overshoots of the sum of β values to be immediately corrected in the following iterations. As a consequence, even in worst-case scenarios (multiple up and down trend breaks in the measurements), the sum of β values scarcely goes above unity, considering the above-mentioned normalization procedure. Moreover, even though its rate of occurrence is negligible, in case an overshoot does occur, the excess value is small and therefore does not compromise the convergence of the algorithm, as the performance of the quantized version in Sect. IV demonstrates. The closeness of these results to the ones obtained using double-precision floating point shows that, in practice, harmful effects due to overflow can be neglected.
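The normalization described above reduces to a simple pre-processing step; a Python sketch for illustration:

```python
def normalize(y):
    """Scale the data vector by its maximum absolute value, so that
    the running sum of beta values stays near or below unity and the
    fixed-point arithmetic does not overflow."""
    peak = max(abs(s) for s in y)
    return [s / peak for s in y] if peak > 0 else list(y)
```

The division happens once, off-chip, so the embedded unit itself still needs no divider for the data path.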
Another important observation is that the stopping criterion developed in [1] is neglected here, and a fixed number of iterations is considered instead; the reason for this is twofold. First, the context in [1] was that of fiber fault location, and the stopping criterion was developed based on a specific phenomenological observation that cannot be extended to more general trend break detection problems. Secondly, with a fixed number of iterations, the standard 64-bit double results can be compared to those of the 20-bit fixed-point implementation in a fair scenario; in other words, the quantization could potentially influence the effect of the stopping criterion, making the comparison biased.

III. FPGA ARCHITECTURE
Even though the iterative nature of the Linearized Bregman Iterations algorithm does not allow for parallelization over the iterations, two core operations that permit parallel pipelining can be identified within a single iteration, as presented in Algorithm 1: the summation of k entries of the vector β; and the processing (including update, shrinkage, and storage) of vectors v and β. By instantiating parallel memory structures, both operations, which represent the computational bottlenecks of the algorithm's iterations, can be optimized. On the one hand, the summation can be efficiently performed in a so-called parallel adder tree (PAT), in a logarithmic number of time steps, given that the data can be accessed in parallel. On the other hand, parallel processing of the data in vectors v and β can also be accelerated if storage can be performed in parallel. Since the algorithm relies on the computation of several iterations to converge, optimizing these two procedures allows for substantial gains in processing time.
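The PAT reduction can be modeled as repeated pairwise additions, completing in log2(M) stages for M parallel inputs (a behavioural Python sketch of the hardware, not the VHDL itself):

```python
def pat_sum(values):
    """Parallel adder tree: sum adjacent pairs until one value remains.
    Assumes len(values) is a power of two, as in the hardware, where
    unused inputs are fed zeros. Returns (sum, number of stages)."""
    stages = 0
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        stages += 1
    return values[0], stages
```

Each stage maps to one pipeline level of adders in the FPGA fabric, so the latency grows with log2(M) while the throughput stays at one parallel row per clock cycle.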

A. Memory Structure
In order to harness the parallel speedup of the PAT, the entries of vector β must also be accessed in parallel, which can be accomplished through the instantiation of parallel Block RAMs (BRAMs). The data storage is structured such that consecutive entries are interleaved across the BRAMs, i.e., entry β_n is stored in BRAM m = mod(n − 1, M) + 1 at address ⌈n/M⌉ − 1, where M is the number of parallel BRAMs available in the FPGA. In such a structure, a single arbitrary BRAM, say m, will contain the entries

β_m, β_{M+m}, β_{2M+m}, ..., β_{(⌈N/M⌉−1)M+m},

where the ceiling operator is denoted as ⌈·⌉.
Vector β, however, is not the only vector stored throughout processing: vectors y and v are also necessary. Since all of these contain the same number N of entries, the data is sectioned such that the address depth of each BRAM is divided into three slices with address pointers (ap) associated with β (β_ap), y (y_ap), and v (v_ap); β_ap is arbitrarily set to zero. Under this rationale, entries of vectors v and y appear at addresses t + v_ap and t + y_ap, respectively, even though, for simplicity, only entries of vector β are shown in Eq. 5. Using this data storage structure, all positions [tM + 1 : tM + M] of any of these vectors can be accessed from the parallel BRAMs within a clock cycle; such a data segment will henceforth be referred to as a parallel row, with t the parallel row pointer following its definition in Eq. 5. A block diagram of the digital hardware architecture depicting a single BRAM and including the major structures of the LBI algorithm hardware implementation is presented in detail in Fig. 1.
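Under this storage scheme, the mapping from a (1-based) vector index n to a BRAM number and address can be sketched as follows (Python; the address-pointer offsets are design choices of the implementation):

```python
def bram_location(n, M, section_offset=0):
    """Map entry n (1-based) of a vector to (BRAM number m, address).
    Entries [t*M + 1 : t*M + M] share address t across the M BRAMs,
    forming one 'parallel row'; section_offset selects the beta, y,
    or v slice of the BRAM address space."""
    m = (n - 1) % M + 1          # BRAM number, 1..M
    t = (n - 1) // M             # parallel row pointer
    return m, t + section_offset
```

With this interleaving, one read of address t on all M BRAMs simultaneously delivers a full parallel row to the PAT.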
Apart from the PAT, a Pipelined Multiplexer Tree (PMT) is used to select the specific value of the data series y, namely y_k, which is subtracted from the result of the β sum (refer to line 7 of Algorithm 1). The architecture of the PMT is such that its number of stages matches that of the PAT, so synchronization between the two outputs is naturally ensured. Furthermore, the selection key that acts on each stage of the PMT is derived from the cyclic iteration index k. The calculation of μ_k = 1/k, which involves a computation-heavy division, is delegated to a pipelined CORDIC structure; the pipeline's stages are pre-filled before the iterations are started and stage propagation is enabled at each new iteration, ensuring that the correct value is always available.
Based on this memory structure, the number of clock cycles necessary to complete the calculation of the error depends both on the number of data points and on the depth of the PAT, which, in turn, depends on the number of instantiated (or available) parallel BRAMs in the hardware structure. For an arbitrary iteration cycle, with cyclic index k, the number of clock cycles spent on the reading and processing of β values up to the output of the PAT is ⌈k/M⌉ + log2 M. Taking into account also the subsequent subtraction and multiplication steps (refer to Fig. 1), each taking one clock cycle, the total number of clock cycles amounts to C_r = t̄ + log2 M + 2, where t̄ = ⌈k/M⌉ denotes the maximum value of t during an iteration.
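The cycle count C_r derived above can be checked numerically (a Python sketch following the formula in the text):

```python
import math

def read_cycles(k, M):
    """Clock cycles to read the beta entries, traverse the PAT, and
    perform the subtraction and multiplication:
    C_r = ceil(k/M) + log2(M) + 2."""
    t_bar = math.ceil(k / M)
    return t_bar + int(math.log2(M)) + 2
```

For example, with k = 10000 and M = 1024 BRAMs, ten parallel rows plus ten tree stages plus two arithmetic steps are needed.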

B. PAT Input Control
Even though a parallel row is accessible at each clock cycle due to the parallel instantiation of the BRAMs, clearly not all values in the row will be used during a given iteration with index k. For that reason, a multiplexer (PAT input MUX in Fig. 1) is connected immediately after the BRAM output, with its remaining input connected to a null value. Due to the additive identity property of 0, the output of the multiplexer can be directed to the PAT without corrupting the result while accommodating the parallel storage structure.
The selection signal that controls the PAT input MUX is derived from the fact that replacing BRAM outputs by 0 is only necessary during the last parallel row access, i.e., when t = ⌈k/M⌉ = t̄. Selection is thus based on an auxiliary counter that records the aforementioned value and on a so-called unary code (or thermometer code), which encodes the last column index that contributes to the sum. Fig. 2 depicts the control unit responsible for handling the BRAM input address and the selection of the PAT input.
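The effect of the thermometer-coded selection can be sketched as follows (a Python model of the PAT input MUX behaviour; names and the 0-based row pointer convention are ours):

```python
def pat_inputs(row_values, k, t, M):
    """Select the PAT inputs for parallel row t (0-based) during an
    iteration with cyclic index k: values pass through unchanged except
    in the last row access, where a unary (thermometer) mask keeps only
    the first r = k - t*M entries and nulls the rest."""
    last_row = (t == (k - 1) // M)       # is this the final row access?
    r = k - t * M if last_row else M     # number of valid entries
    return [v if i < r else 0 for i, v in enumerate(row_values)]
```

For k = 10 and M = 8, the first row (t = 0) passes all eight values, while the second and last row (t = 1) passes only the first two and feeds zeros to the remaining PAT inputs.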

C. β and v Storage
An indispensable step of the algorithm is the correct storage of the vectors β and v after processing. According to Algorithm 1, each element of vector β is processed by the shrink function right after the processing of the corresponding element of vector v. As previously pointed out, acceleration of the storage procedure tackles one of the algorithm's bottlenecks: the number of clock cycles taken by the storage procedure scales the total number of clock cycles necessary for the algorithm to elapse. Both the fact that the BRAMs allow for writing and reading through two independently addressed ports and the fact that the shrink function implements the identity transformation when λ is set to zero have been harnessed to optimize data storage, as detailed in the following.
One of the BRAM's ports (taken as B without loss of generality in Fig. 1) is responsible for reading the values of v from the memory, while the other port (A) is responsible for storing the values of β and v. The addresses are controlled such that, on the first clock cycle, the values of v in a parallel row are read (through port B), processed in the 20-bit full adder, and sent to the shrink function with λ = 0. At the following clock cycle, the stable value of v can be stored (through port A) at the same time as λ is changed in the shrink function, which then processes the values of v being read (through port B). In the third clock cycle, a stable value of β is stored (through port A) while the values of v from the following parallel row are accessed (through port B), initiating a new storage cycle for a subsequent parallel row.

D. Total Clock Cycle Estimation
The net number of clock cycles per parallel row storage is, thus, two if one does not count the very first and last accesses; therefore, the number of clock cycles necessary at an arbitrary iteration with cyclic index k is C_s = 2⌈k/M⌉ + 2 = 2t̄ + 2. Two extra clock cycles are also necessary for the hand-shaking protocol between the iteration control unit (presented in Fig. 2), whose control over the BRAM addresses is relieved, and the writing unit that takes over control and stores vectors β and v, i.e., C_s = 2t̄ + 4. Combining this value with the previously determined C_r for the reading and processing of β values in the PAT, the total number of clock cycles spent in an arbitrary iteration with index k is C_T = 3(t̄ + 2) + log2 M. The total number of clock cycles taken by the algorithm to elapse can be derived from this equation by summation over the L iterations:

C = F + Σ_{i=1}^{L} [3(t̄_i + 2) + log2 M].    (6)

The factor F in Eq. 6 accounts for pre- and post-processing instructions performed by the control unit such as: master resets; granting control over the BRAMs; and, most importantly, preemptively filling up the pipelined CORDIC that calculates μ_k. However, as will be described in the next Section, the value of F is much smaller than the total number of clock cycles taken by the core procedure. Fig. 3 presents the dependence of the total number of clock cycles until the algorithm elapses on the number of available parallel BRAMs for a fixed number of data points, and on the number of data points for a fixed number of available BRAMs. In both cases, the number of iterations per data point (defined as L/N) is fixed at 650, a realistic value which will be discussed further in the next Section. Considering a realistic scenario, where the maximum clock frequency achievable in the target FPGA is around 100 MHz, a 10000-point data series would be processed in less than two seconds, which represents an approximately 100 gain factor when compared to the Julia implementation reported in [1].
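Eq. 6 can be evaluated numerically; with the synthesis figures reported in Section IV (1024 BRAMs at 109 MHz, F = 21), the estimate below lands close to the quoted processing time of roughly two seconds for a 10000-point series (a Python sketch):

```python
import math

def total_cycles(N, M, iters_per_sample=650, F=21):
    """Total clock cycles per Eq. 6: C = F + sum over all L iterations
    of C_T = 3*(ceil(k/M) + 2) + log2(M), with the cyclic index k
    sweeping 1..N a total of iters_per_sample times."""
    log2M = int(math.log2(M))
    # cost of one full sweep over k = 1..N
    sweep = sum(3 * (math.ceil(k / M) + 2) + log2M for k in range(1, N + 1))
    return F + iters_per_sample * sweep
```

Dividing the cycle count by the clock frequency gives the processing time; the example below checks that the 10000-point, 1024-BRAM, 109 MHz case falls in the sub-two-second range.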

IV. VALIDATION AND SYNTHESIS
Comparing the hardware implementation of the Linearized Bregman Iterations algorithm, using the architecture presented in the previous Section, with its software implementation counterpart [1] permits validating the former. In order to provide a bit-true validation, the SFIXED standard used in the VHDL simulation was implemented in Julia, allowing one to follow, step by step, the evolution of the algorithm on both platforms and identify any discrepancies. Since the rounding procedure is the same for both, no such discrepancies were observed: the fixed-point Julia simulation code outputs exactly the same values as the hardware implementation. The validation of the hardware implementation and the demonstration of its equivalence to the Julia SFIXED implementation create a versatile tool to estimate the performance of the FPGA results in a software environment. For the simulation of the hardware implementation, the MODELSIM VHDL simulation environment was employed. In such an environment, both the evolution of the algorithm and the number of clock cycles necessary to run each iteration can be extracted, so the results of Eq. 6 can also be ascertained. Even though it is an extremely reliable and versatile tool, VHDL simulation has a drawback in terms of running time: simulating a high number of BRAMs or a large dataset can be extremely time-consuming. For this reason, a predetermined set of parameters (data points, iterations, and number of BRAMs) was chosen to showcase the validity of the hardware implementation.
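A bit-true software model hinges on reproducing the fixed-point rounding. Below is a minimal sketch of quantization to a 20-bit signed fixed-point word (Python; the split of 16 fractional bits is our assumption for illustration, not a figure taken from the text):

```python
def to_sfixed(x, frac_bits=16, word_bits=20):
    """Quantize x to a signed fixed-point value with the given word
    width and number of fractional bits, rounding to nearest and
    saturating at the representable range."""
    scale = 1 << frac_bits
    q = round(x * scale)
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    q = max(lo, min(hi, q))              # saturate instead of wrapping
    return q / scale
```

Applying such a function after every arithmetic operation in the software model is what makes a step-by-step, value-for-value comparison with the VHDL simulation possible.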
Table I contains the information regarding the simulation of the hardware structure under the different parameter conditions, where B stands for the number of BRAMs, and L and N follow the previously defined notation. The estimated numbers of clock cycles based on Eq. 6 that appear in Table I take into account the required F = 21 extra clock cycles for initialization and control. Up to 1024 BRAMs could be instantiated in the STRATIX V, with a maximum clock frequency as high as 109 MHz, yielding a 1.91-second processing time for a 10000-point data series considering 650 iterations per sample (overall, 6.5 million iterations). This result represents a 100 speedup factor in processing time with respect to the software implementation under the same data conditions running on an INTEL XEON CPU E5-2690 V4 at 2.6 GHz with 512 GB RAM, a major achievement, which advocates for the dedicated hardware solution for the trend break detection problem. It is also interesting to note that, for a smaller FPGA, the CYCLONE V, a ∼16 gain factor with respect to the software implementation was achieved, which is interesting in the sense that smaller FPGAs generally exhibit significantly lower costs but can still deliver processing times in the range of a few seconds.
The results from Table II also indicate that a compromise exists between the instantiation of a higher number of BRAMs (which reduces the total number of clock cycles necessary for the algorithm to elapse, as determined by Eq. 6) and the maximum achievable clock frequency. In fact, the processing time for 512 instantiated BRAMs was lower than that for 1024 because the gain in clock frequency superseded the reduction in clock cycles. It should be mentioned, however, that the place-and-route problem is an extremely complex one and the algorithms that solve it may not always reach the best possible solution, so the clock frequency values obtained should be interpreted as lower bounds. Finally, to put the results into an application-oriented perspective, fiber profiles as long as 50 km could be analyzed in search of breaks in under 10 seconds [1].

V. CASE STUDY RESULTS
Validation of the software Julia SFIXED implementation performed in Section IV allows one to investigate aspects of the hardware implementation in a more suitable simulation environment. This is important due to the amount of simulation workload necessary to yield statistically relevant results. A case in point is the investigation of the quality of estimation as L, the total number of iterations, increases, for which the results of over 15000 different testbench-simulated profiles, analyzed with the hardware-validated bit-true Julia SFIXED implementation, are presented in Fig. 4.
The objective of this extensive study is twofold: firstly, as already mentioned, to empirically determine the interval of numbers of iterations per sample for which the estimation quality reaches a reliable level; and, secondly, to compare the quality of estimation between the 64-bit double implementation and the 20-bit SFIXED implementation for profiles with different numbers of data points and different numbers of iterations per sample. The profiles analyzed with the two versions of the algorithm (64-bit double and 20-bit SFIXED) were created following the rationale of the testbench profile generation presented in [1], i.e., noiseless sparse vectors with randomly sorted magnitudes (β_ideal) are used to create profiles using the candidate matrix A, to which white Gaussian noise is added. The result of the estimation is β_est, which is then compared to β_ideal using the squared error norm; the closer to zero the error between the estimated and ideal vectors, the better the estimation. This procedure is similar, but not equivalent, to the one described in [1], since no slopes are considered and the noise parcel is not derived from any phenomenological observations.
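The testbench generation rationale can be sketched as follows (Python; realizing the candidate matrix as a running sum is an assumption based on Section II, and the magnitude and noise parameters are illustrative, not those of [1]):

```python
import random

def make_profile(N, n_breaks, noise_std, seed=0):
    """Create a noisy piecewise-constant profile: a sparse beta_ideal
    with randomly placed, randomly signed magnitudes is passed through
    the step dictionary (a cumulative sum), and white Gaussian noise
    is added to the result."""
    rng = random.Random(seed)
    beta_ideal = [0.0] * N
    for pos in rng.sample(range(N), n_breaks):
        beta_ideal[pos] = rng.choice([-1, 1]) * rng.uniform(0.5, 2.0)
    y, acc = [], 0.0
    for b in beta_ideal:
        acc += b                         # A @ beta via running sum
        y.append(acc + rng.gauss(0.0, noise_std))
    return y, beta_ideal

def squared_error(beta_est, beta_ideal):
    """Squared error norm between estimated and ideal sparse vectors."""
    return sum((a - b) ** 2 for a, b in zip(beta_est, beta_ideal))
```

Running both the double and the SFIXED models over many such profiles and averaging squared_error is what produces curves like those of Fig. 4.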
From the results of Fig. 4, it is possible to conclude that the differences in estimation between the SFIXED and 64-bit double implementations are negligible, i.e., the hardware implementation will have no problem achieving estimation accuracy comparable to that of the software version, e.g., in [1]. Furthermore, it becomes clear that, after 450 iterations per sample, the accuracy stabilizes, with an averaged squared error norm value on the order of 0.5. This indicates that the results from Fig. 3 with 650 iterations per sample are indeed realistic. Finally, this result is also useful when interpreted along with those of Fig. 3: a compromise between the total number of clock cycles before the algorithm elapses and the quality of the estimation can be found and, in specific cases, one of these can be sacrificed (increasing the processing time or allowing for a worse estimate) to boost the other (faster results or an extremely precise estimation).

VI. CONCLUSIONS
Trend break detection, or level-shift detection, is a problem that permeates several fields of science, and an efficient, accurate, and highly reliable processing unit to solve it is desirable. Combining the flexible hardware design tools of Field Programmable Gate Arrays with the efficient Linearized Bregman Iterations algorithm allowed the development of such a unit. The manipulation of the data storage structure as well as of the algorithm flow and control in hardware yielded an up to 100 times gain in processing time when compared to a personal computer, while maintaining all the observed qualities of the algorithm, such as low estimation error and high level-shift detection precision.
The proposed hardware architecture can be implemented in different-sized FPGAs, with the main distinctions being the number of available dual-port Block RAMs and the maximum achievable clock frequency, characteristics which are hardware-dependent. On a middle-sized chip such as the ALTERA CYCLONE V, the hardware supports up to 256 parallel BRAMs with a maximum clock frequency of 81 MHz and a total processing time of 6 seconds. Such processing prowess can be directed towards on-line data supervision such as optical fiber monitoring, which constitutes an exciting future point of investigation. Furthermore, incorporating advanced signal processing techniques into the hardware design in order to eliminate any pre-processing step while increasing the convergence speed is also a sought-after goal for future studies.

Fig. 1. Hardware implementation of a single BRAM slice in the LBI core structure. The parallel structures are pictorially depicted in three-dimensional depth. The Pipelined Multiplexer Tree (PMT) is synchronized to the PAT such that, after a summation, the correct value of y is ready for subtraction. The value of μ_k (refer to Algorithm 1) is calculated in a pipelined CORDIC structure.

Fig. 2. Iteration control architecture. A NEW ITER flag generated by a higher-level unit and the clock signal are the necessary inputs. The cyclic iteration index counter k is implemented through a simple counter with parallel load, dependent on the comparison with the signal length N. The unary code propagates at each new iteration and, when the (M + 1)-th stage is reached, it auto-resets while also incrementing the t counter. The unary code acts on the PAT input MUX when t = t̄; only the selection for m = 1 is depicted for clarity. The different address pointers are combined with the counter t to produce the correct BRAM address.

Fig. 3. Estimate of the total number of clock cycles necessary for the algorithm to elapse considering the presented architecture. (a) Dependence with respect to the number of available parallel BRAMs for a fixed number of 10000 data points and 650 iterations per data point. The gray-shaded areas highlight the transitions between powers of 2, which manifest as sharp increases in the calculated value of C. (b) Dependence with respect to the number of data points for a fixed number of available BRAMs and 650 iterations per data point. The highlighted curve corresponds to 2000 BRAMs, which is the maximum available for the largest target FPGA studied here.
Note: the estimated clock cycle counts in Table I take into account the required F = 21 extra clock cycles for initialization and control. The asterisk in the last column indicates that 2048 BRAMs is above the maximum of 2000 BRAMs with 20-bit-wide data entries available in the target ALTERA STRATIX V FPGA.

Fig. 4. Estimation results for both SFIXED and floating-point implementations for different profile lengths and different numbers of iterations per sample; L is simply this value multiplied by the number of points (N) in the profile. The results are divided into two panels for ease of visualization.