# Automatic ARIMA modeling-based data aggregation scheme in wireless sensor networks

- Guorui Li^{1} (corresponding author)
- Ying Wang^{2}

**2013**:85

https://doi.org/10.1186/1687-1499-2013-85

© Li and Wang; licensee Springer. 2013

**Received: **12 November 2011

**Accepted: **27 February 2013

**Published: **25 March 2013

## Abstract

Data aggregation is an important method for conserving energy in wireless sensor networks (WSNs) because it eliminates the inherent redundancy of raw data. In this article, we develop an automatic autoregressive integrated moving average (ARIMA) modeling-based data aggregation scheme for WSNs. The main idea behind this scheme is to decrease the number of data values transmitted between sensor nodes and aggregators by using a time series prediction model. The proposed scheme saves the battery energy of wireless sensor nodes while keeping the predicted data values at the aggregators within an application-defined error threshold. We show through experiments with real data that the predicted data values of the proposed scheme fit the real sensed data values very well and that fewer messages are transmitted between sensor nodes and aggregators than in the native data aggregation scheme. Furthermore, the characteristics of the proposed data aggregation scheme are also discussed in this article.

### Keywords

Wireless sensor networks; Data aggregation; Time series analysis; ARIMA model; Prediction

## 1. Introduction

Wireless sensor networks (WSNs) are made up of a large number of spatially distributed autonomous sensor nodes that jointly monitor physical or environmental conditions, such as temperature, humidity, vibration, pressure, sound, motion, or pollutants [1]. These sensors can be scattered randomly in harsh environments such as battlefields or placed deterministically at specified locations to collect information from the environment. Typical application fields of WSNs include industrial process control, security and surveillance, traffic control, home automation, environmental sensing, and structural health monitoring [2].

In WSNs, the communication cost of a sensor node is often several orders of magnitude higher than its computation cost. For instance, transmitting and receiving one bit costs 600 and 670 nJ, respectively, on a MICAz node [3] and 720 and 810 nJ, respectively, on a TelosB node [4]. In contrast, the computation energy cost for one bit is only 3.5 and 1.2 nJ on the two nodes, respectively [5]. Therefore, data aggregation is often adopted as an effective way to save the battery energy of wireless sensor nodes by eliminating the inherent redundancy in the raw data and avoiding unnecessary data transmission. Moreover, data aggregation is also useful for extracting application-specific general information from the raw data collected from the sensor nodes [6]. Hence, it is critical for WSNs to support data aggregation schemes.
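To make the gap concrete, the following sketch divides the per-bit radio energy by the per-bit computation energy for the two nodes. It assumes the first figure of each pair quoted above is the transmission cost and the second the reception cost, and that 3.5 nJ and 1.2 nJ belong to MICAz and TelosB in that order; the dictionary layout and function name are illustrative only.

```python
# Per-bit energy figures quoted in the text, in nanojoules (nJ).
# Assumption: each pair is (transmit, receive) and the computation
# costs are listed in the same node order.
ENERGY_NJ_PER_BIT = {
    "MICAz": {"tx": 600.0, "rx": 670.0, "cpu": 3.5},
    "TelosB": {"tx": 720.0, "rx": 810.0, "cpu": 1.2},
}


def radio_to_cpu_ratio(node: str) -> float:
    """Energy to send and receive one bit divided by energy to compute on one bit."""
    e = ENERGY_NJ_PER_BIT[node]
    return (e["tx"] + e["rx"]) / e["cpu"]


for name in ENERGY_NJ_PER_BIT:
    print(f"{name}: radio/CPU energy ratio per bit = {radio_to_cpu_ratio(name):.0f}")
```

Under these assumptions, moving one bit across a link costs a few hundred to over a thousand times as much energy as computing on it, which is why trading extra computation for fewer transmissions can pay off.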

There has been much recent research on data aggregation schemes in WSNs. Typically, the whole sensor network is partitioned into a hierarchical structure that consists of a sink node, aggregators, and ordinary sensors. The aggregator uses specific functions, such as mean, min, or max, to aggregate incoming readings, and only the aggregated results are forwarded to the sink. Therefore, communication overhead can be reduced and packet collisions can be avoided by decreasing the number of transmitted messages. A comprehensive survey on data aggregation schemes in WSNs was presented in [7]. We briefly review some representative data aggregation schemes in Section 2.

In this article, we propose an automatic autoregressive integrated moving average (ARIMA) modeling-based data aggregation scheme, which uses a time series model to predict the data of the next several periods at both ordinary sensor nodes and aggregators based on the same set of recent data values. The sensor node builds an appropriate time series model from recently sensed data values and automatically transmits the parameters of the model to the aggregator. When the prediction error between the sensed value and the predicted value is within the application-specified error threshold, the sensor node does not transmit the sensed value to the aggregator. In this case, the aggregator regards the predicted value as the sensed value in the current data collection period. When the prediction error is beyond the application-specified error range, the sensor node rebuilds the time series model and transmits the sensed value together with the new model to the aggregator in order to replace the incorrect predicted value and the unsuited prediction model. We show through experiments that the predicted values of the proposed scheme fit the real sensed values very well and that fewer messages need to be transmitted between sensor nodes and aggregators.

The remainder of this article is organized as follows. In Section 2, we review some related works. In Section 3, we present our automatic ARIMA modeling-based data aggregation scheme. In Section 4, we describe our experiment settings and evaluation results. Finally, we conclude this article and present future directions in Section 5.

## 2. Related works

There has been extensive research on data aggregation schemes in WSNs. According to the underlying routing structure, the proposed data aggregation schemes can be categorized into four classes: tree-based, cluster-based, multi-path, and hybrid data aggregation schemes [8].

In a tree-based data aggregation scheme, a spanning tree rooted at the sink is constructed and data aggregation operations proceed level-by-level from its leaves to its root. However, the cost of maintaining such a dynamic hierarchical tree structure is very high. In a cluster-based data aggregation scheme, sensor nodes are divided into clusters and some special nodes, referred to as cluster heads, are selected to aggregate data locally and forward the result to the sink. In order to balance the energy cost of data aggregation, the cluster head role is rotated within the cluster. In a multi-path data aggregation scheme, data are sent over multiple paths and aggregation is performed over these paths as packets move towards the sink level-by-level. In this kind of scheme, higher robustness is achieved at the cost of extra overhead. A hybrid data aggregation scheme tries to overcome the problems of both the tree-based and multi-path structures by combining the best features of both schemes. Hence, the whole network is organized into regions, each implementing one of the above two schemes. The main difficulty is how to connect regions running different aggregation schemes.

More specifically, Heinzelman et al. [9] proposed low-energy adaptive clustering hierarchy (LEACH), which clusters sensor nodes and lets the cluster head aggregate data. The cluster head then transmits the aggregated results directly to the sink. Lindsey and Raghavendra [10] proposed the power-efficient data gathering protocol for sensor information systems (PEGASIS), which organizes all sensors into a chain structure and rotates the node that communicates with the sink. Both LEACH and PEGASIS assume that each node in the network can reach the sink directly in one hop, which limits the size of the network for which they are applicable. Intanagonwiwat et al. [11] proposed the greedy incremental tree, which establishes an energy-efficient tree by attaching all sensors greedily onto an energy-efficient path and pruning less energy-efficient paths. However, it might lead to high communication cost in moving event scenarios because branches are pruned frequently. Zhang and Cao [12] proposed dynamic convoy tree-based collaboration, which assumes that the distance to the event is known to each sensor and uses the node near the center of the event as the root to construct and maintain the aggregation tree dynamically. However, it involves heavy message exchanges, which might eliminate the benefit of aggregation in large-scale networks. Ding et al. [13] proposed an energy-aware distributed aggregation tree scheme, which is based on an energy-aware distributed heuristic. It relies only on local knowledge of the network topology and gives sensor nodes with higher residual power a higher chance of becoming non-leaf tree nodes. Xu et al. [14] proposed the cooperative data aggregation (CDA) scheme, which is based on a cooperative communication mechanism. The heuristic algorithm MCT for CDA and its distributed implementation DMCT were also proposed in [14]. Recently, Villas et al. [15] proposed the dYnamic and scalablE tree Aware of Spatial correlaTion (YEAST) scheme, which exploits the spatial correlation between sensor nodes. The sensor nodes that detect the same event are grouped in a correlated region, and the group head is selected and rotated in each round. On the other hand, a structure-free real-time aggregation scheme was proposed by Yousefi et al. [16]. It combines temporal and spatial convergence of packets using a judicious waiting policy and a real-time data-aware anycasting policy, respectively, without explicit maintenance of a structure. Xiang et al. [17] investigated the application of compressed sensing theory to data collection in WSNs with the goal of minimizing the network energy consumption through joint routing and compressed aggregation. They proposed a mixed-integer programming scheme in [17] and a dual-level compressed aggregation scheme in [18].

However, none of the above data aggregation schemes have considered the problem of decreasing the number of transmitted data values between ordinary sensors and aggregator. They take for granted that sensor nodes periodically report sensed data values to the aggregator. However, the energy cost of data transmission and reception between them is not trivial. That is the focus and motivation of this article.

## 3. Automatic ARIMA modeling-based data aggregation scheme

The data generated by sensor nodes during continuous monitoring periods usually have high temporal correlation. This means that there are redundant data in the successive data sequence, which cause unnecessary data transmission and energy consumption. In this article, we focus only on data transmission reduction and the corresponding energy saving between sensor nodes and aggregators. Furthermore, we assume that a reliable message retransmission mechanism is adopted in the underlying MAC layer to guarantee that the ARIMA model parameters and sensed data values are delivered to the aggregator successfully even after a collision occurs.

The automatic ARIMA modeling-based data aggregation scheme uses an ARIMA model to predict the data of the next several periods at both ordinary sensors and aggregators based on the same set of recently sensed values. The ordinary sensors and aggregators cooperate to reduce the number of messages transmitted within the network.

### 3.1. The ARIMA model

Time series analysis uses historical data to develop a model for the prediction of future data values. The ARIMA model, also called Box–Jenkins model, is a widely used prediction model for univariate time series [19]. An ARIMA process can be divided into three components: auto-regressive (AR), moving-average (MA), and one-step differencing. The AR component estimates the current sample as a linear-weighted sum of previous samples; the MA component captures relationship between prediction errors; and the one-step differencing component captures relationship between adjacent samples. In ARIMA, the AR component captures the temporal correlation in the time series by modeling a future value as a function of a number of past values. The MA component is modeled as a zero-mean, uncorrelated Gaussian random variable (also referred to as white noise) [20].

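An ARIMA($p$, $d$, $q$) model of time series $\{x_1, x_2, \ldots\}$ is defined as

$$\Phi_p(B)\,\Delta^d x_t = \Theta_q(B)\,\varepsilon_t,$$

where $B$ is the backward shift operator ($B x_t = x_{t-1}$), $\Delta = 1 - B$ is the backward difference, $d$ is the order of differencing, $\varepsilon_t$ is the white noise term, and $\Phi_p$ and $\Theta_q$ are polynomials of order $p$ and $q$, respectively.

The ARIMA($p$, $d$, $q$) model is the product of an AR part AR($p$):

$$\Phi_p = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p,$$

an integrating part I($d$) $= \Delta^{-d}$, and an MA part MA($q$):

$$\Theta_q = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q.$$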

The parameters *Φ* and *Θ* are chosen so that the zeros of both polynomials lie outside the unit circle in order to avoid generating unbounded processes.
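As a worked instance of the definition, consider an ARIMA(1, 1, 1) model written in the Box–Jenkins sign convention, with backward shift operator $B$, a single AR coefficient $\phi_1$, and a single MA coefficient $\theta_1$. The symbol $\varepsilon_t$ for the white noise term is an assumption of this example. Expanding the operators gives a one-step recursion that a node could evaluate directly:

```latex
\begin{aligned}
(1 - \phi_1 B)(1 - B)\,x_t &= (1 - \theta_1 B)\,\varepsilon_t \\
\bigl(1 - (1 + \phi_1) B + \phi_1 B^2\bigr)\,x_t &= \varepsilon_t - \theta_1 \varepsilon_{t-1} \\
x_t &= (1 + \phi_1)\,x_{t-1} - \phi_1\,x_{t-2} + \varepsilon_t - \theta_1\,\varepsilon_{t-1}
\end{aligned}
```

The one-step forecast follows by replacing the unknown future noise $\varepsilon_t$ with its mean of zero, so only the two most recent values and the last residual are needed.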

The construction steps of ARIMA model are shown in Figure 1. It includes the following five steps [21].

Step 1: Make the time series stationary.

The noise series being analyzed must be stationary. When the variance of the noise series is non-stationary, the original data must be transformed by differencing to make the series stationary. If the series exhibits a trend over time, seasonality, or some other non-stationary pattern, it should be differenced repeatedly until it becomes stationary.
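The effect of repeated differencing can be illustrated with a minimal sketch. The `difference` helper and the quadratic-trend readings below are hypothetical and only serve to show how each differencing pass lowers the order of a polynomial trend.

```python
def difference(series, order=1):
    """Apply the backward difference x_t - x_{t-1} `order` times."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Hypothetical readings with a quadratic trend: x_t = 0.5 * t^2 + 20.
readings = [0.5 * t * t + 20.0 for t in range(8)]
once = difference(readings, order=1)   # still trending (linear in t)
twice = difference(readings, order=2)  # constant: the trend is removed
print(once)
print(twice)
```

Each pass shortens the series by one value, which is worth keeping in mind when the history window on a sensor node is small.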

Step 2: Identify the model using ACF and PACF.

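The $k$-order autocorrelation coefficient of time series $\{x_1, x_2, \ldots\}$ is defined as

$$\rho_k = \frac{\operatorname{Cov}(x_t, x_{t+k})}{\sqrt{\operatorname{Var}(x_t)}\,\sqrt{\operatorname{Var}(x_{t+k})}} = \frac{\gamma_k}{\gamma_0},$$

where $\gamma_k = \operatorname{Cov}(x_t, x_{t+k})$ is the lag-$k$ autocovariance of the stationary series.

The $k$-order partial autocorrelation coefficient of time series $\{x_1, x_2, \ldots\}$ is defined as follows:

$$\phi_{kk} = \begin{cases} \rho_1, & k = 1, \\ \dfrac{\rho_k - \sum_{j=1}^{k-1} \phi_{k-1,j}\,\rho_{k-j}}{1 - \sum_{j=1}^{k-1} \phi_{k-1,j}\,\rho_j}, & k > 1, \end{cases}$$

with $\phi_{k,j} = \phi_{k-1,j} - \phi_{kk}\,\phi_{k-1,k-j}$ for $j = 1, \ldots, k-1$.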
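The ACF and PACF used in Step 2 can be estimated from a finite window of readings. The sketch below is one common way to do this, assuming the usual sample autocorrelation estimator and the Durbin-Levinson recursion for the partial autocorrelations; the function names are illustrative and not taken from the article.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations rho_0 .. rho_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    gamma0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / gamma0
        for k in range(max_lag + 1)
    ]


def pacf_from_acf(rho):
    """Partial autocorrelations phi_11 .. phi_KK via the Durbin-Levinson recursion.

    `rho` must start with rho_0 = 1, i.e. rho[k] is the lag-k autocorrelation.
    """
    pacf = []
    prev = []  # prev[j] holds phi_{k-1, j+1}
    for k in range(1, len(rho)):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.5: rho_k = 0.5 ** k.
ar1_acf = [0.5 ** k for k in range(4)]
print(pacf_from_acf(ar1_acf))  # the PACF cuts off after lag 1
```

For the AR(1) example the ACF decays geometrically while the PACF is zero beyond lag 1, which is the cutoff pattern that points to a single AR term.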

Step 3: Estimate ARIMA model parameters.

After identifying a possible ARIMA model, we analyze the time series and estimate the model parameters. If the PACF of the differenced series displays a sharp cutoff and the lag-1 autocorrelation is positive, then consider adding one or more AR terms to the model. The lag beyond which the PACF cuts off is the indicated number of AR terms. If the ACF of the differenced series displays a sharp cutoff and the lag-1 autocorrelation is negative, then consider adding an MA term to the model. The lag beyond which the ACF cuts off is the indicated number of MA terms.

Step 4: Diagnose ARIMA residual series.

This step employs a white noise test to check whether the residual series from the model still contains information that a more complex model could use. If it does, that is, if the residual series fails the white noise test, the analysis must be continued by repeating Steps 3 and 4 until an appropriate ARIMA model is found that passes the test.
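One widely used white noise test is the Ljung-Box test, whose statistic is $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^2/(n-k)$. The sketch below computes only the statistic; the caller is assumed to compare it with a chi-square critical value with $h - p - q$ degrees of freedom (for example, about 11.07 at the 5% level for 5 degrees of freedom). The function name and the synthetic residual series are illustrative.

```python
import random


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag of a residual series."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    gamma0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        rho_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / gamma0
        total += rho_k * rho_k / (n - k)
    return n * (n + 2) * total


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]  # residuals with no structure
trend = [0.1 * t for t in range(200)]              # residuals with leftover trend

print("Q for noise:", ljung_box_q(noise, 5))
print("Q for trend:", ljung_box_q(trend, 5))
```

A residual series that still carries a trend produces a Q value far above the critical value, which signals that the model should be refined.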

Step 5: Choose the most suitable ARIMA model.

An ARIMA model with the smallest Akaike Information Criterion (AIC) indicator or Bayesian Information Criterion (BIC) indicator is selected as the most suitable ARIMA model for analysis.

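The AIC and BIC indicators are computed as

$$\mathrm{AIC} = -\frac{2l}{T} + \frac{2k}{T},$$

$$\mathrm{BIC} = -\frac{2l}{T} + \frac{k \ln T}{T},$$

$$l = -\frac{T}{2}\left(1 + \ln(2\pi) + \ln\frac{\widehat{\epsilon}'\widehat{\epsilon}}{T}\right), \tag{11}$$

where $l$ is the log likelihood, $T$ is the number of observations, $k$ is the number of right-hand side regressors, and $\widehat{\epsilon}'\widehat{\epsilon}$ in Equation (11) is the sum of squared residuals.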
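The two indicators can be computed from a residual series with a few lines of code. This sketch assumes the per-observation forms AIC = -2l/T + 2k/T and BIC = -2l/T + k ln(T)/T with a Gaussian log likelihood built from the sum of squared residuals; other software may scale the criteria differently, so only comparisons made with the same definition are meaningful.

```python
import math


def log_likelihood(residuals):
    """Gaussian log likelihood l computed from the sum of squared residuals."""
    T = len(residuals)
    ssr = sum(e * e for e in residuals)
    return -T / 2.0 * (1.0 + math.log(2.0 * math.pi) + math.log(ssr / T))


def aic(residuals, k):
    """Per-observation AIC for a model with k right-hand side regressors."""
    T = len(residuals)
    return -2.0 * log_likelihood(residuals) / T + 2.0 * k / T


def bic(residuals, k):
    """Per-observation BIC for a model with k right-hand side regressors."""
    T = len(residuals)
    return -2.0 * log_likelihood(residuals) / T + k * math.log(T) / T


# Hypothetical residuals of unit magnitude from a model with k = 2 regressors.
residuals = [1.0, -1.0] * 50
print("AIC:", aic(residuals, 2))
print("BIC:", bic(residuals, 2))
```

Because ln(T) exceeds 2 once T is at least 8, the BIC penalty per parameter is larger than the AIC penalty for any realistic history length.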

The power of an ARIMA model lies in its ability to combine the AR term, the integrated term, and the MA term, so that time series with a wide variety of features, such as trend, can be modeled simply by adjusting the parameters of each term.

### 3.2. Data aggregation scheme

**Notations**

Notation | Meaning |
---|---|
$\{x_1, x_2, \ldots, x_n\}$ | Data series |
$\{x_1', x_2', \ldots, x_n'\}$ | Stationary data series |
$I$ | Differencing order |
diff($\{x_1, x_2, \ldots, x_n\}$, $I$) | Execute $I$ order of differencing operation to $\{x_1, x_2, \ldots, x_n\}$ |
variance( ) | Calculate variance |
ϵ | Application-defined stationary threshold |
δ | Application-defined BIC indicator threshold |

The automatic ARIMA modeling algorithm takes as input the data series $\{x_1, x_2, \ldots, x_n\}$. If $\{x_1, x_2, \ldots, x_n\}$ is not stationary, we apply the differencing adjustment to the data series until the difference between successive variances is smaller than the application-defined stationary threshold ϵ. Then, we fit an ARIMA prediction model to the differenced data series $\{x_1', x_2', \ldots, x_n'\}$ using the least squares method. The iteration of the ARIMA model fitting process follows the Box search path, which is shown in Figure 2. It can find an appropriate fitting model using a relatively small number of search steps [22]. When the BIC indicator of an ARIMA model is smaller than the application-defined BIC threshold δ and the corresponding Ljung–Box white noise test of the fitted residuals passes, the iteration of the ARIMA model fitting process stops. In other words, an appropriate ARIMA prediction model has been built. Here, we choose the BIC indicator over the AIC indicator because the BIC indicator is more consistent and penalizes free parameters more strongly than the AIC indicator.
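The stationarity stage of the algorithm can be sketched as a short loop. This is only one reading of the rule "difference until successive variances differ by less than ϵ": the sketch compares the variance at the current differencing order with the variance after one more differencing pass. The `max_order` guard is a hypothetical safety limit that does not appear in the article, and the model fitting and Box search path stages are not covered here.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def diff(series, order):
    """Execute `order` rounds of backward differencing on `series`."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def make_stationary(series, epsilon, max_order=3):
    """Return (I, differenced series) once successive variances differ by < epsilon.

    max_order is a hypothetical guard against over-differencing short windows.
    """
    order = 0
    current = list(series)
    while order < max_order:
        candidate = diff(current, 1)
        if abs(variance(current) - variance(candidate)) < epsilon:
            break  # one more pass would not change the variance noticeably
        current = candidate
        order += 1
    return order, current


# Hypothetical window with a quadratic trend.
window = [t * t for t in range(20)]
order, stationary = make_stationary(window, epsilon=0.5)
print(order, stationary[:5])
```

For the quadratic window the loop stops at differencing order 2, where the series has become constant and further differencing leaves the variance unchanged.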

First of all, the ordinary sensor node runs the automatic ARIMA modeling algorithm to build an appropriate ARIMA prediction model. It then sends the ARIMA model parameters to the aggregator. After that, in each data collection period it calculates the predicted value according to the ARIMA model and compares the sensed value with the predicted value. If the difference between them is less than the predefined error threshold, the sensor node stores the predicted value in the historical data queue. Otherwise, it stores the sensed value in the historical data queue and sends the sensed value to the aggregator at the same time. In this second case, where the predicted value is beyond the fault-tolerant range of the sensed value, the ARIMA model is rebuilt and the corresponding ARIMA model parameters at the aggregator are refreshed.

The aggregator listens on the wireless channel to retrieve ARIMA model parameters and sensed values from the ordinary sensor node. If the aggregator does not receive any data from the sensor node within a predefined periodical data collection time, the difference between the sensed value and the predicted value is within the acceptable error range. The aggregator then calculates the predicted value according to the ARIMA model using the historical data and stores it in the historical data queue. Otherwise, it stores the received sensed value in the historical data queue and prepares to update the ARIMA model parameters. The periodical data collection time should be selected carefully to ensure that it is long enough to deliver a message from the sensor node to the aggregator. Meanwhile, a reliable message retransmission mechanism should be adopted in the underlying MAC layer to guarantee that the sensed value is delivered to the aggregator even after a collision occurs.
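The sensor-side and aggregator-side procedures described above can be simulated together to check that both histories stay identical. The sketch below replaces the ARIMA forecast with a simple linear extrapolation from the last two values, which is only a stand-in predictor; model rebuilding and parameter transmission are omitted, and all names and readings are hypothetical.

```python
def predict(history):
    """Stand-in for the ARIMA forecast: linear extrapolation from the last two values."""
    return 2 * history[-1] - history[-2]


def aggregator_step(history, received):
    """Aggregator side: `received` is None when the collection period expires silently."""
    value = predict(history) if received is None else received
    history.append(value)
    return value


def simulate(readings, threshold):
    """Run sensor and aggregator in lockstep; return (aggregator history, messages sent)."""
    sensor_hist = list(readings[:2])  # both sides start from the same initial values
    agg_hist = list(readings[:2])
    sent = 0
    for sensed in readings[2:]:
        predicted = predict(sensor_hist)
        if abs(sensed - predicted) < threshold:
            sensor_hist.append(predicted)  # keep the prediction, stay silent
            message = None
        else:
            sensor_hist.append(sensed)     # keep the reading and transmit it
            message = sensed
            sent += 1
        aggregator_step(agg_hist, message)
        assert sensor_hist == agg_hist     # both queues must remain synchronized
    return agg_hist, sent


readings = [20.0, 20.5, 21.1, 21.4, 25.0, 25.2, 25.3]
agg_view, sent = simulate(readings, threshold=0.5)
print(agg_view, "messages sent:", sent)
```

Storing the predicted value on the sensor whenever it stays silent is what keeps the two queues identical, so both sides compute the same next forecast without any extra message.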

## 4. Evaluations

### 4.1. Performance comparison

In the automatic ARIMA modeling-based data aggregation scheme, an ordinary sensor node transmits the sensed data value to the aggregator only when the prediction error between the sensed value and the predicted value is beyond the application-specified error threshold. In a data aggregation scheme without data prediction, an ordinary sensor node transmits all sensed data values to the aggregator. We refer to the latter as the native data aggregation scheme in the rest of this article. It is noteworthy that we only consider the problem of data transmission between the ordinary sensor node and the data aggregator. Both schemes can be combined with other data aggregation schemes that deal with data aggregation between the aggregator and the sink.

### 4.2. Performance evaluation

In this section, we evaluate the performance of automatic ARIMA modeling-based data aggregation scheme.

When the predicted value is beyond the fault-tolerant range of the sensed value, the ARIMA model should be rebuilt and the corresponding ARIMA model parameters should be transmitted to the aggregator. Therefore, the cost of rebuilding the ARIMA model is composed of two parts: the computation cost of the ARIMA model and the transmission cost of the ARIMA model parameters. The computation of the ARIMA model is executed in the ordinary sensor node and requires only a small number of search steps [20]. Because the communication cost is often several orders of magnitude higher than the computation cost, the computation cost of the ARIMA model is relatively low. After that, several bytes of ARIMA model parameters are transmitted from the ordinary sensor to the aggregator. Compared with the general data and control message transmission within the network, the cost of transmitting the model parameters is negligible.

The prediction error is defined as $e_t = y_t - p_t$, where $y_t$ is the sensed value and $p_t$ is the predicted value. The influence of the error threshold and the historical data length on the mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) is shown in Figures 11, 12, and 13, respectively. We can see from the figures that prediction accuracy decreases as the predefined error threshold increases and increases as the historical data length increases. The reason is that a larger error threshold implies a wider error tolerance range, which results in lower prediction accuracy, whereas a larger historical data length implies a more precise prediction model, which results in higher prediction accuracy. Hence, we should adopt a small error threshold and a large historical data length in order to improve the prediction accuracy of the proposed scheme.
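The three accuracy measures can be computed from paired sensed and predicted values as follows. The sketch assumes the standard definitions (mean of squared errors, mean of absolute errors, and mean of absolute errors relative to the sensed value, in percent), since the exact formulas are not reproduced in this text; the sample values are hypothetical.

```python
def error_metrics(sensed, predicted):
    """Return (MSE, MAE, MAPE in percent) for paired sensed and predicted values."""
    errors = [y - p for y, p in zip(sensed, predicted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when a sensed value is zero; this sketch assumes none are.
    mape = 100.0 * sum(abs(e / y) for e, y in zip(errors, sensed)) / n
    return mse, mae, mape


sensed = [10.0, 20.0]
predicted = [9.0, 22.0]
print(error_metrics(sensed, predicted))
```

MSE penalizes large deviations more heavily than MAE, while MAPE makes errors comparable across readings of different magnitude.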

## 5. Conclusion

We have introduced an automatic ARIMA modeling-based data aggregation scheme in this article. Our motivation is to suppress unnecessary data transmissions between ordinary sensors and the aggregator by means of data prediction. We first presented the ARIMA prediction model and then described how it can be built and applied in a data aggregation scheme to decrease the number of messages transmitted within the network. Our simulation and analysis indicate that the predicted values of the proposed scheme fit the real sensed values very well and that fewer messages need to be transmitted between the sensor node and the aggregator. The relationships between scheme performance and scheme parameters are also discussed in this article.

In future work, we would like to improve the proposed data aggregation scheme by exploiting spatial and temporal data correlation together. Furthermore, we would like to implement the automatic ARIMA modeling-based data aggregation scheme on a WSN testbed and evaluate its performance.

## Declarations

### Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments and suggestions on improving the presentation of this study. We also would like to thank the TAO Project Office of NOAA/PMEL for allowing us to use the TAO data. The study was supported by the Fundamental Research Funds for the Central Universities of China (Program no. N100323001 and N120423005), the Natural Science Foundation of Hebei Province, China (Grant no. F2012501014), the Scientific Research Foundation of the Higher Education Institutions of Hebei Province, China (Grant no. Z2010215), the Research Fund for the Doctoral Program of Higher Education of China (Grant no. 20120042120009), and the Science and Technology Research and Development Project of Qinhuangdao (Grant no. 2012021A029).

## References

1. Li G, He J, Fu Y: Group-based intrusion detection system in wireless sensor networks. *Comput. Commun.* 2008, 31(18):4324-4332. 10.1016/j.comcom.2008.06.020
2. Yick J, Mukherjee B, Ghosal D: Wireless sensor network survey. *Comput. Netw.* 2008, 52(12):2292-2330. 10.1016/j.comnet.2008.04.002
3. *Micaz*. Andover: US Memsic; 2011. http://www.memsic.com. Accessed 11 November 2011
4. *TelosB*. Andover: US Memsic; 2011. http://www.memsic.com. Accessed 11 November 2011
5. Meulenaer G, Gosset F, Standaert F, Pereira O: On the energy cost of communication and cryptography in wireless sensor networks. In *Proceedings of IEEE International Conference on Wireless and Mobile Computing*. Avignon, France; 2008:580-585.
6. Lee S, Kim S, Ko D, Kim S, An S: Prediction based mobile data aggregation in wireless sensor network. In *Proceedings of 4th International Conference on Advances in Grid and Pervasive Computing*. Geneva, Switzerland; 2009:328-339.
7. Rajagopalan R, Varshney P: Data-aggregation techniques in sensor networks: a survey. *IEEE Commun. Surv. Tutor.* 2006, 8(4):48-63.
8. Fasolo E, Rossi M, Widmer J, Zorzi M: In-network aggregation techniques for wireless sensor networks: a survey. *IEEE Wirel. Commun.* 2007, 14(2):70-87.
9. Heinzelman W, Chandrakasan A, Balakrishnan H: An application-specific protocol architecture for wireless microsensor networks. *IEEE Trans. Wirel. Commun.* 2002, 1(4):660-670. 10.1109/TWC.2002.804190
10. Lindsey S, Raghavendra C: PEGASIS: power-efficient gathering in sensor information systems. In *Proceedings of IEEE Aerospace Conference*. Montana, USA; 2002:1125-1130.
11. Intanagonwiwat C, Estrin D, Govindan R, Heidemann J: Impact of network density on data aggregation in wireless sensor networks. In *Proceedings of 22nd International Conference on Distributed Computing Systems*. Vienna, Austria; 2002:457-458.
12. Zhang W, Cao G: DCTC: dynamic convoy tree-based collaboration for target tracking in sensor networks. *IEEE Trans. Wirel. Commun.* 2004, 3(5):1689-1701. 10.1109/TWC.2004.833443
13. Ding M, Cheng X, Xue G: Aggregation tree construction in sensor networks. In *Proceedings of IEEE 58th Vehicular Technology Conference*. Orlando, USA; 2003:2168-2172.
14. Xu H, Huang L, Zhang Y, Huang H, Jiang S, Liu G: Energy-efficient cooperative data aggregation for wireless sensor networks. *J. Parallel Distrib. Comput.* 2010, 70(9):953-961. 10.1016/j.jpdc.2010.05.009
15. Villas L, Boukerche A, Oliveira H, Araujo R, Loureiro A: A spatial correlation aware algorithm to perform efficient data collection in wireless sensor networks. *Ad Hoc Netw.* 2013, 11(3):966-983.
16. Yousefi H, Yeganeh M, Alinaghipour N, Movaghar A: Structure-free real-time data aggregation in wireless sensor networks. *Comput. Commun.* 2012, 35(9):1132-1140. 10.1016/j.comcom.2011.11.007
17. Xiang L, Luo J, Vasilakos A: Compressed data aggregation for energy efficient wireless sensor networks. In *Proceedings of 8th IEEE Conference on Sensor, Mesh and Ad Hoc Communications and Networks*. Salt Lake City, USA; 2011:46-54.
18. Xiang L, Luo J, Deng C, Vasilakos A, Lin W: DECA: recovering fields of physical quantities from incomplete sensory data. In *Proceedings of 9th IEEE Conference on Sensor, Mesh and Ad Hoc Communications and Networks*. Seoul, Korea; 2012:182-190.
19. Box G, Jenkins G, Reinsel G: *Time Series Analysis: Forecasting and Control*. 4th edition. NJ: Wiley; 2008:47-92.
20. Li M, Ganesan D, Shenoy P: PRESTO: feedback-driven data management in sensor networks. *IEEE/ACM Trans. Netw.* 2009, 17(4):1256-1269.
21. Li G, Wang Y: A prediction based data aggregation scheme in wireless sensor networks. *Adv. Mater. Res.* 2011, 268–270:517-522.
22. Yang S, Wu Y, Xuan J: *Time Series Analysis in Engineering Application*. 2nd edition. Wuhan: HUST Press; 2007:265-269.
23. *TAO project*. Seattle: US NOAA; 2011. http://www.pmel.noaa.gov/tao. Accessed 11 November 2011

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.