### Data processing

Cellular networks are among the most widely used communication technologies in the world. A cellular communication system is composed of many base stations. Mobile devices that receive a sufficiently strong signal within the coverage area of a base station are connected to the network and can thus be used for communication [10]. The design of the network determines the size of each cell. The size of a micro-cell in an urban environment is generally 300 m, whereas a macro-cell in a rural environment can reach 30 km [11]. Adjacent cells overlap, which allows a continuous connection to the network while the mobile equipment is moving. Groups of adjacent cells form zones, each identified by a location area code (LAC) [11]. Operators keep detailed records of the mobile devices in use; these records are called call detail records (CDRs). CDR data generally include a time stamp, cell number, IMEI (International Mobile Equipment Identity), and time type. This information is highly correlated in both space and time.

The data used in this paper are the CDR data provided by Telecom Italia. We use the traffic data for Milan on November 2, 2013, with a statistical time granularity of 10 min [12]. The data records are mapped onto a geographical grid: the map of the city is divided into a 100 × 100 grid. As shown in Fig. 3, each grid square has a unique square ID and covers an area of 235 m × 235 m. The CDR data format is shown in Fig. 4 [12]. In this paper, we mainly analyze voice traffic data, so for each sampling point, we add the call-in and call-out data to obtain the current traffic.
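The aggregation of call-in and call-out activity can be sketched as follows. This is a minimal illustration, not the processing code used in this paper: the tuple layout `(square_id, time_slot, call_in, call_out)`, the sample values, and the function name `voice_traffic` are assumptions made for the example; the actual record format is the one shown in Fig. 4.

```python
from collections import defaultdict

# Hypothetical record layout: (square_id, time_slot, call_in, call_out).
# The values below are made up for illustration.
records = [
    (5060, 0, 1.5, 0.5),
    (5060, 0, 0.25, 0.75),  # a second record for the same square and slot
    (5060, 1, 2.0, 1.0),
    (5061, 0, 0.5, 0.5),
]


def voice_traffic(rows):
    """Sum call-in and call-out activity per (square_id, time_slot)."""
    traffic = defaultdict(float)
    for square_id, slot, call_in, call_out in rows:
        traffic[(square_id, slot)] += call_in + call_out
    return dict(traffic)


traffic = voice_traffic(records)
```

Each key of `traffic` then identifies one grid square and one 10-min sampling point, and the value is the voice traffic used in the rest of the analysis.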

In the study of communication networks, urban scenarios are often divided into four categories: office, station, entertainment, and residential areas [13,14,15]. To improve the generality of the algorithm, we analyze an entertainment area (Quadrilatero della Moda) and an office area (Politecnico di Milano). In addition, because traffic often increases dramatically where major events occur, we also analyze a major event scenario (San Siro Stadium).

Because our causality analysis examines the causality between the main prediction area and its adjacent grid squares, data extraction is centered on the prediction area, and traffic data are also extracted for the four adjacent regions. As shown in Fig. 5, the red areas are the three selected scenes, and the surrounding blue grid squares are the areas used for joint analysis.
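The selection of the four adjacent regions can be illustrated with a short sketch. It assumes that square IDs run from 1 to 10,000 in row-major order over the 100 × 100 grid; this numbering, the example ID, and the function name `adjacent_squares` are assumptions for illustration and should be checked against the grid definition in Fig. 3.

```python
GRID_SIZE = 100  # the city map is divided into a 100 x 100 grid


def adjacent_squares(square_id, size=GRID_SIZE):
    """Return the IDs of the up/down/left/right neighbours of a square.

    Assumes IDs run from 1 to size*size in row-major order. Neighbours
    that would fall outside the map are skipped.
    """
    row, col = divmod(square_id - 1, size)
    neighbours = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < size and 0 <= c < size:
            neighbours.append(r * size + c + 1)
    return neighbours


# Example with an arbitrary square ID (not one of the paper's scenes).
neighbours_of_centre = adjacent_squares(5060)
```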

Using the grid information, we obtain the corresponding traffic data. Figures 6 and 7 show the traffic changes at the Politecnico di Milano, San Siro Stadium, and nearby grid squares on that day. The abscissa is the sampling time point, with a time granularity of 10 min. Different colors represent different regions; the blue curve corresponds to the central area and therefore shows the traffic trend of the analyzed area.

### Granger causality test

Causality can be defined by the dependence between variables, that is, the variable as the result is determined by the variable as the cause, and the change of the cause variable causes the change of the result variable. Granger points out that if one variable *X* is not helpful in predicting another variable *Y*, then *X* is not the cause of *Y*; on the contrary, if *X* is the cause of *Y*, two conditions must be satisfied [16]:

Firstly, *X* should be helpful in predicting *Y*, that is, in a regression of *Y* on its own past values, adding the past values of *X* as independent variables should significantly increase the explanatory power of the regression.

Secondly, *Y* should not be helpful in predicting *X*. The reason is that if *X* is helpful in predicting *Y* and *Y* is helpful in predicting *X*, there may be one or more other variables, which are both the cause of *X* change and the cause of *Y* change.

This notion of causality, defined from the perspective of prediction, is now generally called Granger causality. Specifically, for the data we extracted, let *X* be the traffic data in the central area, sampled at different time points {*X*_{1}, *X*_{2}, *X*_{3}, …, *X*_{n}}, where *n* is the total number of training set samples. The time series *Y* is the traffic data in one of the adjacent regions, {*Y*_{1}, *Y*_{2}, *Y*_{3}, …, *Y*_{n}}. First, we use the past of *X* to predict the future of *X*. For example, we use *X*_{1}~*X*_{n − j} (the past values of *X*) to predict *X*_{n − j + 1}~*X*_{n} (the future values of *X*). This prediction produces an error *δ*_{1}, which we take as the first result.

Then, we use the past of both *X* and *Y* to predict the future of *X*, that is, we use {*X*_{1}~*X*_{n − j}, *Y*_{1}~*Y*_{n − j}} to predict *X*_{n − j + 1}~*X*_{n}, which produces an error *δ*_{2}. If *δ*_{2} is less than *δ*_{1}, that is, if the prediction error obtained with both *X* and *Y* is less than the prediction error obtained with *X* alone, then *Y* is helpful for predicting *X*, because including it reduces the prediction error. In this case, we say that *Y* is a Granger cause of *X* [17].

Whether one variable is the Granger cause of another is therefore testable, but some processing of the data is needed before testing.

One of the basic conditions of the Granger test is that the series are stationary, so before the causality test, we must first ensure that the time series are stationary. Time series in communication networks are often complex and non-stationary, which requires some pre-processing before the test. The flow chart of the causal analysis module in this paper is shown in Fig. 8 [18].

The first part is de-trending. A sequence with an obvious trend is non-stationary, so we remove the trend from the sequence. The Granger test also requires the data to fluctuate around the horizontal axis, so the mean is removed as well. Taking the data in the stadium area as an example, the result of this processing is shown in Fig. 9.
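A minimal sketch of this step is given below. It assumes that the trend is linear and removes it by least squares; the text does not specify the form of the trend, so this choice, the toy series, and the function name `detrend_and_demean` are assumptions for illustration only.

```python
def detrend_and_demean(series):
    """Remove a least-squares linear trend, then subtract the mean."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    # Slope of the least-squares line through the points (t, series[t]).
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den
    detrended = [y - slope * t for t, y in enumerate(series)]
    # Demeaning centres the series on the horizontal axis.
    mean = sum(detrended) / n
    return [y - mean for y in detrended]


# Toy series: a linear trend plus an alternating +/-1 fluctuation.
raw = [2 * t + 10 + (1 if t % 2 == 0 else -1) for t in range(8)]
clean = detrend_and_demean(raw)
```

After this processing, the values in `clean` sum to zero (up to rounding) and no longer carry a linear trend.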

Subsequently, a unit root test is performed on the time series data. If the test indicates that the time series has a unit root, the sequence is not stationary, and we apply differencing before continuing. Otherwise, we proceed directly to the subsequent analysis, which uses the AIC criterion.
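The differencing applied to a series that fails the unit root test can be sketched as follows. The unit root test itself is not implemented here (in practice it would come from a statistics library); the function name `difference` and the toy values are assumptions for illustration.

```python
def difference(series, order=1):
    """Apply first-order differencing `order` times."""
    out = list(series)
    for _ in range(order):
        # Each pass replaces the series with its successive differences.
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Toy values: the first difference removes the growing level.
first_diff = difference([1, 3, 6, 10])
second_diff = difference([1, 3, 6, 10], order=2)
```

Each pass shortens the series by one sample, which has to be taken into account when aligning the central and adjacent series.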

Our goal is to search over lags in the range 1~*n* for the one that minimizes the AIC; the lag with the minimum AIC is the lag order we use. The last step consists of a normal distribution test and a consistency test [18]. Because normally distributed errors are a prerequisite for solving regression problems by the least squares method, the purpose of the normal distribution test is to check whether the residuals after regression follow a normal distribution. If they do not, the data do not satisfy the precondition for using the least squares method, which is also the basis of the Granger causality test.
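The lag search can be sketched as follows. The example fits autoregressive models of increasing order by ordinary least squares and uses the Gaussian form of the criterion, AIC = *n* ln(RSS/*n*) + 2*k*, where *k* is the number of fitted coefficients. The AIC variant, the simulated series, and the names `ols_rss` and `select_lag` are assumptions for illustration; the paper does not state which implementation was used.

```python
import math
import random


def ols_rss(rows, targets):
    """Least-squares fit via the normal equations; returns the residual sum of squares."""
    k = len(rows[0])
    # Augmented normal-equation matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * p for x, p in zip(a[row], a[col])]
    coef = [a[i][k] / a[i][i] for i in range(k)]
    return sum((y - sum(c * x for c, x in zip(coef, r))) ** 2
               for r, y in zip(rows, targets))


def select_lag(series, max_lag):
    """Return the autoregressive lag in 1..max_lag with the smallest AIC."""
    best_lag, best_aic = None, math.inf
    targets = series[max_lag:]  # same targets for every lag, so AICs are comparable
    n = len(targets)
    for lag in range(1, max_lag + 1):
        rows = [[1.0] + [series[t - i] for i in range(1, lag + 1)]
                for t in range(max_lag, len(series))]
        rss = ols_rss(rows, targets)
        aic = n * math.log(rss / n) + 2 * (lag + 1)
        if aic < best_aic:
            best_lag, best_aic = lag, aic
    return best_lag


# Simulated second-order autoregressive series (not traffic data).
random.seed(0)
series = [0.0, 0.0]
for _ in range(300):
    series.append(0.5 * series[-1] - 0.7 * series[-2] + random.gauss(0.0, 1.0))
lag = select_lag(series, max_lag=4)
```

Because the simulated series depends strongly on its second lag, the selected lag should be at least 2; on real traffic data, the selected order depends on the series.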

As for the consistency test, when the data points of a time series are regressed, it is not known in advance whether the theoretical values obtained by regression and the actual values come from the same distribution, so a consistency test is applied. If the consistency test shows that the gap between the theoretical and actual values is small, the regression results are good.

This completes the processing steps that precede the causality test. With these processing and analysis steps, the Granger causality test, which places fairly strict requirements on the data, can be applied to the complex time series of an urban communication network. The causality test itself proceeds step by step [19].

Step 1: Test the null hypothesis that *X* is not the Granger cause of *Y*. First, we estimate the following two regression models:

$$ {Y}_t={\alpha}_0+\sum \limits_{i=1}^p{\alpha}_i{Y}_{t-i}+\sum \limits_{i=1}^q{\beta}_i{X}_{t-i}+{\varepsilon}_t $$

(1)

$$ {Y}_t={\alpha}_0+\sum \limits_{i=1}^p{\alpha}_i{Y}_{t-i}+{\varepsilon}_t $$

(2)

Here, *α*_{0} denotes the constant term, *p* and *q* are the maximum lags of *Y* and *X*, respectively, and *ε*_{t} is white noise. The *F*-statistic is then constructed from the residual sums of squares of the two regression models: RSS_{u} for the unrestricted model (Eq. 1) and RSS_{r} for the restricted model (Eq. 2).

$$ F=\frac{\left({\mathrm{RSS}}_r-{\mathrm{RSS}}_u\right)/q}{{\mathrm{RSS}}_u/\left(n-p-q-1\right)}\sim F\left(q,n-p-q-1\right) $$

(3)

Here, *n* is the number of samples. The null hypothesis can be tested with Eq. 3. If *F* ≥ *F*_{α}(*q*, *n* − *p* − *q* − 1), then *β*_{1}, *β*_{2}, …, *β*_{q} are jointly significantly different from 0, and we reject the hypothesis that *X* is not the Granger cause of *Y*; otherwise, we cannot reject this hypothesis.
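As a worked illustration of Eq. 3, the statistic and its degrees of freedom can be computed from the two residual sums of squares as below. The function name `granger_f` and the numbers are made up for the example and are not results from this paper; the critical value *F*_{α} would be taken from an *F*-distribution table or a statistics library.

```python
def granger_f(rss_r, rss_u, n, p, q):
    """Return the F-statistic of Eq. 3 and its degrees of freedom."""
    df_num = q
    df_den = n - p - q - 1
    f_value = ((rss_r - rss_u) / df_num) / (rss_u / df_den)
    return f_value, (df_num, df_den)


# Illustrative numbers only: restricted and unrestricted residual sums
# of squares for n = 108 samples with p = q = 2 lags.
f_value, dof = granger_f(rss_r=120.0, rss_u=100.0, n=108, p=2, q=2)
```

The resulting `f_value` is compared with the critical value for `dof` at the chosen significance level to decide whether to reject the hypothesis.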

Step 2: Exchange the positions of *Y* and *X*, and test the null hypothesis “*Y* is not the Granger cause of *X*” in the same way.

Step 3: To reach the conclusion that “*X* is the Granger cause of *Y*,” we must reject the null hypothesis that “*X* is not the Granger cause of *Y*” and fail to reject the null hypothesis that “*Y* is not the Granger cause of *X*.”
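The decision rule of the three steps can be summarized in a few lines. The function name `granger_direction` and the wording of the returned labels are assumptions for illustration; the two Boolean inputs are the outcomes of the tests in Step 1 and Step 2.

```python
def granger_direction(reject_x_not_cause_y, reject_y_not_cause_x):
    """Combine the outcomes of Step 1 and Step 2 into a conclusion."""
    if reject_x_not_cause_y and not reject_y_not_cause_x:
        return "X Granger-causes Y"
    if reject_y_not_cause_x and not reject_x_not_cause_y:
        return "Y Granger-causes X"
    if reject_x_not_cause_y and reject_y_not_cause_x:
        # Both directions are significant: a common driver may exist.
        return "bidirectional"
    return "no Granger causality detected"


conclusion = granger_direction(True, False)
```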

We use this method to analyze the traffic sequence of each region. In this paper, the data are divided into a training set and a test set: the traffic data from 0:00 to 18:00 are the training data, and the data from 18:00 to 24:00 are the test data. According to the spatial distribution, multiple time series are obtained. Causality testing is carried out among these time series to obtain the causality between the traffic in the adjacent areas and the traffic changes in the central area. This is equivalent to finding the “cause” of the traffic change in the main prediction area; the traffic data of the main prediction area are then combined with these causal data to analyze and forecast the future traffic of the main area.
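With a granularity of 10 min, one day contains 144 sampling points, so the split at 18:00 gives 108 training samples and 36 test samples. A minimal sketch, with the constant and function names chosen for illustration and placeholder values standing in for traffic:

```python
SLOT_MINUTES = 10
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 144 sampling points per day
SPLIT_SLOT = 18 * 60 // SLOT_MINUTES     # index of the first slot at 18:00


def split_day(series):
    """Split one day of samples into training (0:00-18:00) and test (18:00-24:00)."""
    return series[:SPLIT_SLOT], series[SPLIT_SLOT:]


day = list(range(SLOTS_PER_DAY))  # placeholder values, not real traffic
train_set, test_set = split_day(day)
```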

After causality testing, the causality diagram shown in Fig. 10 is obtained. A causality diagram is a topological diagram of causal relationships [20], in which each vertex represents a different area. The diagram in Fig. 10 is for the San Siro Stadium (a major event scenario). A is the central area, that is, the stadium, and it is also the area whose traffic we ultimately predict. Each arrow represents the influence of the data of one area on another. As can be seen from Fig. 10, the data of regions B and D affect the data of the central area A. Therefore, the data from these three regions (A, B, and D) are used for the subsequent multivariate time series prediction, which is described in detail in the next section.

### LSTM prediction algorithms for multivariate time series

The causal analysis yields multivariate data. Traditional linear models have difficulty with multivariate or multi-input problems, whereas neural networks such as LSTM handle multiple input variables well, which makes them suitable for time series prediction. We therefore use a multivariate LSTM algorithm for prediction.

LSTM is a special form of recurrent neural network (RNN) that is widely used in time series analysis, especially multivariate time series analysis. The difference between LSTM and a standard RNN is that every neuron in LSTM is a memory cell, which stores previous information in the current neuron. As shown in Fig. 11, each neuron contains three gates: the input gate, the output gate, and the forget gate. Through these internal gates, the long-term dependency problem can be solved [9, 21].

The yellow part of the figure is the forget gate (Eq. 4). The first step in LSTM is to decide what information to discard from the cell state. This decision is made by a layer called the forget gate. The gate reads *h*_{t − 1} and *x*_{t} and outputs a value between 0 and 1 for each number in the cell state, where 1 means “keep completely” and 0 means “discard completely”. *σ* denotes the sigmoid function (Eq. 5).

$$ {f}_t=\sigma \left({W}_f\cdot \left[{h}_{t-1},{x}_t\right]+{b}_f\right) $$

(4)

$$ \sigma (z)=\frac{1}{1+{e}^{-z}} $$

(5)

The green part is the input gate, which determines how much new information is added to the cell state. This is done in two steps: first, a sigmoid layer called the “input gate layer” determines which information needs to be updated (Eq. 6); second, a tanh layer generates a vector of candidate values for the update (Eq. 7). The cell state is then updated by combining these two parts with the forget gate output: *C*_{t} = *f*_{t} ∗ *C*_{t − 1} + *i*_{t} ∗ *C̃*_{t}.

$$ {i}_t=\sigma \left({W}_i\cdot \left[{h}_{t-1},{x}_t\right]+{b}_i\right) $$

(6)

$$ {\tilde{C}}_t=\tanh \left({W}_C\cdot \left[{h}_{t-1},{x}_t\right]+{b}_C\right) $$

(7)

The red part is the output gate. First, we run a sigmoid layer to determine which parts of the cell state will be output (Eq. 8). Next, we pass the cell state through tanh (to get a value between − 1 and 1) and multiply it by the output of the sigmoid gate (Eq. 9). In this way, we output only the parts selected by the gate [22].

$$ {o}_t=\sigma \left({W}_o\cdot \left[{h}_{t-1},{x}_t\right]+{b}_o\right) $$

(8)

$$ {h}_t={o}_t\ast \tanh \left({C}_t\right) $$

(9)
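The gate equations can be put together into a single time step, sketched below in plain Python. It follows Eqs. 4 and 6 to 9 and uses the standard LSTM cell-state update *C*_{t} = *f*_{t} ∗ *C*_{t − 1} + *i*_{t} ∗ *C̃*_{t}; the parameter dictionary, the helper names, and the all-zero toy weights are assumptions for illustration and do not represent a trained model.

```python
import math


def sigmoid(z):
    """Sigmoid function of Eq. 5."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Compute W . vec + b for a matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    concat = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], concat)]      # Eq. 4
    i_t = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], concat)]      # Eq. 6
    c_cand = [math.tanh(z) for z in affine(params["W_C"], params["b_C"], concat)]  # Eq. 7
    # Standard cell-state update: keep part of the old state, add new candidates.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_cand)]
    o_t = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], concat)]      # Eq. 8
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                            # Eq. 9
    return h_t, c_t


# Toy setting: one hidden unit, three input features, all-zero parameters.
zero_row = [[0.0, 0.0, 0.0, 0.0]]
params = {name: zero_row for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({name: [0.0] for name in ("b_f", "b_i", "b_C", "b_o")})
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0], [2.0], params)
```

With all-zero parameters, every gate outputs 0.5 and the candidate value is 0, so the cell state is halved from 2.0 to 1.0 and the hidden state is 0.5 · tanh(1.0).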

We fit our LSTM model with Keras, a deep learning library [23]. Take the stadium scene data as an example. The first hidden layer is an LSTM layer with 50 neurons, and the output layer has one neuron that predicts the traffic. The input has a time step of 1 and three features, that is, three input variables: the data of the central region and the data of the two regions that, according to the causal analysis, cause the changes in the central region. The output variable is the predicted traffic volume in the central area. The data are divided into a training set and a test set as mentioned above.
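The input layout described above can be sketched without the deep learning library. The example arranges three series into samples with one time step and three features, and counts the trainable parameters of a standard LSTM layer with 50 units followed by a single output neuron. The one-step-ahead pairing of inputs and targets, the toy values, and the function names are assumptions for illustration; the paper does not give its exact framing.

```python
def to_lstm_input(central, cause_1, cause_2):
    """Arrange three series as [samples][time steps = 1][features = 3].

    Assumes one-step-ahead prediction: the features at time t are paired
    with the central-area value at time t + 1.
    """
    inputs = [[[a, b, c]]
              for a, b, c in zip(central[:-1], cause_1[:-1], cause_2[:-1])]
    targets = central[1:]
    return inputs, targets


def lstm_param_count(units, features):
    """Weights and biases of the three gates and the candidate layer."""
    return 4 * (units * (units + features) + units)


def dense_param_count(n_inputs, n_outputs):
    """Weights and biases of a fully connected output layer."""
    return n_inputs * n_outputs + n_outputs


# Toy series standing in for regions A, B, and D.
inputs, targets = to_lstm_input([1, 2, 3, 4], [10, 20, 30, 40], [5, 6, 7, 8])
n_lstm = lstm_param_count(units=50, features=3)
n_dense = dense_param_count(50, 1)
```

Under these assumptions, the LSTM layer has 10,800 parameters and the output layer has 51, which gives a sense of the model size relative to the 108 training samples.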