To aid understanding of the following sections, Section 5.1 gives a brief introduction to reinforcement learning and the deep Q network. It is far from a comprehensive survey; its objective is only to provide the necessary background for readers.

### Reinforcement learning and deep Q network

The RL process is usually formulated as a Markov decision process (MDP), in which time is divided into a series of time steps *t* = 1, 2, …. An RL problem involves four variables: the state space *S*, the action space *A*, the state transition probability matrix *P*, and the rewards *R*. The objective of RL is to learn the optimal action policy from the immediate rewards fed back by the environment, through exploration and exploitation. A state is a description of the environment as perceived by the RL agent, and the state space is the set of all states. The action space denotes the set of possible actions the agent can choose from at each time step *t*.

The mapping from the state space to the action space is defined as a strategy *π*. Specifically, at any time step *t*, the agent observes the current state *s*_{t} = *s*∈*S* and takes an action *a*_{t} = *a*∈*A* according to the strategy *π*. The agent then receives feedback from the environment in the form of an immediate reward. At the same time, the environment transitions from *s*_{t} to a new state *s*_{t+1} = *s'*∈*S* according to *P*_{ss'}(*a*)∈*P*. The agent repeats this process by trial and error and ultimately learns the optimal strategy with the objective of maximizing the accumulated reward. It is worth mentioning that neither the state transition nor the feedback from the environment is controlled by the agent during the learning phase. The agent affects the environment only through its choice of actions and perceives the environment further through the reward it receives.
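The agent-environment loop described above can be sketched with a toy MDP; the two-state transition and reward tables below are purely illustrative and not part of any model in this paper:

```python
import random

# Toy MDP: 2 states, 2 actions; illustrative transition/reward tables.
P = {  # P[s][a] -> list of (next_state, probability)
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 0.8), (0, 0.2)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}  # immediate rewards R[s][a]

def step(s, a, rng):
    """Environment side of the loop: the agent does not control this."""
    states, probs = zip(*P[s][a])
    s_next = rng.choices(states, weights=probs)[0]
    return s_next, R[s][a]

def rollout(policy, s0=0, horizon=10, seed=0):
    """Agent side: observe s_t, act a_t = pi(s_t), receive reward r_t."""
    rng = random.Random(seed)
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a, rng)
        total += r
    return total

always_1 = lambda s: 1  # a fixed strategy pi
print(rollout(always_1))
```

With action 1 every reward is 1.0 or 2.0, so a 10-step rollout accumulates between 10 and 20; a learning agent would adjust `policy` to maximize this return.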

Deep reinforcement learning (DRL) can be regarded as an enhanced version of RL, and it has attracted considerable attention from academia and industry since AlphaGo by the DeepMind team [31]. Compared with traditional RL, DRL has clear advantages in decision-making for high-dimensional state problems, and it is applied to complicated control tasks. Action-value-function-based DRL and the deep deterministic policy gradient (DDPG) are two basic DRL methods [37]. In the former, a deep neural network (DNN) is used to approximate the action value function; in DDPG, the DNN approximates the policy and is trained via the policy gradient to learn the optimal strategy. The distinction is that the former is suitable for discrete problems, while the latter is widely used for continuous problems.

The objective of value-based RL is to estimate the action value function and greedily select the state-action pair with the maximal Q value while maintaining the Q table; TD-learning and Q-learning are classical algorithms based on the state value function and the action value function, respectively. DQN integrates Q-learning with a deep neural network and is a value iteration algorithm based on the Q network.

Nevertheless, because there are so many state-action pairs in practice, tabular algorithms (e.g., Q-learning) may suffer from the curse of dimensionality, along with higher memory occupation and computation. Moreover, sparse samples may lead to instability and slow convergence, or even no convergence, of the algorithm. Therefore, three countermeasures are taken to improve DQN performance. First, to solve the problem of dimension explosion, a deep convolutional neural network (CNN) is used to estimate the action value, integrating the DNN with Q-learning. Second, to break the temporal correlations among samples, DQN exploits the experience replay method. Specifically, at each time step *t*, the *i*th transition *e*^{(i)} = (*s*^{(i)}, *a*^{(i)}, *r*^{(i)}, *s'*^{(i)}) is stored into the experience memory \( \mathcal{D}=\left\{{e}_1,\dots, {e}_{\mathcal{M}}\right\} \). In the training phase, stochastic gradient descent is employed to update the neural network parameters on the basis of a minibatch of transitions drawn from \( \mathcal{D} \). Finally, a target network is applied to predict the target Q value and improve training stability. The target network is an earlier snapshot of the Q network, so it has a neural structure identical to that of the Q network.
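The experience replay mechanism can be sketched as follows; the capacity and batch size are arbitrary illustrative values:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity experience memory D = {e_1, ..., e_M}."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        """Uniform minibatch draw; breaks temporal correlation among samples."""
        return rng.sample(self.buffer, min(batch_size, len(self.buffer)))

memory = ReplayMemory(capacity=100)
for t in range(150):               # more transitions than the capacity
    memory.store(t, t % 4, 1.0, t + 1)

batch = memory.sample(32)
print(len(memory.buffer), len(batch))  # -> 100 32
```

A bounded `deque` keeps only the most recent transitions, which is the usual way the memory \( \mathcal{D} \) is implemented.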

In the training process, the target value of the Q network, \( r+\gamma {\max}_{a'}Q\left(s',a';{\theta}_i^{-}\right) \), is produced by the target network, where *s'* is the next state, *a'* is a possible action, and \( {\theta}_i^{-} \) are the target network parameters at iteration *i*. The parameters \( {\theta}_i^{-} \) are synchronized with *θ*_{i} every *ℏ* iterations and remain unchanged otherwise.

The Q network is updated with the goal of minimizing the loss function *L*_{i}(*θ*_{i}) at each iteration *i* [38]:

$$ {L}_i\left({\theta}_i\right)={E}_{s,a\sim \rho \left(\cdot \right)}\left[{\left(r+\gamma \max_{a'}Q\left(s',a';{\theta}_i^{-}\right)-Q\left(s,a;{\theta}_i\right)\right)}^2\right], $$

(14)

where *Q*(*s*, *a*; *θ*_{i}) is the value produced by the Q network, and *ρ*(*s*, *a*) is the probability distribution over state-action pairs (*s*, *a*).
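As a minimal illustration of the loss in Eq. (14), the sketch below uses small tabular arrays as stand-ins for the Q network and the frozen target network; the table sizes and the sample minibatch are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Tabular stand-ins for the Q network and the (frozen) target network.
Q = rng.random((n_states, n_actions))
Q_target = Q.copy()  # earlier snapshot: identical structure, frozen values

def td_loss(batch):
    """Mean squared TD error of Eq. (14) over a minibatch of transitions."""
    errs = []
    for s, a, r, s_next in batch:
        y = r + gamma * Q_target[s_next].max()  # target from the target network
        errs.append((y - Q[s, a]) ** 2)
    return float(np.mean(errs))

batch = [(0, 1, 1.0, 2), (3, 0, 0.5, 4)]
loss = td_loss(batch)
print(loss)
```

In the actual DQN, `Q` would be a neural network and the loss would be minimized by stochastic gradient descent over such minibatches.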

Then, taking the derivative of the loss function with respect to *θ*_{i}, the corresponding gradient can be written as

$$ {\nabla}_{\theta_i}{L}_i\left({\theta}_i\right)={E}_{s,a\sim \rho \left(\cdot \right)}\left[\left(r+\gamma \max_{a'}Q\left(s',a';{\theta}_i^{-}\right)-Q\left(s,a;{\theta}_i\right)\right){\nabla}_{\theta_i}Q\left(s,a;{\theta}_i\right)\right], $$

(15)

The optimal action value function satisfies the Bellman optimality equation:

$$ {Q}^{\ast}\left(s,a\right)=E\left[r+\gamma \max_{a'}{Q}^{\ast}\left(s',a'\right)\mid s,a\right], $$

(16)

The optimal strategy is then obtained by acting greedily with respect to \( {Q}^{\ast } \):

$$ {\pi}^{\ast }(s)=\underset{a\in A}{\operatorname{argmax}}\ {Q}^{\ast}\left(s,a\right). $$

(17)
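Eq. (17) amounts to a greedy lookup over the optimal action values; a minimal sketch with a hypothetical \( Q^{\ast } \) table:

```python
import numpy as np

Q_star = np.array([  # hypothetical Q*(s, a): rows = states, columns = actions
    [0.2, 0.8, 0.5],
    [0.9, 0.1, 0.3],
])

def greedy_policy(s):
    """Eq. (17): pi*(s) = argmax over a of Q*(s, a)."""
    return int(np.argmax(Q_star[s]))

print(greedy_policy(0), greedy_policy(1))  # -> 1 0
```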

### Problem formulation

To achieve fast rerouting for services after large-scale failures, this section models the rerouting recovery problem as a model-free strategy learning process. Figure 5 shows the schematic DQN execution process under the SDN communication architecture. Here, the SDN controller can be regarded as the agent, and the data plane consists of the switches deployed in substations and the control center. The upper control plane is composed of traffic engineering databases (TED) and path computation elements (PCE). The TED module is responsible for the network topology and network connections, and the PCE module supports centralized routing computation. For the multi-service rerouting problem in the large-scale failure scenario, it is extremely important to precisely define three elements: the environment and state, the action, and the corresponding reward. We define them as follows.

#### Environment and state

In DRL, the agent is responsible for intelligent decision-making and policy deployment. As can be seen from Fig. 5, in the SDN-enabled SGCN, the SDN controller, with its global view of the network topology, is able to collect networking parameters and service information and to perform path computation, routing deployment, traffic management, etc. It is therefore regarded as the agent of deep reinforcement learning. The observed environment includes the current network topology and the services interrupted by the damaged nodes or links.

In addition, in light of the low-latency requirements of services in the SGCN, each path must satisfy the service latency requirement; otherwise, considering path survivability and site difference level is meaningless for rerouting. Hence, to reduce computational complexity and guarantee service performance, our solution is to find all feasible paths for services using the depth-first search (DFS) algorithm and thereby construct the available rerouting set. After that, a DNN is introduced to learn the optimal path combination for all services under the DRL framework. The state is therefore defined as follows:

$$ {s}_t=\left\{\left[{B}_1(t),{I}_1(t)\right],\left[{B}_2(t),{I}_2(t)\right],\dots, \left[{B}_k(t),{I}_k(t)\right]\right\}. $$

(18)

where *B*_{k}(*t*) and *I*_{k}(*t*) represent the bandwidth requirement and the index of the path in the available rerouting set, respectively. To improve learning efficiency and obtain a better state representation, the bandwidth requirement should be normalized in advance in practice.
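A minimal sketch of building the state of Eq. (18), assuming max-normalization of the bandwidth demands; the demand values and path indices below are hypothetical:

```python
def build_state(demands, path_indices):
    """State s_t of Eq. (18): normalized bandwidth demand paired with the
    candidate-path index for each interrupted service b_k."""
    b_max = max(demands)  # assumed normalization: divide by the largest demand
    return [(b / b_max, i) for b, i in zip(demands, path_indices)]

# Hypothetical demands (Mbps) and chosen indices into the rerouting set.
state = build_state([10.0, 40.0, 20.0], [2, 0, 1])
print(state)  # -> [(0.25, 2), (1.0, 0), (0.5, 1)]
```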

In fact, in large-scale failures such as those caused by earthquakes, many nodes or links are damaged and many services are subsequently interrupted, which leads to a large state space in DRL. However, the candidate rerouting sets for services are enumerable, so in this paper the improved deep Q network (I-DQN) rerouting algorithm is designed on the basis of the action value function.
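The DFS-based construction of the latency-feasible rerouting set mentioned above can be sketched as follows; the four-node topology and link latencies are hypothetical:

```python
def feasible_paths(graph, latency, src, dst, max_latency):
    """Depth-first enumeration of all src->dst paths whose end-to-end
    latency satisfies the service requirement; infeasible branches are pruned."""
    paths = []

    def dfs(node, visited, path, acc):
        if acc > max_latency:
            return  # prune: latency requirement already violated
        if node == dst:
            paths.append(list(path))
            return
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                dfs(nxt, visited, path, acc + latency[(node, nxt)])
                path.pop()
                visited.remove(nxt)

    dfs(src, {src}, [src], 0.0)
    return paths

# Hypothetical 4-node substation topology (link latencies in ms).
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
latency = {("A", "B"): 2, ("A", "C"): 5, ("B", "D"): 2, ("C", "D"): 1}
print(feasible_paths(graph, latency, "A", "D", max_latency=5))
```

Here the path A-B-D (4 ms) is kept, while A-C-D (6 ms) is pruned as soon as its accumulated latency exceeds the 5 ms requirement.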

#### Action

For the concurrent rerouting of all interrupted services, the action space comprises all path combinations under the given objectives; the problem is therefore discrete. If the action is designed to sequentially select paths from the available set, the action space has size *u*^{k}, where *u* is the number of paths for *b*_{k} in the available rerouting set. Note that the action space size grows exponentially with the number of services. Since a large number of services are interrupted in large-scale failures, the action space is correspondingly huge.

The path combinations are sorted in ascending order of latency. Let *l* be the index of a randomly initialized path combination. The action space is then divided by *l* into two parts: the upper part \( {l}_p^{\prime }=\left\{{a}_p|{a}_p\in A,0\le {a}_p<l,{\tau}_a^p>{\tau}^{\ast}\right\} \) and the lower part \( {l}_b^{\prime }=\left\{{a}_b|{a}_b\in A,|{u}^k|\ge {a}_b>l,{\tau}_a^b<{\tau}^{\ast}\right\} \), with \( A={l}_p^{\prime}\cup \left\{l\right\}\cup {l}_b^{\prime } \), where *τ*^{∗} is the average end-to-end latency obtained from historical data, and \( {\tau}_a^p \) and \( {\tau}_a^b \) are the average latencies of the paths *a*_{p} and *a*_{b}, respectively. In particular, if *l* = 0, then *a*_{p} = *l*, and if *l* = max(|*u*^{k}|), then *a*_{b} = *l*. This design compares the current average path latency with the historical average. If the two indicators are equal, *l* remains unchanged. If the average communication latency of the present state is larger than *τ*^{∗}, paths with smaller end-to-end latency should be selected from the upper part \( {l}_p^{\prime } \); otherwise, a path is selected from the lower part \( {l}_b^{\prime } \) to avoid getting stuck in a local optimum. The set of candidate actions at time step *t* is *a*_{t} ∈ {*a*_{p}, *a*_{b}, *l*} ⊆ *A*. Compared with the initial action space, the newly generated action space is reduced enormously.
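This pivot-based selection rule can be sketched as follows; the sketch is a simplified reading of the partition above, and the latency values are made up:

```python
import random

def next_action(latencies, l, tau_star, rng):
    """Pivot-based candidate selection sketch: if the current combination's
    latency exceeds the historical average tau*, draw from the lower-latency
    side of the sorted action space; otherwise draw from the higher-latency
    side to avoid getting stuck in a local optimum."""
    if latencies[l] == tau_star or l == 0 or l == len(latencies) - 1:
        return l  # boundary / equal-latency cases keep the pivot
    if latencies[l] > tau_star:
        return rng.randrange(0, l)                  # lower-latency candidates
    return rng.randrange(l + 1, len(latencies))     # higher-latency candidates

rng = random.Random(1)
lat = [1.0, 2.0, 3.0, 4.0, 5.0]  # combination latencies, ascending order
a = next_action(lat, l=3, tau_star=3.5, rng=rng)
print(0 <= a < 3)  # latencies[3] = 4.0 > 3.5, so a is drawn below the pivot
```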

#### Rewards

DQN trains the neural network under the guidance of the reward: the agent obtains an immediate reward from the environment when choosing *a*_{t} in state *s*_{t}. As the goal of DQN is to maximize the accumulated reward, our objective is to simultaneously maximize path survivability and minimize the site difference level in the rerouting mechanism of the SGCN. We therefore define the reward function obtained by the DRL agent as follows:

where \( {\mathcal{S}}_{p_k} \) and \( {\omega}_{p_k} \) denote the path survivability and the site difference level of *p*_{k}, respectively. \( {\mathcal{S}}_{p_k} \) ranges from 0 to 1, and the site difference level is a dimensionless non-negative integer. ℓ is a coefficient that balances the two variables. Given the different scales of these two indicators, we normalize them to a unified range in advance to ensure learning efficiency.

### Improved DQN based on prioritized resampling

Which transitions to store and which experiences to replay are key issues in DQN and its various improved versions [19, 39]; our research focuses on the latter. The temporal-difference (TD) error is a basic concept in RL, referring to the difference between the target value function and the current value function, where the target value is the sum of the immediate reward and the value of the next state. In [40], Mnih et al. proposed the natural DQN method, which does not consider the disparity among samples and randomly selects samples from the replay memory to update the neural network parameters. In fact, samples with different magnitudes of TD error have different backpropagation impacts [41]: samples with a higher absolute TD error have a greater impact due to their larger loss, and should therefore be replayed more often, and vice versa. The SARSA and Q-learning algorithms already compute the absolute TD error ∣*δ*∣, so samples can be stored with a probability given by the normalized ∣*δ*∣:

$$ P(j)=\frac{p_j}{\sum_m{p}_m}. $$

(20)

where *p*_{j} = ∣*δ*_{j}∣. This prioritized resampling mechanism ensures that samples with a higher TD-error magnitude are retained while those with a lower magnitude are erased, which further prevents model degradation. Nevertheless, it also results in limited samples and insufficient training. To enhance utilization efficiency and improve sample diversity, the authors of [39] designed a prioritized replay sampling mechanism in which the sampling probability is proportional to the sample's storage priority. The storage priority is determined by |*δ*| derived from the last trained sample, which avoids possible bias in the update process. However, it incurs extra time complexity owing to its priority-based binary heap storage structure.

To maintain DQN efficiency while updating higher-priority samples with higher probability during the training phase, the coefficient *α* and the bias *β* are introduced into the calculation of the sampling probability. The sampling probability of learning experience *j* is defined as follows:

$$ P(j)=\alpha \times \frac{p_j^{\sigma }}{\underset{m}{\max}\left({p}_m^{\sigma}\right)}+\beta . $$

(21)

To obtain *p*_{j}, a method proportional to the magnitude of the TD error is adopted, that is, *p*_{j} = ∣*δ*_{j}∣ + *ϵ*, where *ϵ* is a very small positive constant whose role is to prevent the sampling weight from approaching 0 when the error is 0. The exponent *σ* ∈ [0, 1] describes the degree to which priority is exploited; in particular, *σ* = 0 corresponds to uniform sampling. Compared with the previous random sampling, this resampling method further improves sample diversity.
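Eq. (21) together with *p*_{j} = ∣*δ*_{j}∣ + *ϵ* can be sketched directly; the values of *α*, *β*, and *σ* below are illustrative, not taken from the paper:

```python
def sampling_probs(td_errors, alpha=0.9, beta=0.01, sigma=0.6, eps=1e-6):
    """Eq. (21): P(j) = alpha * p_j^sigma / max_m(p_m^sigma) + beta,
    with p_j = |delta_j| + eps so zero-error samples keep a nonzero weight."""
    p = [abs(d) + eps for d in td_errors]
    m = max(x ** sigma for x in p)
    return [alpha * (x ** sigma) / m + beta for x in p]

probs = sampling_probs([0.0, 0.5, 2.0])
print(probs[2] > probs[1] > probs[0])  # larger |TD error| -> higher probability
```

Setting `sigma=0.0` makes every weight equal, recovering uniform sampling as stated above.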

The rerouting algorithm (I-DQN) for all services, based on the DQN framework, is divided into two phases: in phase one, the available rerouting set and the path metrics for each service, such as path survivability and site difference level, are calculated; in phase two, DQN is adopted to obtain the optimal path combination for the services. The details of the algorithm are as follows: