In this section, we define Q-learning and the curiosity-driven mechanism as applied to the VNE problem and present a description of the virtual network multi-objective embedding algorithm based on Q-learning and curiosity-driven exploration (Q-CD-VNE).

### VNE based on Q-learning algorithm

We employ the Q-learning algorithm of reinforcement learning to solve the MDP. It allows an agent to automatically determine the ideal behavior within a specific environment so as to maximize its performance; only simple reward feedback is required for the agent to learn its behavior. The agent's environment in each episode is described as a state set S, in which it can perform any possible action from an action set A, as depicted in Fig. 1. Each time an action *a*_{t} is performed in the state *s*_{t}, the agent receives a reward *r*_{t}, thus generating a series of states *s*_{i}, actions *a*_{i}, and rewards *r*_{i} until the end of the episode. The algorithm iteratively finds the action sequence *π* that maximizes the total reward as the optimal strategy, as shown in Eq. (14).

$$ {\pi}^{\ast}\left(\mathrm{S}\right)=\underset{a}{\arg\ \max}\left[r\left(s,a\right)+\gamma {v}^{\ast}\left(\delta \left(s,a\right)\right)\right] $$

(14)
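Equation (14) can be read as greedy policy extraction: in each state, choose the action maximizing the immediate reward plus the discounted optimal value of the successor state. A minimal Python sketch under assumed toy data (the three-state chain, reward function, discount, and *v*\* values below are illustrative placeholders, not from the paper):

```python
# Sketch of Eq. (14): pi*(s) = argmax_a [ r(s,a) + gamma * v*(delta(s,a)) ].
# The MDP below is a hypothetical 3-state chain used only for illustration.
GAMMA = 0.9  # discount factor

def delta(s, a):
    # Deterministic transition: "next" advances toward state 2, "stay" does not.
    return min(s + 1, 2) if a == "next" else s

def r(s, a):
    # Reward 1 for the action that first reaches the goal state 2.
    return 1.0 if delta(s, a) == 2 and s != 2 else 0.0

v_star = {0: 0.9, 1: 1.0, 2: 0.0}  # assumed optimal state values

def greedy_policy(s):
    # argmax over actions of [ r(s,a) + gamma * v*(delta(s,a)) ]
    return max(["stay", "next"],
               key=lambda a: r(s, a) + GAMMA * v_star[delta(s, a)])
```

For instance, in state 1 the "next" action scores 1.0 + 0.9·0 = 1.0 versus 0 + 0.9·1.0 = 0.9 for "stay", so the greedy policy advances.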

To find the optimal strategy, an approximate behavioral value function Q is usually used. After a virtual node is successfully mapped, the system delivers a *Q* value to the agent. The *Q*-value matrix is obtained by gradually approximating the behavior value function *Q*(*s*_{t}, *a*_{t}). This matrix can be used in the next episode to quickly find the node with the highest reward value for embedding.

The *Q*-value update strategy function is expressed as:

$$ Q\left(s,a\right)=R\left(s,a\right)+\gamma \underset{a^{\prime }}{\max }Q\left(\delta \left(s,a\right),{a}^{\prime}\right) $$

The estimate of *Q* is expressed as:

$$ \widehat{Q}\left(s,a\right)\leftarrow R\left(s,a\right)+\gamma \underset{a^{\prime }}{\max}\widehat{Q}\left({s}^{\prime },{a}^{\prime}\right) $$

(15)

The convergence of \( \widehat{Q} \) to *Q* has been proven [17]. Here, *R*(*s*, *a*) is the reward value of action *a* in state *s*, *Q*(*δ*(*s*, *a*), *a*^{′}) is the *Q* value over all next actions *a*^{′} after action *a* is performed in state *s*, and *γ* is the discount factor, which weighs the current reward value against subsequent reward values.
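The tabular update in Eq. (15) can be sketched in a few lines of Python; the state and action labels, reward values, and the value of *γ* below are placeholders, not the paper's experimental settings:

```python
# Sketch of the tabular Q-learning update of Eq. (15):
#   Q(s,a) <- R(s,a) + gamma * max_{a'} Q(s', a')
from collections import defaultdict

GAMMA = 0.8  # discount factor (illustrative value)

Q = defaultdict(float)  # Q matrix, initialised to all zeros


def q_update(s, a, reward, s_next, actions_next):
    """Apply Eq. (15) for one (state, action) pair and return the new Q value."""
    best_next = max((Q[(s_next, a2)] for a2 in actions_next), default=0.0)
    Q[(s, a)] = reward + GAMMA * best_next
    return Q[(s, a)]
```

For example, updating a terminal-step action first (`q_update("S2", "b_B", 1.0, "S3", [])` yields 1.0) and then its predecessor propagates the discounted value backwards, which is exactly how the *Q* matrix fills in over episodes.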

At a time *t*, the *Q* value update process of the virtual network node mapping is shown in Fig. 2:

The solid arrows in Fig. 2 represent the sequence of actions that have been mapped successfully, and the dashed arrows point to the selectable mapping nodes. The figure shows the network diagram for selecting the next action, that is, embedding the virtual node *b* after the virtual node *a* has been successfully mapped to the physical node A. The detailed process is described below.

$$ {S}_1=\left(\left({N}_t^a,{N}_t^b\right),\left({N}_t^A,{N}_t^B,{N}_t^C,{N}_t^D\right)\right) $$

After the action \( {n}_t^a\_{n}_t^A \) is performed in state *S*_{1}, the agent arrives at state *S*_{2}.

$$ {S}_2=\left(\left({N}_t^b\right),\left({N}_t^B,{N}_t^C,{N}_t^D\right)\right) $$

At state *S*_{2}, the mappable action set \( {A}_2\left\{{n}_t^b\_{n}_t^B,{n}_t^b\_{n}_t^C,{n}_t^b\_{n}_t^D\right\} \) satisfies the resource requirements of the VNRs. The *Q* matrix is initialized to an all-zero matrix. The *Q* values \( \widehat{Q}\left({S}_2,{n}_t^b\_{n}_t^B\right) \), \( \widehat{Q}\left({S}_2,{n}_t^b\_{n}_t^C\right), \) and \( \widehat{Q}\left({S}_2,{n}_t^b\_{n}_t^D\right) \) are calculated according to the *Q*-value update strategy formula (15) when the system selects an action, as follows.

$$ \widehat{Q}\left({S}_2,{A}_2\right)\leftarrow R\left({S}_2,{A}_2\right)+\gamma \underset{a^{\prime }}{\max}\widehat{Q}\left({s}^{\prime },{a}^{\prime}\right)=\max \left({R}_{S_2}^e\right)+\gamma \underset{a^{\prime }}{\max}\widehat{Q}\left({s}^{\prime },{a}^{\prime}\right)=\max \left({n}_t^b\_{n}_t^B,{n}_t^b\_{n}_t^C,{n}_t^b\_{n}_t^D\right)+\max \left(0,0,0\right)=1-\min \left(f\left(\mathrm{energy}\right)\right) $$

For simplicity, we ignored the values of *ω*, *γ*, and the internal signals. Here, max_{energy} is calculated as follows.

$$ {\max}_{\mathrm{energy}}=\max \left({P}_{\mathrm{idle}}+\left(\mathrm{CPU}(i)-{P}_{\mathrm{idle}}\right)\frac{\mathrm{ReqCPU}(i)}{\mathrm{CPU}(i)}\right) $$

where CPU(*i*) ∈ [*a*, *b*] and ReqCPU(*i*) ∈ [*c*, *d*].

We plotted this function in MATLAB and found its maximum. The function attains an extreme value in the interval at ReqCPU(*i*) = *c* and CPU(*i*) = *b*. Using the above expressions, we can obtain the action with the largest reward value. After this action is taken, its reward value is recorded in the *Q* matrix as the physical-node evaluation standard of the mapping, and the next state *S*_{3} begins. This is repeated until the *Q*-value matrix converges, after which the system can select the next action as per the current *Q*-value matrix while taking both energy saving and the VNR acceptance rate into account.
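The power term above can also be checked numerically instead of in MATLAB. The sketch below grid-searches the expression over both intervals; the value of *P*_idle and the interval bounds are illustrative assumptions, not values from the paper:

```python
# Numeric check of the per-node power term
#   P_idle + (CPU(i) - P_idle) * ReqCPU(i) / CPU(i)
# over CPU(i) in [a, b] and ReqCPU(i) in [c, d].
P_IDLE = 150.0  # assumed idle power of a physical node


def node_power(cpu, req_cpu):
    return P_IDLE + (cpu - P_IDLE) * req_cpu / cpu


def extremes(a, b, c, d, steps=50):
    """Return (min, max) of the power term on an evenly spaced grid,
    including the interval endpoints."""
    grid = [(a + (b - a) * i / steps, c + (d - c) * j / steps)
            for i in range(steps + 1) for j in range(steps + 1)]
    vals = [node_power(cpu, req) for cpu, req in grid]
    return min(vals), max(vals)
```

Because the term rewrites as P_idle + ReqCPU·(1 − P_idle/CPU), it is monotone in each variable when CPU > P_idle, so the extremes land on corners of the rectangle; the grid search confirms which corners for a given parameter choice.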

### The reward calculation method based on curiosity-driven

In this section, we present a curiosity-driven mechanism that generates internal signals and passes them to the external reward generator in order to mine other non-deterministic factors. This method achieves a trade-off between exploration and exploitation and avoids falling into a local optimum.

Traditional reinforcement learning methods lack a prediction of the unknown environment and always make the decision with the highest immediate reward (exploitation-only). This causes a mismatch between the exploitation value and the exploration value and eventually leads to a local optimum. For example, in the virtual network mapping process, after many iterations the agent tends to repeatedly select nodes that performed well in previous training. This may deteriorate the connectivity of the substrate network and magnify its fragmentation rate. It follows that considering only deterministic factors makes it easy to overlook the influence of non-deterministic factors. We therefore incorporate a prediction of the environment when selecting action *a*, so that the agent can adjust the mapping strategy before fragments form in the substrate network.

As shown in Fig. 3, the agent incorporates a curiosity-driven mechanism, adding internal signals to external rewards and predicting the results of its own actions in a self-monitoring manner. During the iterative process, the internal curiosity mechanism motivates the agent to test its own action predictions by exploring the environment.

The curiosity-driven mechanism consists of an external reward generator and an internal signal generator, as described in the following paragraphs.

#### External reward generator

It calculates rewards from deterministic factors, such as energy saving and the VNR acceptance rate. This reward is called the external reward \( {r}_t^e \) and is calculated using Eq. (13).

#### Internal signal generator

The agent trains its proficiency by responding to changes in the environment: it predicts the next environment and thereby builds familiarity with its surroundings, so that corresponding adjustments can be made before the environment actually changes. It then explores other factors that affect VNE performance to generate signals that guide the total reward, denoted \( {r}_t^i \). The calculation proceeds as described below.

#### Step 1

It predicts the action \( {\widehat{a}}_t \) that causes the change in state by training a neural network, which amounts to learning a function *g*(∙). The current state is denoted ∅(*s*_{t}); the network takes ∅(*s*_{t}) and ∅(*s*_{t + 1}) as inputs and predicts the action \( {\widehat{a}}_t \) that takes the agent from state *s*_{t} to state *s*_{t + 1}, defined as

$$ {\widehat{a}}_t=g\left({s}_t,{s}_{t+1};{\theta}_I\right) $$

where \( {\widehat{a}}_t \) is the predicted value of the action *a*_{t}, *θ*_{I} is trained by \( \underset{\theta }{\min }{L}_I\left({\widehat{a}}_t,{a}_t\right) \), and *L*_{I} is the softmax loss function that measures the difference between the predicted and actual behavior.

#### Step 2

It predicts the next state \( \widehat{\varnothing}\left({s}_{t+1}\right) \) generated by the action \( {\widehat{a}}_t \) under the current environment ∅(*s*_{t}) by training another neural network, which amounts to learning a function *f*(∙). The state at *t* + 1 is predicted by taking \( {\widehat{a}}_t \) and ∅(*s*_{t}) as inputs.

$$ \widehat{\varnothing}\left({s}_{t+1}\right)=f\left(\varnothing \left({s}_t\right),{\widehat{a}}_t;{\uptheta}_{\mathrm{F}}\right) $$

where \( \widehat{\varnothing}\left({s}_{t+1}\right) \) is the predicted value of ∅(*s*_{t + 1}) and *θ*_{F} is trained by minimizing the loss function *L*_{F}:

\( \underset{\theta_F}{\min }{L}_F\left(\varnothing \left({s}_{t+1}\right),\widehat{\varnothing}\left({s}_{t+1}\right)\right)=\frac{1}{2}{\left\Vert \widehat{\varnothing}\left({s}_{t+1}\right)-\varnothing \left({s}_{t+1}\right)\right\Vert}_2^2 \).
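Steps 1 and 2 can be sketched as two small predictors trained by SGD. The paper does not specify the network architecture, so the following uses single linear layers in plain numpy with illustrative dimensions and learning rates; `inverse_step` corresponds to *g*(∙) with the softmax loss *L*_{I}, and `forward_step` to *f*(∙) with the squared loss *L*_{F}:

```python
# Sketch of the inverse model g (predicts a_t from phi(s_t), phi(s_{t+1}))
# and the forward model f (predicts phi(s_{t+1}) from phi(s_t), a_t).
# Single linear layers; dimensions and learning rates are assumptions.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 3
W_inv = rng.normal(0, 0.1, (N_ACTIONS, 2 * STATE_DIM))          # theta_I
W_fwd = rng.normal(0, 0.1, (STATE_DIM, STATE_DIM + N_ACTIONS))  # theta_F


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def inverse_step(phi_t, phi_t1, a_t, lr=0.1):
    """One SGD step on the softmax loss L_I(a_hat, a_t); returns a_hat."""
    global W_inv
    x = np.concatenate([phi_t, phi_t1])
    p = softmax(W_inv @ x)
    W_inv -= lr * np.outer(p - np.eye(N_ACTIONS)[a_t], x)  # dL_I/dW
    return int(np.argmax(p))


def forward_step(phi_t, a_t, phi_t1, lr=0.1):
    """One SGD step on L_F = 0.5 * ||phi_hat - phi(s_{t+1})||^2; returns phi_hat."""
    global W_fwd
    x = np.concatenate([phi_t, np.eye(N_ACTIONS)[a_t]])
    phi_hat = W_fwd @ x
    W_fwd -= lr * np.outer(phi_hat - phi_t1, x)  # dL_F/dW
    return phi_hat
```

Repeating both steps on observed transitions drives \( {\widehat{a}}_t \) toward the actual action and \( \widehat{\varnothing}\left({s}_{t+1}\right) \) toward the observed next state, which is all Step 3 needs from these two models.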

#### Step 3

It calculates the internal signal \( {r}_t^i \) from the prediction of the next state:

$$ {r}_t^i=\left\{\begin{array}{cl}0, & \mathrm{if}\ \widehat{\varnothing}\left({s}_{t+1}\right)\ \mathrm{causes}\ \mathrm{the}\ \mathrm{SN}\text{'s}\ \mathrm{connectivity}\ \mathrm{to}\ \mathrm{change}\\ {}{e}^{-\frac{{\left(\widehat{\varnothing}\left({s}_{t+1}\right)-\varnothing \left({s}_{t+1}\right)\right)}^2}{{\left(\varnothing \left({s}_{t+1}\right)-\widehat{\varnothing}\left({s}_{t+1}\right)\right)}^2}}, & \mathrm{otherwise}\end{array}\right. $$

This assignment method follows reference [18]. After the internal signal and the external reward generated by the deterministic factors are computed, they are superimposed as a new reward to guide the agent's decision. Therefore, *r*_{t} is defined as

$$ {r}_t={r}_t^i+{r}_t^e $$
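A small sketch of Step 3 and the reward superposition follows. The connectivity check is a hypothetical placeholder predicate, and the second branch of the piecewise formula is read literally: since numerator and denominator are the same squared difference, the exponent is −1 whenever the prediction differs from the actual state, and we treat a perfect prediction as zero signal:

```python
# Sketch of the internal signal r_t^i and the total reward r_t = r_t^i + r_t^e.
# `breaks_connectivity` stands in for the SN-connectivity test, which the
# paper performs on the substrate network; here it is just a boolean flag.
import math


def internal_signal(pred_next, actual_next, breaks_connectivity):
    if breaks_connectivity:
        return 0.0
    # Taken literally, the exponent of the second branch is -1 for any
    # imperfect prediction; a perfect prediction yields no signal.
    return math.exp(-1.0) if pred_next != actual_next else 0.0


def total_reward(r_external, pred_next, actual_next, breaks_connectivity=False):
    """Superimpose the internal signal onto the external reward r_t^e."""
    return r_external + internal_signal(pred_next, actual_next,
                                        breaks_connectivity)
```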

In summary, in the proposed method the agent obtains the current VNE environment, such as the state of the substrate network resources, the link connection status, and the request volume of the virtual network, through the MDP model. At the beginning of each episode, the first virtual node is randomly mapped to a physical node in the set of executable actions; the curiosity-driven mechanism then obtains the total reward value (composed of internal signals and external rewards), records it in the *Q* matrix, and moves to the next state *s*_{t + 1}. Through this adaptive learning scheme, potential impact factors can be mined while taking energy saving and VNR acceptance rates into account, yielding a globally optimal mapping method.

### Algorithm description

It can be concluded from the process described above that the Q-learning algorithm combined with the curiosity-driven mechanism can ensure energy-saving and VNR acceptance-rate performance while avoiding local optima. The proposed algorithm is described below.