There are three main DL models employed in the current CF M-MIMO literature: feed-forward neural networks (FFNNs), recurrent neural networks (RNNs) and convolutional neural networks (CNNs). In this section, these models are briefly discussed, with further references provided for the interested reader. In addition, Reinforcement Learning (RL), one of the three major ML paradigms (the other two being Supervised Learning and Unsupervised Learning), is briefly presented.

### Feed-forward neural networks

The key component of every DL architecture is the notion of the artificial neuron. Artificial neurons are computational units (functions) that try to mimic, in mathematical terms, the behavior of biological neurons. The functionality of such a unit is very simple: the neuron takes some input and produces an output (or an activation, by analogy with the action potential of a biological neuron). For inputs in vector form, each individual input is weighted separately, and the weighted sum is then passed through a nonlinear function known as an activation function [7].

The FFNN was the first proposed architecture composed of multiple artificial neurons. In such a model, the connections between the nodes form neither a cycle nor a loop, and information is transferred only forward. Training is carried out with the stochastic gradient descent (SGD) algorithm, whose pseudo-code is given in Alg. 1.

where \(\eta\) is the learning rate, *w* the weight and *C* the loss function (cost), which measures the distance between the algorithm’s current output and the expected output.
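The update rule above can be sketched in code as follows; the single-sample squared loss and the numerical values are illustrative assumptions, not part of Alg. 1 itself.

```python
# Illustrative sketch of repeated SGD steps, with symbols matching the text:
# eta is the learning rate, w the weight, C the loss. The (assumed) loss is
# C(w) = (w*x - y_true)**2 for a single sample, so dC/dw = 2*(w*x - y_true)*x.

def sgd_step(w, x, y_true, eta):
    grad = 2.0 * (w * x - y_true) * x  # gradient of the squared loss
    return w - eta * grad              # w <- w - eta * dC/dw

w = 0.0
for _ in range(100):  # repeated small steps drive the loss down
    w = sgd_step(w, x=2.0, y_true=4.0, eta=0.05)
# w approaches 2.0, the minimizer of C (since 2.0 * 2.0 == 4.0)
```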

There are many techniques for making FFNNs work efficiently, but the most frequently used is the back-propagation algorithm [7]. To compute the gradient of the loss function, back-propagation applies the chain rule from calculus, computing the gradient one layer at a time and iterating backward from the last layer. In Fig. 2, the basic architecture of a FFNN with one hidden layer is shown.

The mathematical formulation of FFNNs (that can be extended to other DL architectures) considers an input vector \({\mathbf {x}}\), a set of weights \({\mathbf {w}}\), a bias *b* and an activation function *f*. The output of the last layer is

$$\begin{aligned} {\mathbf {y}} = f({\mathbf {w}}^{T}\cdot {\mathbf {x}}+b) \end{aligned}$$

(1)

As mentioned before, this output is compared with the expected output vector.
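As a minimal sketch of Eq. (1) in code, the following computes a single-neuron forward pass; the sigmoid activation and the numerical values are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    # an assumed choice for the activation function f
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.1,  0.4, -0.2])  # weight vector
b = 0.3                          # bias

y = sigmoid(np.dot(w, x) + b)    # Eq. (1): y = f(w^T x + b)
```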

### Recurrent neural networks

As stated before, a FFNN’s architecture contains neither cycles nor loops (it is an acyclic graph). Recurrent Neural Networks (RNNs) introduce a different approach: they adopt a notion of memory by utilizing their internal state to process variable-length sequences of inputs [7, 11]. This characteristic makes them suitable for tasks such as time series forecasting, handwriting recognition and speech recognition [11]. There are many RNN architectures, such as Gated Recurrent Units (GRUs), Bi-directional RNNs, Hopfield networks, etc.

A very successful variant of RNNs is the long short-term memory (LSTM) network [12, 13]. The building blocks of LSTMs are cells, each of which has an input gate, an output gate and a forget gate. The main advantage over classical RNNs is that this type of cell is capable of remembering values over arbitrary time intervals, with the flow of information regulated by the three aforementioned gates [12].

In mathematical terms, a LSTM network can be formulated in the following way. For an input vector \({\mathbf {x}}_{t}\in {\mathbb {R}}^{N}\) at time step *t*, and *M* hidden units, the forget gate’s activation vector \({\mathbf {F}}_{t}\in (0,1)^{M}\) is given by

$$\begin{aligned} {\mathbf {F}}_{t} = \sigma (W_{F}{\mathbf {x}}_{t}^{T} + U_{F}{\mathbf {q}}_{t-1}^{T} + {\mathbf {b}}_{F}) \end{aligned}$$

(2)

where \(W_{F}\) and \(U_{F}\) are weight matrices, \({\mathbf {q}}_{t}\in (-1,1)^{M}\) is the hidden state vector and \({\mathbf {b}}_{F}\) is the bias vector.

In addition, the input/update gate’s activation vector \({\mathbf {I}}_{t}\in (0,1)^{M}\) and the output’s activation vector \({\mathbf {O}}_{t}\in (0,1)^{M}\) are expressed in a similar way,

$$\begin{aligned} {\mathbf {I}}_{t} = \sigma (W_{I}{\mathbf {x}}_{t}^{T} + U_{I}{\mathbf {q}}_{t-1}^{T} + {\mathbf {b}}_{I}) \end{aligned}$$

(3)

and

$$\begin{aligned} {\mathbf {O}}_{t} = \sigma (W_{O}{\mathbf {x}}_{t}^{T} + U_{O}{\mathbf {q}}_{t-1}^{T} + {\mathbf {b}}_{O}) \end{aligned}$$

(4)

where the subscripts *I* and *O* denote input and output, respectively, and the other symbols have the same meaning as previously.

A LSTM unit also has a cell input activation vector \({\mathbf {C}}_{t}\in (-1,1)^{M}\), which is given by

$$\begin{aligned} {\mathbf {C}}_{t} = \tanh (W_{C}{\mathbf {x}}_{t}^{T} + U_{C}{\mathbf {q}}_{t-1}^{T} + {\mathbf {b}}_{C}) \end{aligned}$$

(5)

Combining the above equations, the cell state vector and the hidden state vector are updated with the following rules

$$\begin{aligned} {\mathbf {S}}_{t} = {\mathbf {F}}_{t} \circ {\mathbf {S}}_{t-1} + {\mathbf {I}}_{t} \circ {\mathbf {C}}_{t} \end{aligned}$$

(6)

where \(\circ\) denotes the Hadamard product, \({\mathbf {S}}_{0}={\mathbf {0}}\) and \({\mathbf {q}}_{0}={\mathbf {0}}\). Finally,

$$\begin{aligned} {\mathbf {q}}_{t} = {\mathbf {O}}_{t}\circ \tanh ({\mathbf {S}}_{t}) \end{aligned}$$

(7)
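Equations (2)–(7) can be sketched as a single LSTM step in code; the dimensions, random weights and zero biases below are illustrative assumptions, and \(\tanh\) is used for the cell input activation so that \({\mathbf {C}}_{t}\) lies in \((-1,1)^{M}\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, q_prev, S_prev, W, U, b):
    F = sigmoid(W["F"] @ x_t + U["F"] @ q_prev + b["F"])  # Eq. (2), forget gate
    I = sigmoid(W["I"] @ x_t + U["I"] @ q_prev + b["I"])  # Eq. (3), input gate
    O = sigmoid(W["O"] @ x_t + U["O"] @ q_prev + b["O"])  # Eq. (4), output gate
    C = np.tanh(W["C"] @ x_t + U["C"] @ q_prev + b["C"])  # Eq. (5), cell input
    S = F * S_prev + I * C                                # Eq. (6), cell state
    q = O * np.tanh(S)                                    # Eq. (7), hidden state
    return q, S

N, M = 4, 3                        # assumed input and hidden dimensions
rng = np.random.default_rng(0)     # random weights for illustration only
W = {k: rng.standard_normal((M, N)) for k in "FIOC"}
U = {k: rng.standard_normal((M, M)) for k in "FIOC"}
b = {k: np.zeros(M) for k in "FIOC"}

# one step from the zero initial state (S_0 = 0, q_0 = 0)
q, S = lstm_step(rng.standard_normal(N), np.zeros(M), np.zeros(M), W, U, b)
```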

Although the DL research field seems to move towards replacing LSTMs with Transformers for many tasks [14], LSTMs still remain one of the most commonly used architectures.

### Convolutional neural networks

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has proved to be a driving force for many advancements in computer vision (CV) [15]. The dominant paradigm in this field is the application of convolutional neural networks (CNNs), which utilize the convolution operation instead of general matrix multiplication [7, 16]. CNNs have found success not only in CV tasks but also in time series forecasting, video processing, natural language processing, etc. [7].

CNNs are composed of at least one convolutional layer, often combined with fully connected layers and pooling layers, as shown in Fig. 3. Pooling layers reduce the size of the incoming data. In contrast to a fully connected layer, each neuron in a convolutional layer has a so-called receptive field, meaning that every single neuron receives input from only a restricted area of the previous layer.

Most CNNs use, at some point, the Rectified Linear Unit (ReLU) function, or one of its variants, as an activation function. ReLU is simply defined as [7],

$$\begin{aligned} f(x) = \max (0,x) \end{aligned}$$

(8)

Despite its mathematical simplicity, its use has proved valuable in avoiding over-fitting.
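As a minimal sketch of these ideas, the following applies a small convolution over restricted receptive fields and then the ReLU of Eq. (8); the 3×3 input and 2×2 kernel are illustrative assumptions.

```python
import numpy as np

def relu(x):
    # Eq. (8): f(x) = max(0, x), applied elementwise
    return np.maximum(0, x)

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # each output neuron sees only a small receptive field
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1., -2., 3.],
                  [0.,  1., -1.],
                  [2.,  0.,  1.]])
kernel = np.array([[1., 0.],
                   [0., 1.]])

feature_map = relu(conv2d_valid(image, kernel))
# negative convolution outputs are zeroed; positive ones pass through
```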

### Reinforcement learning

Reinforcement Learning (RL) is one of the three major ML paradigms [7]. In this framework, a learning agent acts within its environment and learns through trial and error. By finding a balance between exploration of an “unknown space” and exploitation of its “current knowledge”, the agent maximizes a cumulative reward [17]. Recently, the combination of DL architectures with RL in a unified setup, known as deep RL (DRL), has provided solutions to many difficult problems [18].

RL has many similarities with the fields of dynamic programming and optimal control [19]. In this context, the environment is often modeled as a Markov decision process (MDP); however, RL methods do not always assume knowledge of an exact model of the environment. Formally, a MDP is defined as a 4-tuple \((\textit{S}, \textit{A}, \textit{P}, \textit{Q})\), where *S* is the state space, *A* is the action space, *Q* is the immediate reward, and the probability that action \(a_{1}\) in state \(s_{1}\) at time *t* will lead to state \(s_{2}\) at time \(t+1\) is given by

$$\begin{aligned} P_{a_{1}}(s_{1},s_{2}) = Pr(s_{t+1}=s_{2}|s_{t}=s_{1}, a_{t}=a_{1}) \end{aligned}$$

(9)

Deep Q-Learning (DQL) is a branch of DRL which is utilized for many tasks in different areas, including wireless communications. In DQL, a Q-value is an estimate of how good it is to take action *A* in state *S* at time *t*. In this way, a matrix is created that the agent can refer to in order to maximize its cumulative reward [17]. Formally, under a policy \(\pi\),

$$\begin{aligned} Q_{\pi }(S_{t}, A_{t}) = {\mathbb {E}}\left[ \sum _{k}\gamma ^{k}R_{t+1+k}|S_{t}, A_{t}\right] , k = 0,1,\ldots \end{aligned}$$

(10)

Since what matters is the value of each matrix entry relative to the other entries, the matrix values can be approximated with a deep neural network.
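Before such an approximation, the matrix itself can be maintained directly; the sketch below applies the standard tabular Q-learning update rule, where the toy two-state, two-action environment and the values of \(\alpha\) and \(\gamma\) are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))  # the matrix the agent refers to

alpha, gamma = 0.5, 0.9              # learning rate and discount factor (assumed)

def q_update(s, a, r, s_next):
    # move Q(s, a) towards the bootstrapped target r + gamma * max_a' Q(s', a')
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=1)  # one observed transition
# Q[0, 1] becomes 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```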

Another frequently used DRL method that exploits DQL characteristics is the deep deterministic policy gradient (DDPG) algorithm. DDPG is an “off-policy” method that consists of two entities: the actor and the critic. The actor is modeled as a policy network; its input is the state and its output is the exact continuous action (instead of a probability distribution over actions). The critic is a Q-value network that takes the state and action as input and outputs the Q-value [20].