Facial image super-resolution guided by adaptive geometric features

This paper addresses the traditional problem of restoring a high-resolution (HR) facial image from a low-resolution (LR) counterpart. Current state-of-the-art super-resolution (SR) methods commonly adopt convolutional neural networks to learn a complex non-linear mapping between paired LR and HR images. They discriminate local patterns expressed by neighboring pixels along the planar directions but ignore the intrinsic 3D proximity encoded in the depth map. As a special case of general images, the face has limited geometric variations, so we believe that the relevant depth map can be learned and used to guide the face SR task. Motivated by this, we design a network with two branches: one for auxiliary depth map estimation and the other for the main SR task. Adaptive geometric features are further learned from the depth map and used to modulate the mid-level features of the SR branch. The whole network is trained end-to-end under the extra supervision of the depth map. The supervisory depth map is either a paired one from RGB-D scans or one reconstructed by a 3D prior model of faces. The experiments demonstrate the effectiveness of the proposed method, which achieves improved performance over the state of the art.


Introduction
Human-centered image and video analysis has gained ever-increasing attention in both academia and industry in recent years. Since the face is a key characteristic of humans, machine-aided facial analysis has become a popular topic in various applications. For example, manufacturers of mobile devices are highly interested in developing both hardware and software systems for the collection of facial images. This paper addresses a traditional problem in the field of facial image analysis, i.e., face super-resolution (SR).
Face super-resolution, also known as face hallucination for heavily blurred images, aims at recovering a high-resolution (HR) facial image from a low-resolution (LR) one. It is fundamental to a number of facial analysis applications, such as face alignment [1,2], face recognition [3,4], and E-commerce platforms [5]. When the acquired facial images are of very low resolution, especially in surveillance videos, it becomes extremely difficult for machines as well as humans to identify useful information therein. Face SR alleviates this problem to some extent by restoring the missing pixels. Face SR is a sub-problem of general image SR, a traditional ill-posed problem that attracts many academic studies and has straightforward applications in the photographic industry. Recently, the advancement of convolutional neural networks (CNNs) has stimulated many studies on both general image and face SR. The state-of-the-art methods [6][7][8][9][10][11] commonly adopt a well-customized CNN structure to learn a complex non-linear mapping between many pairs of LR and HR images, with the CNN structures tending to become deeper and wider as the amount of training data increases. Prior knowledge has recently been used in the form of facial landmarks [1], parsing maps [12], or facial component heatmaps [13] for the special task of face SR. In this paper, we explore another form of prior information, i.e., the facial depth map, which can be incorporated to assist the face SR task.
The motivation originates from an understanding of the basic convolutional operation for the SR task. A deep CNN in fact learns to interpolate the missing pixels based on the local patterns of the input image. The convolution is a translation-invariant operation that deals with neighboring pixels in the 2D image coordinates. For example, the receptive field of a cascade of several convolution layers for a given pixel p is shown in Fig. 1a. However, the intrinsic 3D proximity of the neighboring pixels within the receptive field is not equal to their 2D proximity, as shown in Fig. 1b. This is generally ignored in the literature and may lead to blurring on the edges of faces. We argue that a network for face SR should also consider the 3D proximity of neighboring pixels, which can be inferred from a facial depth map (Fig. 1c).
In this paper, we propose an SR network architecture guided by the depth map to enhance face SR performance. While it is not easy to learn the depth map for a general image, the face has limited geometric variations, from which a relatively accurate shape can be inferred. The proposed network includes two branches: one for auxiliary depth map estimation and the other for the main SR task. Adaptive geometric features are further learned from the depth map and used to modulate the mid-level features of the SR branch. The whole network is end-to-end trainable, with a common SR reconstruction loss and an extra depth supervision loss. During training, supervision by a paired depth map is involved; during testing, only the raw LR input image is required. The experiments are carried out on two publicly available datasets, FRGC v2.0 [14] and FFHQ [15], and demonstrate the effectiveness of the proposed method. The main contributions of this paper are as follows: (1) to our knowledge, we are the first to propose incorporating the depth map into the face SR task; the depth map is either a matched one from RGB-D scans or one reconstructed by a 3D prior model of faces. (2) We build a tailored network to estimate the depth map and assist the main SR task; we use adaptive geometric features to modulate the mid-level features for face SR, and this modulation operation is new and complementary to the general convolutional operations in SR tasks. (3) The proposed method leads to sharper edges in super-resolved images, which is a desirable goal for face SR.
This paper is organized as follows. Section 2 reviews the related works. Section 3 elaborates the proposed network architecture, the loss for training the network, and the way to prepare training data. Section 4 presents the experiments for the validation of the proposed method. Section 5 discusses limitations and future directions for this work. We conclude this paper in Section 6.

Related work
Since face SR belongs to the larger class of general image SR, this section reviews related works on both image SR and face SR, focusing on state-of-the-art deep learning-based methods.

Image super-resolution
The advancement of CNNs has stimulated many studies on image SR. Dong et al. [16] first introduce a three-layer CNN to learn a non-linear mapping between many pairs of LR and HR images. Then, Kim et al. [17] propose a deep CNN with 20 layers to learn the residuals between the paired LR and HR images. Early works focus on improving the performance of SR in terms of quantitative metrics such as the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM).
It was later found that minimizing the mean square error (MSE) favors high PSNR/SSIM but yields results lacking high-frequency details. Ledig et al. [6] introduce a perceptual loss that combines an adversarial loss with a pixel-wise MSE loss on a high-level feature layer of the VGGNet [18] pretrained on the ImageNet classification task. Part of their work, known as SRResNet, uses an advanced CNN structure [19] of that time to achieve better PSNR/SSIM. Since then, many works on image SR have focused not only on PSNR/SSIM but also on better-visualized results that contain high-frequency details. However, there is a trade-off between the two, since the loss function is usually a combination of several terms. While earlier works evaluate the results by visualizing the super-resolved images, current works propose quantitative metrics such as the perceptual index (PI) [20]. In this work, we focus on enhancing the performance of face SR in terms of PSNR and SSIM, since this also sets a higher baseline for results of good perceptual quality.

Face super-resolution
Exploiting facial priors, such as the spatial locations of landmarks, is the key difference between face SR and general image SR. Many existing works are dedicated to face SR using prior knowledge of the face. Before the advent of deep learning-based methods, early works [21,22] attempted to learn the super-resolved faces in low-dimensional representations. This reduces the dimensionality of the original ill-posed problem, thus leading to more realistic restored facial images. Although these methods only deal with the frontal pose, they provide clues for learning face SR from the intrinsic facial structures. The recent CNN-based face SR methods [1,12,13,23,24] have progressed a lot, in terms of both quantitative metrics such as PSNR/SSIM and perceptual visual quality. The CNN-based methods do not totally replace the classical methods, but rather combine and re-implement the old ideas with a powerful new tool. Meanwhile, CNN-based face alignment has also progressed a lot, reducing the difficulty of face SR under pose variations. Song et al. [24] propose a facial structure generation network to restore a coarse HR face and use exemplar HR faces for detail enhancement. Zhu et al. [25] super-resolve unaligned faces of very low resolution in a task-alternating cascade framework. Since more accurate face alignment promotes better SR results and vice versa, the task-alternating framework improves both SR and face alignment. More recent works [1,12,13] implement multitask networks trained in an end-to-end manner. They use facial landmarks, parsing maps, or facial component heatmaps, which are in fact different forms of face alignment. The aforementioned methods belong to single image-based SR. In contrast, some other works [26,27] super-resolve a facial image with the help of a high-quality guide image.
In this work, we propose to use the facial depth map, which is an intrinsic property of face to assist the face SR task.

Proposed method
In this section, we introduce the detailed network architecture, loss for training the network, and ways to prepare supervisory depth maps.

Network architecture
The architecture of the proposed network is illustrated in Fig. 2, and the detailed structures of the reused units are specified in Fig. 3 in different colors. The output channel size (n), the kernel size (k), and the stride (s) are indicated for each convolutional layer. We do not mark the input channel sizes since they can be deduced from the output channel sizes of the preceding layers; the input and output layers should also be compatible with the input and output images. In summary, the network consists of three sub-blocks: the main SR block, the depth estimation block, and the modulation block, with the detailed architectures summarized in Table 1.
The main SR block is designed based on the basic network architecture of SRResNet [6]. We adopt the residual unit as the basic element of the network. The residual unit has two main advantages: (1) it makes the network deeper at minor computational cost, and (2) the skip connection (residual) helps to alleviate vanishing and exploding gradients and makes the network easier to train. We also adopt two improvements from [28] and [29]: removing the batch normalization layers and adding rescaling to the residual unit (see Fig. 3), both of which enhance the performance of the network. In addition, the designed network learns the global residual in two stages rather than one as in [6]. The purpose is to preserve the full information of the mid-level features from the input image for modulation, while still keeping the local and global residual structures of [6].
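The residual unit described above (no batch normalization, rescaled residual branch) can be sketched as follows in PyTorch. This is a minimal illustration, not the paper's exact unit: the channel count, kernel size, and the rescaling factor of 0.1 are common SRResNet/EDSR-style choices assumed here, not values confirmed by Fig. 3.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual unit without batch normalization, with residual rescaling.

    A sketch of the basic element described in the text; `channels` and
    `res_scale` are assumed defaults, not the paper's exact settings.
    """
    def __init__(self, channels=64, res_scale=0.1):
        super().__init__()
        # Two 3x3 convolutions with a ReLU in between, no BN layers.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.res_scale = res_scale

    def forward(self, x):
        # Skip connection plus rescaled residual: x + s * F(x)
        return x + self.res_scale * self.body(x)
```

The rescaling keeps the residual branch's contribution small at initialization, which stabilizes training of deep stacks of such units.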
The depth estimation block adopts an hourglass architecture inherited from the well-known U-Net [30]. The feature concatenations across the different feature expansions enable the network to learn the depth map at different feature levels and in a more robust manner. We design this architecture to estimate the depth map from an RGB image. This task is commonly difficult for general images in an uncontrolled environment. However, it is a popular topic worthy of study for indoor scenes [31] because of their limited geometric variations. The face likewise has limited geometric structure, so we believe that the depth can be learned from the RGB image.
The modulation block uses a cascade of several convolutional layers (each followed by ReLU [32] activation) to learn adaptive geometric features from the depth map. The learned features are then fed into a mid-level layer of the main SR block. The mid-level feature layer is used for modulation because it has a medium-sized receptive field that covers the local image patterns. The modulation is implemented as an element-wise product between the two feature maps. Since each individual convolutional feature can be seen as a (non-linear) combination of neighboring pixels within the receptive field, the modulation in fact weights the individual convolutional features differently. The purpose is to adaptively adjust the convolutional features according to the geometric features learned from the depth map. This blends the useful 3D geometric information into the main SR task.
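The modulation operation above can be sketched as a small PyTorch module. This is an illustrative sketch only: the number of convolutional layers, the channel width, and the single-channel depth input are assumptions, not the exact configuration of Fig. 3.

```python
import torch
import torch.nn as nn

class ModulationBlock(nn.Module):
    """Learn adaptive geometric features from the depth map and modulate
    mid-level SR features by an element-wise product.

    A sketch; `feat_channels` and `num_layers` are assumed values.
    """
    def __init__(self, feat_channels=64, num_layers=3):
        super().__init__()
        layers = []
        in_ch = 1  # single-channel depth map assumed as input
        for _ in range(num_layers):
            layers += [nn.Conv2d(in_ch, feat_channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = feat_channels
        self.geo = nn.Sequential(*layers)

    def forward(self, depth, mid_features):
        # Element-wise product reweights each convolutional feature
        # according to the geometric features learned from the depth map.
        return mid_features * self.geo(depth)
```

The element-wise product requires the geometric feature map and the mid-level SR feature map to share spatial size and channel count, which the cascade above enforces.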

Loss for training the network
Recently, many state-of-the-art works [1,6,28,33,34] on SR have focused not only on reducing the mean square error (MSE) between the reconstructed and HR images, but also on better-visualized results containing high-frequency details. As a result, they use different kinds of loss functions, e.g., the perceptual loss [6] and the adversarial loss [35]. However, there is a trade-off [20] between these goals. In this work, we aim to enhance the performance of face SR in the sense of MSE, since this also sets a higher baseline for balancing results of good perceptual quality.
Depth guided loss. The depth guided loss is used for the reconstruction of the guiding depth map. We employ the pixel-wise MSE loss

    L_depth = ||I_depth − Î_depth||²,  with  Î_depth = D(I_LR),   (1)

where I_depth and Î_depth are the supervisory and estimated depth maps, respectively, D(·) is the network of the depth block, and I_LR is the input LR image.
SR loss. We also adopt the common pixel-wise MSE loss as the SR loss

    L_SR = ||I_HR − I_SR||²,   (2)

where I_HR and I_SR are the HR image and the image super-resolved by the whole network, respectively.
Total loss. We set the total loss as the combination of the depth guided loss and the SR loss

    L_total = L_SR + α · L_depth,   (3)

where α is an adjustable parameter. In our experiments, the quantitative results on the validation set are almost stable over a wide range of α as long as the two loss terms are numerically comparable, and we set α = 5. The whole network is trained in an end-to-end manner with the total loss in Eq. 3 (also refer to Fig. 2).
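The combined training objective can be written compactly in PyTorch. A sketch under stated assumptions: the function name `total_loss` is ours, and the placement of α on the depth term is our reading of the text (the original equation is garbled in the source); α = 5 follows the reported setting.

```python
import torch
import torch.nn.functional as F

def total_loss(sr, hr, depth_pred, depth_gt, alpha=5.0):
    """Total training loss: pixel-wise SR MSE plus alpha-weighted
    pixel-wise depth MSE. A sketch; the weighting of the depth term
    by alpha is an assumption about the garbled original equation."""
    l_sr = F.mse_loss(sr, hr)              # SR reconstruction loss
    l_depth = F.mse_loss(depth_pred, depth_gt)  # depth guided loss
    return l_sr + alpha * l_depth
```

Since both terms are plain MSE losses, the network remains end-to-end trainable with a single backward pass through this scalar.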

Preparing the depth maps
In this section, we describe two ways to prepare matched depth maps for training the face SR network.

Processing matched depth scans
One way is to obtain the depth data directly from raw RGB-D cameras. Since the raw depth scans may contain many outliers and holes, we use the basic morphological operations of image erosion and dilation to handle them. Figure 4 shows an example. We first use erosion operations to remove the outliers from the valid region (mask) of the depth scans and then use dilation operations to fill the holes. The structuring array used in this work is as follows.
After the erosion and dilation are applied to the facial mask, we use bicubic interpolation to recover the missing pixels within the mask, as shown in Fig. 5. This generally provides high-quality depth maps but is restricted to matched RGB-D datasets.
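The erode-dilate-interpolate pipeline can be sketched with scipy. This is an illustrative sketch, not the paper's implementation: the function name `clean_depth_scan` is ours, the paper's structuring array is not reproduced in the source, so scipy's default 3x3 cross-shaped structuring element is assumed, and `griddata` with `method='cubic'` stands in for the bicubic fill.

```python
import numpy as np
from scipy import ndimage
from scipy.interpolate import griddata

def clean_depth_scan(depth, valid_mask, iters=1):
    """Clean a raw depth scan: erode the valid mask to drop boundary
    outliers, dilate it back to fill small holes in the mask, then
    interpolate missing depth values (NaNs) inside the mask.

    A sketch; the structuring element and iteration count are assumptions.
    """
    # Erosion removes outlier pixels at the mask boundary.
    mask = ndimage.binary_erosion(valid_mask, iterations=iters)
    # Dilation recovers the face region and closes small mask holes.
    mask = ndimage.binary_dilation(mask, iterations=iters)
    # Collect valid depth samples inside the mask.
    ys, xs = np.nonzero(mask & np.isfinite(depth))
    pts = np.column_stack([ys, xs])
    # Fill the remaining missing pixels by cubic interpolation.
    hole_ys, hole_xs = np.nonzero(mask & ~np.isfinite(depth))
    filled = depth.copy()
    if hole_ys.size:
        filled[hole_ys, hole_xs] = griddata(
            pts, depth[ys, xs], (hole_ys, hole_xs), method='cubic')
    return filled, mask
```

Pixels outside the convex hull of the valid samples would remain NaN under `griddata`; for facial depth scans the holes are interior, so this is usually sufficient.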

3D face depth reconstruction
The other way is to use a prior 3D face model [36,37] to reconstruct the shape from a facial image. We assume that the 3D facial shape follows a Gaussian distribution that can be expressed by a linear principal component analysis (PCA) model. Specifically, let the shape vector of a face be S = (x_1, y_1, z_1, ..., x_l, y_l, z_l)^T ∈ R^{3l}, where l is the number of 3D vertices. After Procrustes alignment, we apply standard PCA to the concatenated shape vectors of all exemplar 3D facial shapes. A new facial shape can then be represented by the eigenvectors s_i of the covariance matrix (in descending order of their eigenvalues) as

    S_β = S̄ + Σ_i β_i σ_i s_i,   (4)

where S̄ is the average shape, β_i is the coefficient of each orthogonal basis vector, and σ_i is the standard deviation of the Gaussian distribution along s_i. This is a linear model for the representation of a facial shape. The reconstruction process minimizes

    E(R, T, s, β) = Σ_i ||x_i − SOP(R, T, s, S_β)_i||²,   (5)

where x_i is the observed 2D position of each pixel and SOP(·) denotes the scaled orthographic projection with rotation R, translation T, scaling factor s, and 3D shape S_β:

    SOP(R, T, s, S_β) = s [1 0 0; 0 1 0] R S_β + T.   (6)

The depth map can then be calculated from the resulting pose and shape parameters.
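The linear shape model of Eq. 4 and the scaled orthographic projection of Eq. 6 can be sketched in numpy. The function names `pca_shape` and `sop_project` are ours, and the layouts (basis eigenvectors as columns, shape as a 3 x l vertex array, T as a 2-vector) are assumptions for illustration.

```python
import numpy as np

def pca_shape(mean_shape, basis, sigmas, betas):
    """Linear PCA face model: S = S_bar + sum_i beta_i * sigma_i * s_i.

    mean_shape: (3l,) average shape; basis: (3l, k) eigenvector columns
    s_i; sigmas, betas: (k,) per-component deviations and coefficients.
    """
    return mean_shape + basis @ (betas * sigmas)

def sop_project(R, T, s, shape3d):
    """Scaled orthographic projection:
    SOP(R, T, s, S) = s * [[1,0,0],[0,1,0]] * R * S + T,
    with shape3d a 3 x l array of vertices and T a 2-vector."""
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])  # truncation to the x-y plane
    return s * (P @ R @ shape3d) + np.asarray(T).reshape(2, 1)
```

Minimizing Eq. 5 then amounts to a non-linear least-squares fit of (R, T, s, β) so that the projected model vertices match the observed 2D positions.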
Finally, we convert the 3D shapes to depth maps with the Z-buffer technique [38] and feed them into the training of the proposed network. Figure 6 shows an example of a reconstructed depth map. This approach generates reasonable depth maps but lacks high-frequency details, being restricted by the specific 3D face prior models [36, 39-43].

Experiments
Datasets. We carry out our experiments on two publicly available datasets. (1) FRGC v2.0 [14] is one of the largest RGB-D datasets with matched depth maps. We remove some problematic faces from the 4007 samples of this dataset and retain 3683 faces of good quality. We select 3500 images as the training set, 100 images as the validation set, and the remaining 83 images as the test set. We then crop the images to obtain the facial regions according to the facial landmarks provided by [44]. The cropped face images are then resized to a resolution of 256 × 256 with matched depth maps.
(2) FFHQ [15] is a recently released high-quality dataset. We select 5000 images as the training set, 100 images as the validation set, and another 100 images as the test set. We then resize the original images to the resolution 256 × 256. The matched depth maps for training are synthesized by a 3D prior face model as described in Section 3. For both datasets, the HR images are downsampled to obtain the LR images of resolution 64 × 64 with the bicubic kernel.
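The LR inputs are obtained from the HR images by bicubic downsampling. A minimal sketch of that preparation step, assuming uint8 RGB arrays; the function name `make_lr` is ours, not the paper's.

```python
import numpy as np
from PIL import Image

def make_lr(hr_array, size=(64, 64)):
    """Downsample an HR image array (H, W, 3) with the bicubic kernel,
    matching the 256 -> 64 (4x) setting used for both datasets.
    A hypothetical helper for illustration."""
    img = Image.fromarray(hr_array.astype(np.uint8))
    return np.asarray(img.resize(size, Image.BICUBIC))
```
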
Training details. We implement the networks on the PyTorch platform. The proposed method is trained with the Adam optimizer [45] with an initial learning rate of 2 × 10^−4. The mini-batch size is set to 32. Other parameters of the optimizer follow the default settings in PyTorch. The learning rate of the proposed network is decayed by a factor of 10 every 100 epochs for a total of 300 epochs. For the networks of the other methods, we fine-tune the learning rates to achieve the best performance. It takes ∼3 h on a GTX 2080Ti GPU to train the proposed model at the image resolution of 256 × 256.
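The optimizer and learning-rate schedule above map directly onto PyTorch primitives. A minimal sketch: the `Conv2d` stand-in replaces the actual network of Fig. 2, which is an assumption for illustration.

```python
import torch

# Stand-in for the actual network of the paper (assumption for brevity).
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Adam with the stated initial learning rate of 2e-4; other optimizer
# parameters keep PyTorch defaults, as in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# Decay the learning rate by a factor of 10 every 100 epochs
# (300 epochs total in the paper's setting).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
```

Calling `scheduler.step()` once per epoch after the optimizer updates yields learning rates of 2e-4, 2e-5, and 2e-6 over the three 100-epoch stages.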
Evaluation metrics. In the experiments, we adopt three quantitative metrics for the evaluation of the SR reconstruction results: (1) the peak signal-to-noise ratio (PSNR) measures the pixel-wise similarity between the reconstructed image and the HR reference image, (2) the structural similarity index measure (SSIM) [46] considers the local structural similarity between the two images, and (3) the perceptual index (PI) [20] favors high-frequency details and is a no-reference metric, unlike reference metrics such as PSNR and SSIM.
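For reference, the PSNR metric used throughout the tables can be computed as follows; this is the standard definition for 8-bit images, not code from the paper.

```python
import numpy as np

def psnr(ref, img, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference HR image
    and a reconstruction, both arrays of the same shape. `peak` is the
    maximum pixel value (255 for 8-bit images)."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Higher PSNR indicates lower pixel-wise error; a 0.40 dB gain, as reported on FRGC v2.0, corresponds to roughly a 9% reduction in MSE.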

Learned depth maps
We expect the depth block of the network to learn the depth map from the original LR image. To validate this, we extract the output of the trained depth block. Figure 7 shows some examples of learned depth maps from the validation set, together with the ground-truth ones. The learned depth maps are reasonable and show structures similar to the ground truth in their geometric information. This demonstrates that the 3D shape of the face can be well learned and further integrated into the main SR task.

Quantitative evaluations
We compare the proposed method with the state-of-the-art SR networks in terms of PSNR, SSIM, and PI. We train the networks of VDSR [17], SRResNet [6], and RDN [47] with the same RGB images from the FRGC v2.0 and FFHQ datasets. We do not include some recently proposed networks for specific face SR [23,24] because they generally focus on visual quality instead of quantitative metrics. Table 2 summarizes the quantitative results. We also include a trimmed version of the proposed network with only the main SR block as a baseline for comparison. The results show that the modulation with adaptive geometric features leads to superior quantitative results over the baseline methods. All the evaluation metrics achieve significant improvements with the proposed method on the FRGC v2.0 dataset with matched depth scans; for example, the PSNR shows a 0.40 dB gain over the baseline network. Although the PSNR gain on the FFHQ dataset with synthetic depth is only 0.12 dB, the resulting PI achieves a considerable improvement, which indicates sharper edges as shown in Fig. 8.

Qualitative evaluations
We show two examples from the two datasets in Fig. 8, marked with individual evaluation metrics. While the state-of-the-art methods such as SRResNet and RDN and our trimmed SR network are competitive with each other on individual test samples, we find that the intervention of the depth map leads to almost uniformly superior results on the FRGC v2.0 dataset. The regions at the boundary of the face are remarkably sharper than those obtained by the other methods on both datasets. We attribute these improvements to the distinction of 3D coordinates by virtue of the learned geometric features. This shows that the proposed method has a positive effect on the face SR task, especially for sharper edges.

Ablation study
In this work, we propose the auxiliary depth and modulation blocks for the main SR task. We conduct two additional comparative experiments on the FRGC v2.0 validation set to demonstrate the effect of the auxiliary blocks. Since we also provide two ways to prepare the matched depth maps, an additional experiment with different supervisory depth maps is carried out as well. First, we exclude the auxiliary depth and modulation blocks to obtain a trimmed version of the proposed network. The purpose is to study the effect of the added network structure. We train the trimmed network with the same settings and report the quantitative results on the validation set in Table 3. Removing the depth and modulation blocks leads to a 0.44 dB drop in PSNR, which demonstrates that the added network structure has a positive effect on the final results.
Then, we retain the whole network but exclude the depth guided loss during training, which is equivalent to setting α = 0. This experiment aims to demonstrate the effectiveness of the guided geometric features independent of the network architecture. The result in Table 3 shows a 0.56 dB drop in PSNR without the guided loss even with the auxiliary blocks, which demonstrates that the supervisory depth actually contributes to the improvement of the face SR task. Finally, we conduct an additional experiment on the FRGC v2.0 dataset, in which the supervisory depth maps are the matched raw scans, the reconstructed ones, and the preprocessed ones from the raw scans, respectively. Table 4 shows the results in terms of PSNR for the different depth maps. The raw depth maps after preprocessing lead to superior results over both the raw and the reconstructed ones. This indicates that removing the outliers and filling the holes in raw depth scans is beneficial for robust training of the network. Although the reconstructed depth maps yield superior results over the baseline SR network, they are inferior to those from the matched depth scans, which may result from reconstruction errors with respect to the ground truth. Thus, the proposed method prefers raw depth maps from matched RGB-D scans.

Discussion
There are some remaining issues with the depth maps used to train the proposed network. First, the raw depth scans from current hardware devices usually contain a lot of noise and errors. In this paper, we use morphological image operations to suppress the noise, and we also use bicubic interpolation to fill the "holes". Although these operations largely clean the depth data, the depth scans are still far from the ground truth. This is likely a main factor limiting the performance of the proposed method; collecting a dataset with more advanced devices would allow training the proposed network for better performance. Second, the depth maps obtained by 3D face reconstruction methods are limited by the prior model, which lacks high-frequency details and cannot express novel structures outside the model. Combining the prior model and raw depth scans is a possible way to construct depth maps closer to the ground truth, which would promote better solutions for depth-guided face SR. Finally, the improved performance with the depth map comes at the cost of extra network architecture for the inference of the depth map. Better network architectures can be explored to incorporate the depth information into the SR task.
It is worth mentioning that the success of the proposed method, like most deep learning-based methods, is not achievable without a large amount of available data. The development of effective data mining [48][49][50] and computing technologies [51][52][53][54] will push face enhancement methods toward real-life applications. Also, the prevalence of portable RGB-D scanning devices (e.g., the iPhone X) will provide more data and platforms for these methods. In addition, this model may be combined with state-of-the-art GAN models for better visual performance on top of the improved quantitative performance. In the future, we will develop more effective ways to incorporate depth information into the SR task, as well as specific data processing methods for various applications of face SR.

Conclusion
In this paper, we propose to use adaptive geometric features to modulate the face SR task. The facial image is a special case of general images with limited geometric variations. We design a specific network structure to estimate the depth map from a facial image and then use it to produce adaptive geometric features that modulate the mid-level features for the main SR task. The supervisory depth map is either a matched one from RGB-D scans or one reconstructed by a 3D prior model of faces. The experiments demonstrate that the acquired SR results are superior to those of state-of-the-art works without depth guidance, especially with the help of real matched depth maps. We hope that the fast development and wide promotion of RGB-D cameras will lead to better solutions and applications for the face SR problem.