Diabetes prediction model based on an enhanced deep neural network

Today, diabetes is one of the most common chronic diseases in the world and, because of its complications, one of the deadliest. Early detection of diabetes is very important for timely treatment because it can stop the progression of the disease. The proposed method can help not only to predict the future occurrence of diabetes but also to determine the type of the disease that a person experiences. Because the treatment methods for type 1 diabetes and type 2 diabetes differ in many respects, this method will help to provide the right treatment for the patient. We cast the task as a classification problem and build the model mainly from the hidden layers of a deep neural network, with dropout regularization to prevent overfitting. We tuned a number of parameters and used the binary cross-entropy loss function, which yielded a deep neural network prediction model with high accuracy. Extensive experiments were conducted on the Pima Indians diabetes and diabetes type datasets. The experimental results show the effectiveness and adequacy of the proposed DLPD (Deep Learning for Predicting Diabetes) model. The best training accuracy on the diabetes type dataset is 94.02174%, and the training accuracy on the Pima Indians diabetes dataset is 99.4112%. The experimental results show improvements of the proposed model over state-of-the-art methods.


Introduction
According to a report by the International Diabetes Federation in 2017 [1], there were 425 million diabetics in the world at the time, and it was also concluded that the number will increase to 625 million by 2045 [2]. Diabetes mellitus is a group of endocrine diseases associated with impaired glucose uptake that develops as a result of the absolute or relative insufficiency of the hormone "Insulin." The disease is characterized by a chronic course, as well as a violation of all types of metabolism. Generally, diabetes is classified into four categories [3]: type 1 diabetes, type 2 diabetes, gestational diabetes mellitus, and specific types of diabetes due to other causes. The most common types of the disease are the following two: type 1 diabetes (T1D) and type 2 diabetes (T2D). The former is caused by the destruction of the pancreatic beta cells, resulting in insulin deficiency, while the latter is due to the ineffective transportation of insulin into cells.
Both types of the disease can lead to life-threatening complications, such as strokes, heart attacks, chronic renal failure, diabetic foot syndrome, angiopathy, neuropathy, encephalopathy, hyperthyroidism, adrenal gland tumors, cirrhosis of the liver, glucagonoma, transient hyperglycemia, and many others. Hence, the prediction [4] and early detection [5] of diabetes are essential for all people who are predisposed to the disease. Currently, several diseases can be diagnosed using artificial intelligence (AI) techniques, and deep neural networks (DNNs) [6] have achieved the best performance in classification problems. In recent years, DNNs have been used for diagnosing various diseases. In the absence of diabetes, the pancreas works normally and produces enough insulin. As soon as insulin binds to receptors on the surface of the cell, the entry for the glucose molecule into the cell opens as well. With T1D, the pancreas gradually stops producing insulin, which accordingly disrupts the process of glucose delivery to cells. T2D is not caused by the pancreas being unable to produce insulin. There is enough insulin and glucose, but the insulin receptors that allow insulin to enter into cells have lost their ability to respond to insulin. The processes through which cells absorb glucose in people who are healthy, have type 1 diabetes, or have type 2 diabetes are shown in Fig. 1.
Deep learning is a new research direction in the field of machine learning [7,8], and in recent years, it has achieved breakthrough progress in speech recognition and computer vision applications. The neural network was originally developed from the perceptron. The difference between the two is that the neural network adds multiple hidden layers [9]. As the network deepens, the level of the learned features increases, which enhances the expressive ability of the model. The output layer can have multiple outputs, and the model can be flexibly applied to classification, regression, dimensionality reduction, and clustering. The activation function of the perceptron is sign(z), which is simple but has limited processing capacity. Deep neural networks generally use the sigmoid, softmax, tanh, ReLU, softplus, and other activation functions, which add nonlinear factors to improve the expressive ability of the model.
The deep neural network [10,11] is an extension of the perceptron, and sometimes it is called the multilayer perceptron (MLP). According to the location of the different layers, the layers of the DNN can be divided into three categories: the input layer, hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the middle layers are the hidden layers. For the problems in different fields, people have developed different deep neural networks, such as the convolutional neural network (CNN) and the recurrent neural network (RNN). Due to the limitation of the parameters and the mining of the local structure, the CNN model is suitable for image recognition and speech recognition. The RNN can be viewed as a neural network that transmits over time, and its depth is the length of time. The RNN is applicable to natural language processing and handwriting recognition because the chronological order in which samples appear is important in these areas.
Fig. 1 Processes of glucose entering cells. This figure describes three different processes of glucose entering cells. The first process is the normal process, in which a person does not have diabetes and the pancreas produces enough insulin. The second process is that of a patient with type 1 diabetes. With type 1 diabetes, the pancreas gradually stops producing insulin, which accordingly disrupts the process of glucose delivery to cells. The third process shows the consequences of type 2 diabetes, which is not caused by the pancreas being unable to produce insulin. There is enough insulin and glucose entering the organism, but the insulin receptors that allow insulin to enter cells have lost the ability to respond to insulin. Both types are very dangerous and cause a long list of complications and illnesses.
The layers are fully connected; that is, each neuron in layer i is connected to all neurons in layer i+1. The forward propagation algorithm in the DNN starts from the input layer. It uses the input vector x, a number of weight coefficient matrices W, and an offset (bias) vector b to perform a series of linear operations and activation operations. The output of each layer is used to compute the output of the next layer, and this proceeds layer by layer until the output layer produces the final result. In the forward propagation process, the input information is processed by the hidden layers and then transmitted to the output layer.
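The forward pass described here can be illustrated with a minimal sketch in plain Python. The layer sizes, the weight values, and the helper names `dense_forward` and `forward` are assumptions chosen only for illustration; they are not the configuration of the proposed model.

```python
import math

def sigmoid(z):
    """Logistic activation, used here only as an example nonlinearity."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(x, W, b, activation):
    """One fully connected layer: a_j = activation(sum_i W[j][i] * x[i] + b[j])."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def forward(x, layers):
    """Feed x through a list of (W, b, activation) layers, from input to output."""
    a = x
    for W, b, activation in layers:
        a = dense_forward(a, W, b, activation)  # output of one layer feeds the next
    return a

# Hypothetical 2-3-1 network with made-up weights (not taken from the paper)
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1], sigmoid),
    ([[1.0, -1.0, 0.5]], [0.2], sigmoid),
]
output = forward([1.0, 2.0], layers)
```

With a sigmoid on the last layer, `output` is a single value between 0 and 1 that can be read as a class probability.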
Without the activation function, the output of each layer is a linear function of the output of the previous layer. No matter how many layers the neural network has, the output is then only a linear combination of the input, and the effect is the same as having no hidden layers. The DNN introduces the activation function to avoid this collapse into a single-layer linear function, improve the expressive ability of the model, and make the model more discriminative. An activation function usually has the following properties: nonlinearity, differentiability, and monotonicity.
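The collapse of stacked linear layers can be checked with a short derivation for two layers. The symbols W_1, b_1, W_2, b_2, W', and b' are introduced here only for illustration:

```latex
\begin{aligned}
h &= W_1 x + b_1 \\
y &= W_2 h + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = W' x + b'
\end{aligned}
```

The two layers are therefore equivalent to a single linear layer with weights W' = W_2 W_1 and offset b' = W_2 b_1 + b_2, and the same argument applies to any number of layers.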
A vanishing gradient occurs in the back propagation algorithm because, by the chain rule, the gradient of a front layer is a product of the gradients of all the layers behind it. When a neural network uses the sigmoid (S-shaped) activation function, due to its saturation characteristic, once its input exceeds a certain magnitude the output no longer changes significantly, and the derivative tends to 0. When the neural network uses the gradient to update the parameters, if the factor contributed by each layer is less than 1, the product becomes increasingly small, and the error gradient reaching the front layers decreases to almost zero; thus, the parameters in those layers cannot be updated effectively.
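A small numerical sketch makes the effect concrete. It assumes a chain of ten sigmoid units with all weights set to 1, which is a simplification made only for illustration; `chained_gradient` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def chained_gradient(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights taken as 1)."""
    g = 1.0
    for z in pre_activations:
        g *= sigmoid_grad(z)
    return g

best_case = chained_gradient([0.0] * 10)  # every factor at its maximum of 0.25
saturated = chained_gradient([5.0] * 10)  # every unit in the saturated region
```

Even in the best case the product is 0.25 raised to the tenth power, which is below one millionth, and it is far smaller once the units saturate.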
The general rule for choosing an activation function and a loss function in a DNN is that if the output of the neuron is linear, the square loss function is a suitable choice, and if the output neuron uses the sigmoid (S-shaped) activation function, the cross-entropy loss function is the suitable choice. The combination of the softmax activation function and the logarithmic likelihood loss is similar to the combination of the sigmoid function and the cross entropy; therefore, the sigmoid activation function and cross entropy are generally used for a binary classification output, and the softmax activation function and the logarithmic likelihood loss are used for a multiclass classification output.
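The pairing of a sigmoid output with the cross-entropy loss can be sketched as follows. The labels and pre-activation values are made up, and the helper names are assumptions for illustration. One commonly cited reason for the pairing is that the gradient of the loss with respect to the pre-activation z reduces to p - y, so it does not shrink when the sigmoid saturates.

```python
import math

def sigmoid(z):
    """Logistic (S-shaped) activation."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def loss_gradient_wrt_logit(y, z):
    """d(loss)/dz for one sample when p = sigmoid(z): simply p - y."""
    return sigmoid(z) - y

labels = [1, 0, 1]
logits = [2.0, -1.0, 0.5]  # made-up pre-activation values
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(labels, probs)
```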
The DNN usually uses dropout regularization. Dropout (random inactivation) works as follows. In the training process, the training data are divided into several batches. When the DNN performs a gradient descent iteration on one batch, it temporarily and randomly discards a part of the hidden-layer neurons, fits that batch of training data with the remaining thinned network, and updates all the weights and biases (W, b). Before the iteration on the next batch, the DNN model is restored to the original fully connected model; then a new random set of hidden-layer neurons is removed, and the weights and biases are again updated iteratively.
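The random inactivation step can be sketched in a few lines. This sketch assumes the common "inverted dropout" variant, in which the surviving activations are rescaled by 1/(1 - p) so that their expected value is unchanged; the text does not state which variant is used, and the function name and the rate p = 0.5 are illustrative.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each unit with probability p during training; rescale the survivors.

    Requires 0 <= p < 1. At inference time the activations pass through unchanged.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
hidden = [1.0] * 1000           # made-up hidden-layer activations
dropped = dropout(hidden, p=0.5, rng=rng)
```

A fresh random mask is drawn on every call, which corresponds to restoring the full network and removing a new set of neurons before each batch.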
This work intends to present a novel deep neural network-based model (diabetes type prediction model) for diabetes prediction and the determination of the possible types of the disease in the future.
The main contributions of this paper are as follows: (i) We propose a diabetes risk prediction model, DLPD, which can not only predict whether someone will have this disease in the future but also determine the type of disease that the person may have: T1D or T2D. (ii) We add normalization layers and dropout regularization to the model so that it does not merely memorize the results from the training data and can handle data that it has not seen. During training, at each update, dropout randomly sets a fraction p of the input units to 0. The main purpose of dropout regularization is to prevent overfitting. (iii) We choose the binary cross-entropy function as the loss function in this model, and the hyper parameters of the model are tuned. The window for accumulating past gradients is limited to a fixed size ω. The sum of the gradients is recursively defined as a decaying average of all past squared gradients. The running average at time step t then depends (as a fraction γ, similarly to the momentum term) only on the previous average and the current gradient. (iv) The experimental results show the effectiveness and adequacy of the proposed DLPD model: the training accuracies on the diabetes type dataset and on the Pima Indians diabetes dataset are both high.

Related work
Health care is one of the most important areas that societies should develop through science [12,13] and technology [14,15]. Deep learning methods [16][17][18] are powerful tools that complement traditional machine learning and allow computers to learn from data [19] so that they can come up with ways to create smarter applications [20], to process electronic health records, and to use computational vision for clinical imaging and genomics. A novel deep learning approach for the detection of type 2 diabetes was proposed by Mohebbi et al. in [21], where the authors showed that it was possible to use CGM signals to detect T2D patients. To address the challenges of using DL techniques in healthcare today, the authors of [22] focused their discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. The opportunities and obstacles for deep learning-based methods in biology and medicine have also been discussed: deep learning can match or surpass the previous state-of-the-art methods on a diverse array of tasks in patient and disease categorization, fundamental biological studies, genomics, and treatment development [23]. Li et al. [24] demonstrated ModelHub for deep learning lifecycle management, which includes the following components: a novel model versioning system (dlv), a domain-specific language for searching the model space (DQL), and a hosted service (ModelHub) to store learned models, explore existing models, and share models. Kim et al. [25] proposed a deep network structure using SVMs with CPONs to provide adequate structural depth and robust classification accuracy. The authors of [27] presented deep learning as an effective approach for predicting hospital readmissions among diabetic patients. Pham et al. [28] presented another deep learning-based approach to predict healthcare trajectories using the medical records of patients. Bae et al. [29] studied predicting the risks of type 2 diabetes using common and rare variants. Kannadasan et al. [30] proposed a new model for type 2 diabetes data classification. A deep neural network (DNN) was built using stacked autoencoders cascaded with a softmax classifier, and it achieved a classification accuracy of 86.26%. Zhu et al. [31] proposed a convolutional neural network (CNN) model to forecast the future glucose levels of patients with type 1 diabetes. The model was a modified version of WaveNet, which was very useful for acoustic signal processing. Kowsher et al. [32] proposed a deep neural network and machine learning classifier using performance measures such as accuracy and precision to determine the best deep neural network algorithm. Soniya et al. [33] proposed joining a hybrid evolutionary approach with a convolutional neural network (CNN) and determined the number of layers and filters based on the application and user needs. Ramazi et al. [34] developed a wide and deep neural network and used data from demographic information, lab tests, and wearable sensors to create the model. Alharbi et al. [35] proposed a hybrid algorithm, the GA-ELM algorithm, which diagnosed type 2 diabetes patients and classified the dataset with an accuracy of 97.5% using six effective features out of the original eight features given in the dataset. The use of machine learning and deep learning algorithms for diabetes prediction, as well as comparisons of the algorithms and established models for diabetes prediction, have been addressed by some of these related works. However, the prediction algorithms could predict only whether or not the disease is likely to occur in the future. In this research, we propose a novel DNN-based model for diabetes risk prediction and the determination of the specific type of the disease, T1D or T2D, which can occur in the future.

Methods
The inspiration for DL models is rooted in the functioning of biological nervous systems. These models are not new: their roots trace back to the McCulloch-Pitts (MCP) model, which is considered the ancestor of the artificial neural model. Such models have now gone mainstream because of their many practical applications and the availability of consumable technology and affordable hardware. Table 1 shows the data needed to predict diabetes, as well as the descriptions and values of these attributes. The proposed model is tested on two datasets: the Pima Indians diabetes data from the UCI repository [36] and the diabetes type dataset based on blood sugar, plasma glucose, and HbA1c (glycated hemoglobin) from the Data World repository [37] (Table 2). The DLPD model consists of the following parts: (i) data preprocessing, (ii) building and training the DLPD model, (iii) adding dropout regularization to address overfitting, and (iv) hyper parameter tuning. They are described as follows (Tables 3, 4, and 5):

Data preprocessing
Data preprocessing is necessary to prepare the diabetes type data and the Pima Indians data in a form that a deep learning model can accept. First, all of the data were shuffled. The dataset was then divided into training, validation, and test data: the training data contain 70% of the total dataset, and the test and validation data contain 15% each. Separating the training and testing datasets ensures that the model learns only from the training data and that its performance is tested on data it has not seen.
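The shuffle and the 70/15/15 split can be sketched as follows. The function name, the fixed seed, and the use of integer record indices in place of real patient records are assumptions made for illustration; 768 is the commonly cited number of records in the Pima Indians dataset and serves here only as a placeholder size.

```python
import random

def shuffle_and_split(records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle the records, then cut them into training, validation, and test parts."""
    rows = list(records)
    random.Random(seed).shuffle(rows)      # shuffle first, with a repeatable seed
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]          # the remainder, roughly 15%
    return train, val, test

train, val, test = shuffle_and_split(range(768))
```

Because the three parts are slices of one shuffled list, no record can appear in more than one of them.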

Building and training the DLPD model
The construction of a deep learning model includes three types of layers.
The input layer is the layer to which the features of the datasets are passed. No computation occurs in this layer; it serves only to pass the features to the hidden layers. The hidden layers are the layers between the input layer and the output layer. There can be any number of hidden layers, not only one. These layers perform the computations and pass the information to the output layer at the end. The output layer is the final layer of our neural network. It gives the results of the newly created model after training and is responsible for producing the output variables.
The input layers represent the particular input ports of the network. The regular densely connected layers, each with its own output dimension, use softmax activation and linear activation together with a weight initialization function. The activation layers apply an activation function to an output. Batch normalization layers are used to normalize the activations of the previous layer at each batch. The applied transformation keeps the mean activation close to 0 and the activation standard deviation close to 1.
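The normalization applied by a batch normalization layer can be sketched for a single feature. This simplified version omits the learnable scale and shift parameters and the running statistics that practical implementations keep for inference; the function name, the constant `eps`, and the sample values are assumptions for illustration.

```python
import math

def batch_normalize(batch, eps=1e-5):
    """Normalize one feature's activations across a batch to mean ~0 and std ~1."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]   # made-up activations of one unit over a batch
normalized = batch_normalize(activations)
```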

Adding dropout regularization to fight overfitting
Predictive models often face a problem known as overfitting. Overfitting shows up as a very large difference between the training accuracy and the accuracy on unseen data: the model memorizes the results from the training data and cannot be applied to data that it has not seen. To help fight overfitting in our model, we added normalization layers and dropout to the model. During training, at each update, dropout randomly sets a fraction "p" of the input units to 0. The main purpose of dropout regularization is to prevent overfitting.

Hyper parameter tuning
Unlike classical machine learning models, deep learning models are full of hyper parameters, which include the variables that determine the network structure (e.g., the number of hidden units) and the variables that determine how the network is trained (e.g., the learning rate). We set the number of epochs (the number of times the whole training data is shown to the network while training) to 10. The batch size is the number of samples given to the network before a parameter update happens, and we set the batch size to 32. The binary cross entropy was selected as the loss function. In multilabel problems, where one example can belong to multiple classes at the same time, the model tries to determine for each class whether the example belongs to that class or not. Binary cross entropy measures how far the prediction is from the true value for each of the classes and then averages these classwise errors to obtain the final loss.
AdaDelta is an extension of AdaGrad. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to some fixed size ω. Instead of inefficiently storing ω previous squared gradients, the sum of the gradients is recursively defined as a decaying average of all past squared gradients. The running average E[g^2]_t at time step t then depends (as a fraction γ similar to the momentum term) only on the previous average and the current gradient g_t:
$$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$$
Then, γ is set to a similar value as the momentum term, which is approximately 0.9. For clarity, we now rewrite our vanilla SGD update in terms of the parameter update vector Δθ_t, where η is the learning rate:
$$\Delta\theta_t = -\eta\, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t$$
The parameter update vector of AdaGrad takes the following form, where G_t is the diagonal matrix of accumulated squared gradients and ε is a small constant that avoids division by zero:
$$\Delta\theta_t = -\frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$
Now, we simply replace the diagonal matrix G_t with the decaying average over past squared gradients E[g^2]_t:
$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$
Since the denominator is just the root mean squared (RMS) error criterion of the gradient, it is replaced with the short-hand criterion:
$$\Delta\theta_t = -\frac{\eta}{RMS[g]_t}\, g_t$$
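The running average and the resulting update can be sketched for a single scalar parameter. This is a sketch of the RMS-scaled form reached in the text, which still contains a learning rate; the full AdaDelta rule goes one step further and replaces the learning rate with a running RMS of past parameter updates. The function name, the learning rate of 0.01, the constant `eps`, and the toy objective f(θ) = θ² are assumptions for illustration.

```python
import math

def rms_scaled_step(theta, grad, avg_sq_grad, lr=0.01, gamma=0.9, eps=1e-8):
    """One update: decay the average of squared gradients, then scale the step by its RMS."""
    avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * grad ** 2  # E[g^2]_t
    delta = -lr / math.sqrt(avg_sq_grad + eps) * grad              # update vector
    return theta + delta, avg_sq_grad

# Toy objective f(theta) = theta^2 with gradient 2 * theta
theta, avg = 5.0, 0.0
for _ in range(2000):
    theta, avg = rms_scaled_step(theta, 2.0 * theta, avg)
```

Only one running value per parameter is stored, regardless of how many past gradients contribute to the average.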

Visualizing loss and accuracy
Visualization of the performance of a deep learning model is an easy way to make sense of the data output by the model and to make informed decisions about changes to the parameters or hyper parameters that affect the model. The plots provided in Fig. 3 show the training loss/accuracy and validation loss/accuracy. From the accuracy plot, we can see that the model achieved high training accuracy on both datasets. We can also see that the model has not yet overfit the training dataset, showing comparable results for both datasets. The experimental results of this work have reached a very good ratio between the accuracy and loss. The experiments showed that our proposed model can perform well on different types of data. The proposed model not only can predict if a person will be diabetic in the future but also can determine and predict the specific type of the disease, type 1 or type 2. Table 6 shows the prediction results of the proposed model on the DT dataset. From Table 7, we can conclude that the final performance of the proposed model on the Pima Indians dataset is close to 100%. Table 8 presents the different performance values of the proposed DNN-based model on the basis of the classified instances. The main difference is that related works presented various methods and models based on DL and ML for predicting diabetes only, whereas our proposed model predicts diabetes and determines the possible type of the disease that can occur in the future. Therefore, it is not entirely fair to compare the accuracy measures of the methods. A deep cognition platform can provide a WebApp/API from a trained DNN model. To run inference with the proposed model, a user only needs to enter data such as the following: age, plasma R, plasma F, BSpp, HbA1c, and BS Fast. The submitted data will be analyzed, and the user will receive a personal prediction together with its probability as a percentage.
Fig. 3 Visualization of the performance of the trained model. This figure describes the results of the finished experimental work. Four plots are drawn for the following: the training accuracy, the training loss, the validation accuracy, and the validation loss. All of them describe the results achieved during the experiments. The main goals were to achieve high accuracy and the lowest loss.

Conclusion
The key to finding the right treatment for diabetes is to detect the disease at an early stage. In the present study, a new DNN model (the DLPD model) was proposed for diabetes type prediction. The deep neural network was pretrained with 2 datasets, each containing more than a thousand records. The number of epochs for the training phase was low, which allows the method to run quickly, even on a mobile platform. The experimental results show the effectiveness and adequacy of the proposed DLPD model. The best training accuracy for the diabetes type dataset was 94.02174%, and that for the Pima Indians diabetes dataset was 99.4112%. A future study will focus on improving the model so that it determines all possible complications, ordered by the estimated percentage likelihood of each complication occurring. The work can also be extended and improved for automated diabetes analysis by including other deep learning algorithms and techniques. The amount of data that the model can handle is high. In hyper parameter tuning, because training too many parameters can easily result in overfitting, it is also possible to modify only the last output layer. If the new data differ too much from the original dataset, half of the layers can be tuned after the output of the top layer has been fine-tuned.