Computational Intelligence in Diagnosis and Prognosis of Gestational Diabetes using Deep Learning
Meenakshi K1, Maragatham G2
1SRM Institute of Science & Technology, Kancheepuram, Tamil Nadu, India, 603203
2SRM Institute of Science & Technology Kancheepuram, Tamil Nadu, India 603203
The medical and computational fields are intrinsically connected, and each has complemented the other's growth. Diabetes is a life-threatening disease; one type, gestational diabetes, usually occurs in women during pregnancy due to low insulin levels and usually disappears after pregnancy. SKLearn is a powerful computational tool for machine learning; to amplify the computational power and simplify the process, we have used the Keras interface. The model has 1000 neurons and predicts whether a woman will have diabetes after pregnancy.
Gestational Diabetes is a type of diabetes that occurs when a woman is carrying a child and is most common in the last months of pregnancy. It typically disappears once the child is born, but in many cases the woman is likely to have diabetes even after delivery. The risk is higher if there is a history of diabetes in the family, if the mother is overweight, if the mother has delivered a baby who weighed more than four kilograms at birth, if the mother has PCOS, and with other such factors. Its primary cause is low insulin levels in the body. It can be prevented by eating a wholesome, nutritious diet, exercising regularly, and monitoring blood sugar levels on a regular basis.
The field of medicine makes new discoveries every day thanks to computational advances, and it has become easier to predict illness from a patient's past health data. To take this a step further, we have used a Convolutional Neural Network to predict whether a woman with Gestational Diabetes is likely to develop Type II diabetes later in her life.
Nine variables are used, such as the number of times a woman has been pregnant, her age, her BMI, and her skin thickness. For higher efficiency, we have used a combination of tools comprising Keras, NumPy, Pandas, and SKLearn. All of these tools and APIs are compatible with Python, which makes it a common platform for integration.
Pandas is a freely available Python library whose foremost utility is examining data. It performs data wrangling: it cleans the raw data and converts it into a format more appropriate for analysis. NumPy is a fundamental package for scientific computing. Elements in NumPy are stored in multi-dimensional arrays, which can be modified using various functions: they can be reshaped, reduced, sorted, and used in arithmetic operations. The Scikit-learn tool helps in preprocessing the data; it normalizes the data to be fed into the neural network. Keras is one of the most effective and powerful neural network APIs; it is a high-level tool that lets a machine learning model be built with minimal code. It is widely used in complex deep learning. We use Keras for creating a dense sequential model and fixing the number of neurons in the Convolutional Neural Network.
1.1 Literature Survey:
Data mining techniques were used to analyze the precision of the data with the Tanagra tool [1]; the study concluded that the C4.5 Decision Tree algorithm in Tanagra is effective and produces nearly the same output, but that efficiency can be increased with a larger dataset. Another study shows that soft computing tools play an important role in disease classification [2] and help physicians diagnose diseases; such tools work better with fewer parameters. Diabetic Retinopathy (DR) is an illness in which high or poorly controlled blood sugar levels damage the eye and can lead to blindness [3]; automating DR screening on a web-based platform with techniques for processing retinal images helps reduce the workload.
Various algorithms, such as K-Means Clustering, the C4.5 Decision Tree, and other hybrid approaches, were used for predicting the disease [4]. The study found that hybrid approaches outperform the rest in terms of precision and processing time. A model has also been proposed for monitoring patients' health details to keep their diabetes under control [5]. It keeps all the data secure and in one place. It produces a pie chart and a data list of the patients' blood glucose levels and keeps a monthly log of the glucose levels. This data can further be used for processing and for predicting a patient's future outcomes and likelihood of having diabetes.
Various machine learning algorithms have been used to classify health care data, and the study explained the merits and demerits of the machine learning approaches [6]. An Artificial Neural Network was used to detect Diabetic Retinopathy; it produced better results in terms of sensitivity, specificity, and accuracy [7], while the turnaround time was also lower. Using the MNGHA dataset with 18 attributes, another study applied simple data mining algorithms such as Random Forest, SOM, and the Quinlan algorithm [8]. This model suggested a diabetes prediction system to identify symptoms at an early stage.
Detecting Diabetic Retinopathy with a CNN has proven to be one of the best techniques [9,10], but it performs better when the dataset is clean and the grading of images is easier. Traditional prediction models were compared with hybrid models, and it was concluded that hybrid models work better than traditional ones [11]. The Internet of Things (IoT) has seen a rise in applications. With IoT, diabetes management becomes easier [12]: it reduces cost, improves security, and is flexible to use and to integrate with other platforms. Automated Diabetic Retinopathy Image Assessment Systems (ARIAS) are a resource-effective way of predicting diabetic retinopathy [13]; it is concluded that neural networks provide more accurate predictions, but the ARIAS system is less complex.
Experiments were conducted on a dataset of more than 700 records. The model handled nine attributes for detecting diabetes with various classification algorithms using the WEKA tool [14]; of these, the REP tree was more accurate but consumed more processing time, whereas the decision stump was less accurate but faster. The authors further suggested combining algorithms with fuzzy sets for the prediction of unclear data.
2. RESEARCH APPROACH:
2.1 Data Collection:
The dataset has readings of 768 American Indian women who chiefly live in Arizona and Northern Mexico and are aged between 21 and 50 years. The dataset has 9 variables in total, of which 8 are independent variables used for prediction and the other is a binary dependent variable.
These variables are as follows:
1. Number of times pregnant (preg):
It is the number of times the woman has been pregnant. The values for this variable are in integer format and the mean preg value is 3.85.
2. Plasma Glucose Concentration (plas):
It measures the glucose levels in the blood based on the salivary glucose levels. The values for this variable are in integer format and the mean plas value is 121, with a range from 59 to 179.
3. Diastolic Blood Pressure (pres):
It measures the pressure in the arteries when the heart relaxes between beats. The values for this variable are in integer format and the mean pres value is 69.1. A diastolic blood pressure higher than 90 is risky for the body.
4. Triceps Skinfold Thickness (skin):
It is a way to estimate body fat and is usually measured on the triceps of the right arm. The values for this variable are in integer format and the mean skin value is 20.5.
5. Serum insulin (insu):
It is the most important factor indicating insulin resistance, which in turn is a major aspect of the development of PCOS. The values for this variable are in integer format and the mean insu value is 79.8.
6. Body mass index (bmi):
It is calculated as follows: body weight / (body height)^2. The values for this variable are real numbers and the mean bmi value is 32.
7. Diabetes Pedigree Function (pedi):
It gives a measure of the woman's family history of diabetes. The values for this variable are real numbers and the mean pedi value is 0.47.
8. Age (age):
It is the age of the woman in years. The values for this variable are in integer format and the mean age value is 33.2, with values ranging from 21 to 50 years.
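The variable list above can be sketched as a small record layout. This is only an illustration in plain Python: the column name "class" for the binary outcome, the helper names, the units (kilograms and metres) and the example row are assumptions made here, not values taken from the dataset.

```python
# Short column names follow the list above; "class" (an assumed name)
# stands for the binary dependent variable.
COLUMNS = ["preg", "plas", "pres", "skin", "insu", "bmi", "pedi", "age", "class"]


def parse_record(line):
    """Turn one comma-separated line into a dict keyed by column name."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


def body_mass_index(weight_kg, height_m):
    """BMI = body weight / (body height squared), assuming kg and metres."""
    return weight_kg / (height_m ** 2)


# Placeholder row for illustration only; it is not a record from the dataset.
example = parse_record("2,120,70,20,80,32.0,0.47,33,0")
print(example["bmi"], body_mass_index(80.0, 1.6))
```

Keeping the eight predictors and the outcome under fixed names makes the later split into inputs and output a matter of selecting columns.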
2.2 Data Training:
The packages used are NumPy, Pandas, Keras, and SKLearn. These packages are installed in Anaconda from the Anaconda prompt using the command conda install <library name>, for instance conda install keras. Alternatively, we can use the pip command. It is the PyPA-recommended tool for installing Python packages and acts as Python package management software. Once the packages are installed, we import them using the import command, for instance import keras.
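Before importing, it can help to confirm that the packages are actually present. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical, and the list assumes the usual import names (scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec

# Import names for the packages named above; scikit-learn imports as "sklearn".
REQUIRED = ["numpy", "pandas", "keras", "sklearn"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All required packages are available.")
```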
We start by seeding the random number generator using the function numpy.random.seed(). This helps us obtain repeatable results. We load the dataset using the function numpy.loadtxt(), whose parameters are the diabetes dataset file and the comma delimiter. The loaded data is converted into a tabular structure using the pandas function pd.DataFrame(dataset). We then split the data into two variables: X holds the 8 independent attributes and Y holds the dependent variable. The variable X is then standardized using the StandardScaler() function, which computes the mean and standard deviation of each attribute and rescales it to zero mean and unit variance for further calculations.
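The seeding, splitting and standardization steps can be illustrated without the libraries themselves. The sketch below uses Python's standard library in place of NumPy and scikit-learn so that the arithmetic is visible; the seed value, the random placeholder rows and the helper name standardize are assumptions, and the population standard deviation is used because that is what StandardScaler computes.

```python
import random
import statistics

random.seed(7)  # fixed (arbitrary) seed so repeated runs give the same numbers

# Placeholder data: 20 rows of 8 attributes plus a 0/1 label (not real records).
rows = [[random.uniform(0, 100) for _ in range(8)] + [random.randint(0, 1)]
        for _ in range(20)]

X = [row[:8] for row in rows]   # independent (input) attributes
Y = [row[8] for row in rows]    # dependent (output) label


def standardize(matrix):
    """Scale each column to zero mean and unit variance."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in matrix]


X_scaled = standardize(X)
first_column = [row[0] for row in X_scaled]
print(statistics.fmean(first_column), statistics.pstdev(first_column))
```

After scaling, every attribute contributes on a comparable scale, so attributes with large raw values such as insulin do not dominate the weighted sums in the network.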
We then build a sequential model with dense layers. The input has 8 parameters, and the first and second hidden layers have 1000 and 500 neurons respectively, with the Rectified Linear Unit (ReLU) activation function R(x) = max(0, x), i.e. R(x) = 0 if x < 0 and R(x) = x if x >= 0, where x is the input to the neuron. This function is typically used in the hidden layers of a neural network. The output layer has one neuron with the S-shaped sigmoid activation function f(x) = 1 / (1 + exp(-x)), whose values range from 0 to 1.
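To make the layer structure concrete, here is a minimal forward pass through a network of the stated shape (8 inputs, hidden layers of 1000 and 500 neurons, one sigmoid output), written in plain Python rather than Keras. The weights are random and untrained, and the helper names, weight scaling and input values are assumptions for illustration, so the output is not a meaningful prediction.

```python
import math
import random


def relu(x):
    """Rectified Linear Unit: R(x) = max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function: f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . inputs + b) per neuron."""
    return [activation(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_layer(rng, n_in, n_out):
    """Untrained placeholder weights, scaled down to keep the sums small."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [8, 1000, 500, 1]   # input, two hidden layers, output
layers = [random_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

# Placeholder standardized input with 8 attributes (not a real record).
sample = [0.5, -1.2, 0.3, 0.0, 1.1, -0.4, 0.8, -0.6]
hidden1 = dense(sample, *layers[0], relu)
hidden2 = dense(hidden1, *layers[1], relu)
output = dense(hidden2, *layers[2], sigmoid)[0]
print(len(hidden1), len(hidden2), output)
```

The sigmoid output can be read as the estimated probability of the positive class, which is why it is later rounded to 0 or 1.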
The model is then compiled using the model.compile() function. The loss parameter is required for compilation. The other required parameter is the optimizer. We have used the Adam optimizer, which is an extension of the stochastic gradient descent algorithm. Adam maintains a separate learning rate for each parameter and adapts it as training proceeds. It is effective in producing good results rapidly. The metric is set to accuracy, as we are looking for a highly accurate prediction.
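The way Adam adapts its step size can be sketched on a toy problem. The code below applies the standard Adam update rule to a single parameter; the toy objective, the learning rate of 0.1 and the function names are assumptions chosen for illustration and are not settings reported in this work.

```python
import math


def adam_minimize(grad, x0, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-parameter function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)   # step scaled by gradient history
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); its minimum is at x = 3.
def gradient(x):
    return 2.0 * (x - 3.0)


after_one_step = adam_minimize(gradient, x0=0.0, steps=1)
after_many_steps = adam_minimize(gradient, x0=0.0, steps=500)
print(after_one_step, after_many_steps)
```

In a network, m and v are kept separately for every weight, which is what gives each parameter its own effective learning rate.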
2.3 Data Testing:
To train the model, we fit it using the model.fit() function. Its parameters are the input variable X; the output variable Y; nb_epoch, the number of epochs, where one epoch is one full pass of forward and backward propagation over the training data; the batch size, which is the number of records processed per weight update (here we set it to 10, so 10 records are trained at a time); and verbose, which controls how much progress information is printed during training. The model is then evaluated and the evaluation values are printed.
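The relationship between records, batch size and epochs can be checked with a short calculation. This sketch only uses the two figures given in the text (768 records, batch size 10); the helper name iterate_batches is hypothetical.

```python
import math


def iterate_batches(n_records, batch_size):
    """Yield (start, stop) index ranges covering the records once, i.e. one epoch."""
    for start in range(0, n_records, batch_size):
        yield start, min(start + batch_size, n_records)


n_records, batch_size = 768, 10
batches = list(iterate_batches(n_records, batch_size))

updates_per_epoch = math.ceil(n_records / batch_size)
last_start, last_stop = batches[-1]
print(updates_per_epoch, last_stop - last_start)
```

With these figures each epoch consists of 77 weight updates, the last of which uses the remaining 8 records.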
The data set is normalized and the predicted values are rounded. The details of the rounded values, such as their type, shape, and length, are printed. The rounded values are then added as a new column to a new dataset using column stacking. The new dataset is then converted into a tabular structure using the pandas library and exported as an xlsx document. We then display a confusion matrix of actual and predicted values to see the level of accuracy in the testing and training phases. We noticed that the denser the neural network, the more accurate the result and the better the prediction.
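The rounding and confusion matrix steps can be sketched as follows. This is an illustration in plain Python under stated assumptions: the 0.5 threshold, the helper names, and the labels and probabilities are placeholders and are not results from this study.

```python
def round_predictions(probabilities, threshold=0.5):
    """Convert sigmoid outputs in [0, 1] into 0/1 class labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


def accuracy(counts):
    """Fraction of records on the diagonal of the confusion matrix."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Placeholder values for illustration; they are not results from the paper.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
probabilities = [0.91, 0.12, 0.40, 0.77, 0.55, 0.08, 0.63, 0.30]
predicted = round_predictions(probabilities)
counts = confusion_matrix(actual, predicted)
print(counts, accuracy(counts))
```

Accuracy alone hides the difference between false positives and false negatives, which is why the full matrix is worth reporting for a diagnostic task.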
Fig. 1. Comparison of Accuracy based on various algorithms
Fig. 2. Comparison of Accuracy based on Number of Neurons in CNN
The CNN algorithm using the Keras tool has proven to be the most effective when compared with logistic regression, support vector, Naive Bayes, K-Means, and classification. Logistic regression models a binary outcome from one or more independent variables. Support Vector uses a separating hyperplane for classification. Naive Bayes uses the following rule: Posterior Probability = (Likelihood * Class Prior Probability) / Predictor Prior Probability. Figure 1 shows the accuracy percentage of the various algorithms.
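The Naive Bayes rule quoted above can be worked through with a small calculation. The function name and the probabilities below are placeholders chosen for illustration; they are not estimates from the dataset.

```python
def posterior(likelihood, class_prior, predictor_prior):
    """Bayes' rule: P(class | x) = P(x | class) * P(class) / P(x)."""
    if predictor_prior <= 0:
        raise ValueError("predictor prior probability must be positive")
    return likelihood * class_prior / predictor_prior


# Placeholder probabilities for illustration only.
p = posterior(likelihood=0.6, class_prior=0.35, predictor_prior=0.42)
print(round(p, 3))
```

A Naive Bayes classifier computes this posterior for each class and predicts the class with the larger value.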
Using the Keras tool, we applied the Convolutional Neural Network algorithm to the dataset to predict post-pregnancy diabetes more accurately. With 768 records, the denser our network gets, the more accurate the results. Figure 2 shows the relationship between the number of neurons and accuracy. With 10 neurons the accuracy is 81.38%; with 50 neurons it increases to 85.2%; with 100 neurons it reaches 86.99%; and with 500 and 1000 neurons it reaches 95.41% and 100% respectively.
After the data is normalized and the model is trained, we created a rounded data set with the following attributes -
Rounded type: <class 'list'>
Shape of rounded: 768
Dataset type: <class 'numpy.ndarray'>
Shape of dataset: (768, 9)
Rounded type: <class 'numpy.ndarray'>
768/768 [=====================] - 0s 41us/step
The final accuracy is shown in the output as follows -
A confusion matrix is used to assess a classifier's predicted values. The matrix consists of four entries: true positives, true negatives, false positives, and false negatives.
We then display a confusion matrix of the actual values and the predicted values in the following form.
Table 1. Confusion Matrix
Detection and prevention of gestational diabetes are important. Once it is detected, special care needs to be taken to prevent it from turning into Type II Diabetes in the later stages of life. This CNN model presents an accurate measure of the data. The proposed work concluded that the denser the model, the more accurate the result. We can further expand our research into other medical domains for the early prediction of diseases. It can include a more diverse population, and the size of the dataset can be increased. With the growth in technology, factors such as urine tests, hemoglobin tests, and menstrual cycle frequency can also be taken into consideration. To analyze the data further and inspect more information, we can use unstructured data and integrate image processing systems.
5. AUTHOR CONTRIBUTIONS:
All authors (Meenakshi and Maragatham) contributed to the development and writing of the manuscript. All authors read and approved the final manuscript.
6. COMPLIANCE WITH ETHICAL STANDARDS:
Disclosure of potential conflicts of interest:
Ms. Meenakshi declares that she has no conflict of interest. Dr. Maragatham declares that she has no conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.
1. P. Hema, K. Palanivel, “A study on Prediction of Diabetic Disorder Using Classification Based Approaches”, International Journal of Computer Science and Mobile Computing, 2018; 7(1): 53-60.
2. P. Suresh Kumar, V. Umatejaswi, “Diagnosing Diabetes using Data Mining Techniques”, International Journal of Scientific and Research Publication, 2017; 7(6): 33-40.
3. Adnan Tufail, Caroline Rudisill, “Automated Diabetic Retinopathy Image Assessment Software”, American Academy of Ophthalmology, 2017; 124(3): 343-351.
4. Sankalp Deshkar, Thanseeh R.A., Varun G. Menon, “A review on IoT based m-Health Systems for Diabetes”, International Journal of Computer Science and Telecommunications, 2017; 8(1): 13-18.
5. N. Jayanthi, B. VijayaBabu, N. Sambasiva Rao, “Survey on Clinical Prediction Models for Diabetes Prediction”, Journal of Big Data, 2017; 4-26.
6. Meenakshi K, Safa M, Karthick T, Sivaranjani N, “A novel Study of Machine Learning algorithms for classifying health care data”, 2017; 10(5): 1429-1432.
7. Harry Pratt, Frans Coenen, Deborah M Broadbent, “Convolutional Neural Networks for Diabetic Retinopathy”, Elsevier – Procedia Computer Science, 2016; 90: 200-205.
8. Tahani Daghistani, Riyad Alshammari, “Diagnosis of Diabetes by Applying Data Mining Classification Techniques”, (IJACSA) International Journal of Advanced Computer Science and Applications, 2016; 7(7): 329-332.
9. Weeagul Pratumgul, Worawat Sa-ngiamvibool, “The Prototype of Computer-Assisted for Screening and Identifying Severity of Diabetic Retinopathy Automatically from Color Fundus Images for mHealth System in Thailand”, Elsevier – Procedia Computer Science, 2016; 86: 457-460.
10. Dr. P.S. Jagadeesh Kumar, Ms. A.S. Chaithra, “A survey on Cloud Computing based Health Care for Diabetes: Analysis and Diagnosis”, IOSR Journal of Computer Engineering (IOSR-JCE), 2015; 17(4): 109-117.
11. Dr. T. Karthikeyan, K. Vembandadsamy, “An Analytical Study on Early Diagnosis and Classification of Diabetes Mellitus”, International Journal of Computer Application, 2015; 5(5): 96-104.
12. Jose Tomas Arenas-Cavalli, Sebastian A Rios, Mariano Pola, Rodrigo Donoso, “A Web-Based Platform for Automated Diabetic Retinopathy Screening”, Elsevier – Procedia Computer Science, 2015; 60: 557-563.
13. Saliman Rani, Dharminder Kumar, “A case study on Soft Computing Techniques Used for Diabetes Mellitus”, International Journal of Advanced Research in Computer Science and Software Engineering, 2014; 4(7): 1-5.
14. Divya Jain, Sumanlata Gautam, “Predicting the Effect of Diabetes on Kidney using Classification in Tanagra”, International Journal of Computer Science and Mobile Computing, 2014; 3(4): 535-542.
Accepted on 17.05.2019 © RJPT All rights reserved
Research J. Pharm. and Tech 2019; 12(8):3891-3895.