Diabetes Prediction Using Machine Learning

Aman Preet Gulati 17 Mar, 2024 • 10 min read

This article was published as a part of the Data Science Blogathon.

Overview

In this article, we’ll delve into predicting whether a patient might have diabetes or not. We’ll achieve this by feeding certain features into our machine learning model. For this task, we’ll tap into the renowned Pima Indians Diabetes Database, which holds valuable information aiding in diabetes prediction using machine learning techniques.

In this article, we will walk through the full workflow:

  1. Data analysis: how the data analysis part is done in a data science life cycle.
  2. Exploratory data analysis: EDA is one of the most important steps in a data science project, and here we will learn how to draw inferences from visualizations and data analysis.
  3. Model building: we will build 4 ML models and then choose the best-performing one.
  4. Saving the model: saving the best model using pickle so it can make predictions on real data.

What is Diabetes Prediction Using Machine Learning?

Diabetes prediction using machine learning means using computer programs to guess if someone might get diabetes. These programs look at different things like your health history and lifestyle to make their guess. They learn from lots of examples of people with and without diabetes to make better guesses. For example, they might look at how much sugar someone eats or if they exercise regularly. By doing this, they can give early warnings to people who might be at risk of getting diabetes so they can take better care of themselves.

Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

from mlxtend.plotting import plot_decision_regions
import missingno as msno
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Here we will be reading the dataset, which is in CSV format:

diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()

Output:


Exploratory Data Analysis (EDA)

Now let's see what columns are available in our dataset.

diabetes_df.columns

Output:

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

Information about the dataset:

diabetes_df.info()

Output:

RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

To know more about the dataset:

diabetes_df.describe()

Output:


To view the same summary with features as rows, take the transpose (T):

diabetes_df.describe().T

Output:


Now let's check whether our dataset has null values:

diabetes_df.isnull().head(10)

Output:


Now let's check the number of null values our dataset has.

diabetes_df.isnull().sum()

Output:

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In the code above we first checked for null values with the isnull() function and then summed them with sum(). The inference seems to be that there are no missing values, but that is not the true story: in this particular dataset all missing values were recorded as 0, which undermines the authenticity of the data. Hence we will first replace the 0 values with NaN and then start the imputation process.

diabetes_df_copy = diabetes_df.copy(deep=True)
diabetes_df_copy[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = diabetes_df_copy[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.nan)

# Showing the count of NaNs
print(diabetes_df_copy.isnull().sum())

Output:

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

As mentioned above, we replace the zeros with NaN so that we can impute them later, maintaining the authenticity of the dataset. For the imputation itself we will fill each column's missing values with that column's mean or median.

Data Visualization

Plotting the data distributions before removing null values:

p = diabetes_df.hist(figsize=(20, 20))

Output:


Inference: Here we see the distribution of each feature, whether dependent or independent. Why do we need to see the distribution of the data at all? Because it is the best way to start analyzing a dataset: it shows how often every kind of value occurs in graphical form, which in turn tells us the range of the data.

Now we will impute the missing values of each column using that column's mean or median.

diabetes_df_copy['Glucose'].fillna(diabetes_df_copy['Glucose'].mean(), inplace=True)
diabetes_df_copy['BloodPressure'].fillna(diabetes_df_copy['BloodPressure'].mean(), inplace=True)
diabetes_df_copy['SkinThickness'].fillna(diabetes_df_copy['SkinThickness'].median(), inplace=True)
diabetes_df_copy['Insulin'].fillna(diabetes_df_copy['Insulin'].median(), inplace=True)
diabetes_df_copy['BMI'].fillna(diabetes_df_copy['BMI'].median(), inplace=True)

Plotting the distributions after removing the NaN values:

p = diabetes_df_copy.hist(figsize=(20, 20))

Output:


Inference: Here we again use the histogram plot to see the distribution of the dataset, this time to observe how it changes once the null values are imputed. The difference is clearly visible: the imputed columns (for example Insulin and SkinThickness) now show a pronounced spike at the imputed value, which is exactly what mean/median imputation produces.
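To see this effect for a single column, we can overlay the before-and-after distributions. This is a minimal sketch, assuming the diabetes_df and diabetes_df_copy dataframes from above:

# Compare the Insulin distribution before and after median imputation
fig, ax = plt.subplots(figsize=(10, 5))
diabetes_df['Insulin'].plot(kind='hist', bins=50, alpha=0.5, ax=ax, label='before (zeros present)')
diabetes_df_copy['Insulin'].plot(kind='hist', bins=50, alpha=0.5, ax=ax, label='after (median-imputed)')
ax.set_xlabel('Insulin')
ax.legend()
plt.show()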

Plotting a null count analysis plot:

p = msno.bar(diabetes_df)

Output:


Inference: The bar plot above also clearly shows that diabetes_df contains no null values.

Now, let's check how well balanced our outcome column is.

color_wheel = {1: "#0392cf", 2: "#7bc043"}
colors = diabetes_df["Outcome"].map(lambda x: color_wheel.get(x + 1))
print(diabetes_df.Outcome.value_counts())
p = diabetes_df.Outcome.value_counts().plot(kind="bar")

Output:

0    500
1    268
Name: Outcome, dtype: int64


Inference: The visualization makes it clear that our dataset is quite imbalanced: the number of diabetic patients (268) is roughly half the number of non-diabetic patients (500). Next, let's look at the distribution and outliers of the Insulin column.

plt.subplot(121), sns.distplot(diabetes_df['Insulin'])
plt.subplot(122), diabetes_df['Insulin'].plot.box(figsize=(16, 5))
plt.show()

Output:


Inference: This is how a distplot helps: it shows the distribution of the data, while the accompanying box-and-whisker plot reveals the outliers in the column along with the other summary information such a plot provides.
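Before moving on, it is worth returning to the class imbalance noted above. One common mitigation is to preserve the class ratio in the train/test split and to re-weight the classes during training. The sketch below is illustrative only and is not what the models later in this article use:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

features = diabetes_df.drop('Outcome', axis=1)
target = diabetes_df['Outcome']

# stratify=target keeps the 500:268 outcome ratio identical in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(features, target, test_size=0.33,
                                          random_state=7, stratify=target)

# class_weight='balanced' re-weights classes inversely to their frequency
rfc_balanced = RandomForestClassifier(n_estimators=200, class_weight='balanced')
rfc_balanced.fit(X_tr, y_tr)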

Correlation between all the features

Correlation between all the features before cleaning:

plt.figure(figsize=(12, 10))
# Seaborn has an easy method to showcase a heatmap
p = sns.heatmap(diabetes_df.corr(), annot=True, cmap='RdYlGn')

Output:

Correlations between all the features
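To read the heatmap numerically, it can also help to rank each feature's correlation with the target; a small sketch on the same dataframe:

# Rank features by their linear correlation with Outcome
corr_with_outcome = diabetes_df.corr()['Outcome'].drop('Outcome')
print(corr_with_outcome.sort_values(ascending=False))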

Scaling the Data

Before scaling down the data, let's have a look at it.

diabetes_df_copy.head()

Output:


After standard scaling:

sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(diabetes_df_copy.drop(["Outcome"], axis=1)),
                 columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                          'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])
X.head()

Output:


That is how our dataset looks after scaling: every value is now on the same scale, which helps the ML model produce better results.
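One caveat worth flagging: fitting the scaler on the full dataset before splitting lets test-set statistics leak into training. A leakage-safe variant, sketched here but not what the rest of this article does, fits the scaler on the training fold only:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw = diabetes_df_copy.drop('Outcome', axis=1)
y_raw = diabetes_df_copy['Outcome']
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.33, random_state=7)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # fit on the training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test data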

Let's explore our target column.

y = diabetes_df_copy.Outcome
y

Output:

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

Model Building

Splitting the dataset:

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

Note that this reassigns X and y from the original diabetes_df, so the models below are trained on the raw data rather than the imputed and scaled version prepared above.

Now we will split the data into training and testing sets using the train_test_split function.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

Random Forest

Building the model using random forest:

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)

Now that the model is built, let's check its accuracy on the training dataset.

rfc_train = rfc.predict(X_train)
from sklearn import metrics

print("Accuracy_Score =", format(metrics.accuracy_score(y_train, rfc_train)))

Output: Accuracy_Score = 1.0

A training accuracy of 1.0 tells us the model has overfitted the training data, so the test-set score below is the one that matters.
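Training accuracy alone is a poor guide. A quick way to get a less optimistic estimate is k-fold cross-validation; a short sketch on the same training data:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: fit on 4 folds, score on the held-out fold, repeat
cv_scores = cross_val_score(rfc, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))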

Getting the accuracy score for random forest on the test set:

from sklearn import metrics

predictions = rfc.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, predictions)))

Output:Accuracy_Score = 0.7677165354330708

Classification report and confusion matrix of the random forest model:

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

Output:


Decision Tree

Building the model using a decision tree:

from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

Now we will make predictions directly on the testing data, as that is what matters most.

Getting the accuracy score for the decision tree:

from sklearn import metrics

predictions = dtree.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, predictions)))

Output:Accuracy Score = 0.7322834645669292

Classification report and confusion matrix of the decision tree model:

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

Output:


XGBoost Classifier

Building the model using XGBoost:

from xgboost import XGBClassifier
xgb_model = XGBClassifier(gamma=0)
xgb_model.fit(X_train, y_train)

Output:


Again, we will make predictions directly on the testing data, as that is what matters most.

Getting the accuracy score for the XGBoost classifier:

from sklearn import metrics

xgb_pred = xgb_model.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, xgb_pred)))

Output:Accuracy Score = 0.7401574803149606

Classification report and confusion matrix of the XGBoost classifier:

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, xgb_pred))
print(classification_report(y_test, xgb_pred))

Output:


Support Vector Machine (SVM)

Building the model using a support vector machine (SVM):

from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train)

Predictions from the support vector machine model on the testing data:

svc_pred = svc_model.predict(X_test)

Accuracy score for SVM:

from sklearn import metrics

print("Accuracy Score =", format(metrics.accuracy_score(y_test, svc_pred)))

Output:Accuracy Score = 0.7401574803149606

Classification report and confusion matrix of the SVM classifier:

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, svc_pred))
print(classification_report(y_test, svc_pred))

Output:


The Conclusion from Model Building

Random forest is therefore the best model for this prediction task, with a test accuracy of about 0.77, compared with 0.73 for the decision tree and 0.74 for both XGBoost and SVM.
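If you want this comparison in one place, a small loop over the four fitted models reproduces the test scores (assuming rfc, dtree, xgb_model, and svc_model from above):

from sklearn.metrics import accuracy_score

models = {'Random Forest': rfc, 'Decision Tree': dtree,
          'XGBoost': xgb_model, 'SVM': svc_model}
for name, model in models.items():
    # Score each already-fitted model on the same held-out test set
    print(name, accuracy_score(y_test, model.predict(X_test)))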

Feature Importance

Knowing the feature importances is quite useful, as it shows how much weight each feature carries in the model's predictions.

Getting the feature importances:

rfc.feature_importances_

Output:

array([0.07684946, 0.25643635, 0.08952599, 0.08437176, 0.08552636,
       0.14911634, 0.11751284, 0.1406609 ])

From the raw numbers alone it is not obvious which feature matters most, so let's visualize them.

Plotting the feature importances:

(pd.Series(rfc.feature_importances_, index=X.columns).plot(kind='barh'))

Output:


The graph makes it clear that Glucose is the most important feature in this dataset.
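Impurity-based importances can be biased toward features with many distinct values, so permutation importance is a common cross-check. A sketch using scikit-learn's inspection module (available in scikit-learn 0.22 and later):

from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure the resulting drop in score
result = permutation_importance(rfc, X_test, y_test, n_repeats=10, random_state=7)
print(pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False))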

Saving Model – Random Forest

import pickle

# First we use the dumps() function to serialize the model with pickle
saved_model = pickle.dumps(rfc)

# Then we load that saved model back
rfc_from_pickle = pickle.loads(saved_model)

# Lastly, we use the loaded model to make predictions
rfc_from_pickle.predict(X_test)

Output:

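pickle.dumps() only keeps the serialized model in memory. To persist it across sessions, write it to a file instead; a minimal sketch, with an illustrative filename:

import pickle

# Persist the trained model to disk
with open('rfc_diabetes_model.pkl', 'wb') as f:
    pickle.dump(rfc, f)

# Load it back later, e.g. in a separate serving script
with open('rfc_diabetes_model.pkl', 'rb') as f:
    loaded_rfc = pickle.load(f)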

For the last time, let's look at the head and tail of the dataset so that we can take a random set of features from each and test whether our model gives the right prediction.

diabetes_df.head()

Output:


diabetes_df.tail()

Output:


Putting data points into the model returns either 0 or 1, i.e. whether the person is suffering from diabetes or not.

rfc.predict([[0, 137, 40, 35, 168, 43.1, 2.228, 33]])  # 4th patient

Output: array([1], dtype=int64)

Another one:

rfc.predict([[10, 101, 76, 48, 180, 32.9, 0.171, 63]])  # 763rd patient

Output: array([0], dtype=int64)
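Passing a bare list works, but recent scikit-learn versions warn that the input has no feature names. Wrapping the input in a DataFrame with the training columns avoids the warning, and predict_proba exposes the model's confidence; a sketch:

# Build a one-row DataFrame with the same columns the model was trained on
patient = pd.DataFrame([[0, 137, 40, 35, 168, 43.1, 2.228, 33]], columns=X.columns)
print(rfc.predict(patient))        # predicted class: 0 or 1
print(rfc.predict_proba(patient))  # probability of each class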

How Can Machine Learning Predict Diabetes?

Data Gathering: Collect a thorough dataset with details on individuals’ health records, daily routines, and physical measurements relevant to diabetes prediction.

Data Preprocessing: Clean the data by eliminating inconsistencies and errors, ensuring the dataset is suitable for training machine learning algorithms to detect diabetes (see the pipeline sketch after this list).

Feature Selection: Recognize and choose important attributes like blood sugar levels, BMI, family history, and age. These characteristics are essential for accurately predicting diabetes.

Model Training: Train machine learning algorithms such as random forests or neural networks on the prepared dataset. During training, the models learn to recognize patterns that suggest the presence of diabetes by studying examples.

Model Evaluation: Evaluate the trained models using metrics such as accuracy, precision, recall, and F1-score. This step verifies how reliably the models predict diabetes.

Prediction: Use the trained machine learning models to forecast the probability of individuals developing diabetes from their provided data.

Continuous Monitoring: Set up a mechanism to continuously monitor the models and retrain them as new data arrives. This keeps the models accurate and applicable for forecasting diabetes in real-world situations.
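A minimal end-to-end version of these steps can be expressed as a scikit-learn Pipeline. This is a hedged sketch of the workflow, not the exact code used earlier in the article:

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Treat zeros as missing only in the columns where zero is physiologically impossible
features = diabetes_df.drop('Outcome', axis=1).copy()
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
features[zero_as_missing] = features[zero_as_missing].replace(0, np.nan)

# Impute -> scale -> classify, with all fitting confined to the training fold
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=200, random_state=7)),
])

X_tr, X_te, y_tr, y_te = train_test_split(features, diabetes_df['Outcome'],
                                          test_size=0.33, random_state=7)
pipe.fit(X_tr, y_tr)
print('Test accuracy:', pipe.score(X_te, y_te))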

Conclusion

After analyzing all these patient records, we’ve developed a machine learning model (specifically, random forest, which performed the best) that can effectively predict whether individuals in the dataset have diabetes. Alongside this, we’ve gained valuable insights from the data through analysis and visualization, aiding in the prediction of diabetes using machine learning techniques.

Here’s the repo link to this article.

Frequently Asked Questions

Q1. What algorithms are used for diabetes prediction?

Machine learning techniques such as decision trees, logistic regression, neural networks, and random forests are commonly used for diabetes prediction. These algorithms examine data about blood sugar levels and lifestyle choices to predict the probability of developing diabetes. This method is referred to as machine learning for predicting diabetes.

Q2. Why do we use SVM in diabetes prediction?

Support Vector Machines (SVM) were selected for predicting diabetes because of their capability to deal with intricate datasets with high dimensionality. SVM efficiently categorizes individuals into diabetic and non-diabetic groups using different input factors, aiding in the creation of precise diabetes prediction models.

Q3. Can AI detect diabetes?

Yes, artificial intelligence, particularly through machine learning, can effectively detect diabetes by analyzing patients’ medical records and physical measurements. This enables early identification of at-risk individuals and facilitates preventive measures and personalized healthcare plans.

The media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.



Responses From Readers

Ramu, 06 Jan 2022

I went through this Diabetes Prediction Using Machine Learning article and noticed the following; I'm not sure I observed it correctly. Before model building, you used StandardScaler to scale the input features 'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age' and assigned the result to X, and you took the Outcome feature from the diabetes_df_copy dataframe into y. But when you started model building and split the data with train test split, you used the diabetes_df dataframe (which is the non-imputed version) instead of the already scaled data in X. Also, before imputing, you copied the data from diabetes_df into the diabetes_df_copy dataframe. I think you were supposed to use the scaled data in X for all the ML model builds. Can you please correct me if I observed this incorrectly?

Robert E Hoyt, 06 Jan 2022

Good demonstration of a real-world machine learning process. As a clinician, I have concerns about using this dataset without some medical expertise. For example, you can't have a triceps skinfold thickness or insulin level of zero; that means the test was not done. Is imputation legitimate in this situation? A pregnancy value of zero could mean the question was not asked, or it could mean no pregnancies; we don't know which. The pedigree column is complicated and probably should be deleted rather than used.

Yuvaraj s s, 01 Feb 2023

Sir any other projects available sir?

