This article was published as a part of the Data Science Blogathon.
In this article, we’ll delve into predicting whether a patient might have diabetes or not. We’ll achieve this by feeding certain features into our machine learning model. For this task, we’ll tap into the renowned Pima Indians Diabetes Database, which holds valuable information aiding in diabetes prediction using machine learning techniques.
Diabetes prediction using machine learning means using computer programs to guess if someone might get diabetes. These programs look at different things like your health history and lifestyle to make their guess. They learn from lots of examples of people with and without diabetes to make better guesses. For example, they might look at how much sugar someone eats or if they exercise regularly. By doing this, they can give early warnings to people who might be at risk of getting diabetes so they can take better care of themselves.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from mlxtend.plotting import plot_decision_regions
import missingno as msno
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()
Output:
Now let's see which columns are available in our dataset.
diabetes_df.columns
Output:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
Information about the dataset:
diabetes_df.info()
Output:
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Summary statistics for each column:
diabetes_df.describe()
Output:
The same summary transposed (T stands for transpose), which gives one row per feature:
diabetes_df.describe().T
Output:
Now let's check whether our dataset has any null values.
diabetes_df.isnull().head(10)
Output:
Now let's count the null values in each column.
diabetes_df.isnull().sum()
Output:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
In the code above, isnull() flags the missing entries and sum() counts them per column. The result suggests that there are no missing values, but that is misleading: in this dataset, missing measurements were recorded as 0. A glucose level, blood pressure, skin thickness, insulin level, or BMI of 0 is not physiologically plausible, so we first replace the zeros in those columns with NaN and then start the imputation process.
diabetes_df_copy = diabetes_df.copy(deep=True)
diabetes_df_copy[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = diabetes_df_copy[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.nan)
# Show the count of NaNs per column
print(diabetes_df_copy.isnull().sum())
Output:
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
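To make the zeros-as-missing idea concrete, here is a minimal sketch on a toy frame. The rows are made up for illustration, and the names toy, measured_cols, zero_counts, and cleaned are not part of the article's code. It counts the zeros first and only converts them in columns where a zero cannot be a real measurement.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 0, 5],        # 0 is a valid count here
    "Glucose":     [148, 0, 183, 89],   # 0 means "not measured"
    "Insulin":     [0, 0, 94, 168],
})

measured_cols = ["Glucose", "Insulin"]  # columns where 0 is implausible

# Count zeros before converting them, so we know how much is missing.
zero_counts = (toy[measured_cols] == 0).sum()

cleaned = toy.copy()
cleaned[measured_cols] = cleaned[measured_cols].replace(0, np.nan)

print(zero_counts)
print(cleaned.isnull().sum())
```

Pregnancies is deliberately left out of measured_cols, because a zero there can be a genuine value.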
As noted above, the zeros have been replaced with NaN so that they can be imputed later, which keeps the dataset closer to reality. Each missing entry will be filled with a summary statistic of its own column: the mean for Glucose and BloodPressure, and the median for SkinThickness, Insulin, and BMI.
Plotting the data distributions before imputing the missing values:
p = diabetes_df.hist(figsize=(20, 20))
Output:
Inference: These histograms show the distribution of every column, features and target alike. Why look at distributions at all? It is one of the best ways to start analysing a dataset: the plots show how often each value occurs, which in turn tells us the range and skew of the data.
Now we will fill each missing value with the mean or median of its column.
# Assigning the result back avoids calling fillna(..., inplace=True) on a selected column,
# a chained-assignment pattern that recent pandas versions warn about.
diabetes_df_copy['Glucose'] = diabetes_df_copy['Glucose'].fillna(diabetes_df_copy['Glucose'].mean())
diabetes_df_copy['BloodPressure'] = diabetes_df_copy['BloodPressure'].fillna(diabetes_df_copy['BloodPressure'].mean())
diabetes_df_copy['SkinThickness'] = diabetes_df_copy['SkinThickness'].fillna(diabetes_df_copy['SkinThickness'].median())
diabetes_df_copy['Insulin'] = diabetes_df_copy['Insulin'].fillna(diabetes_df_copy['Insulin'].median())
diabetes_df_copy['BMI'] = diabetes_df_copy['BMI'].fillna(diabetes_df_copy['BMI'].median())
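The choice between mean and median matters most for skewed columns. The sketch below uses hypothetical, right-skewed readings (not real Insulin values) and only the standard library to show how a few large values pull the mean well above the median.

```python
from statistics import mean, median

# Hypothetical, right-skewed measurements with two missing entries (None).
insulin = [15, 36, 88, 94, 120, 168, 543, 846, None, None]

observed = [v for v in insulin if v is not None]
fill_mean = mean(observed)
fill_median = median(observed)

# Every missing entry receives the same fill value.
imputed_with_median = [fill_median if v is None else v for v in insulin]

print(fill_mean, fill_median)
# prints: 238.75 107.0
```

Because every gap receives the same number, a heavily imputed column will show a tall bar at that value in its histogram.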
Plotting the distributions after imputing the NaN values:
p = diabetes_df_copy.hist(figsize=(20, 20))
Output:
Inference: We use the histograms again, this time to see what changed after the missing values were imputed. The imputed columns, most visibly Insulin and SkinThickness, no longer have a bar at zero and instead show a tall bar at the imputed value, because every missing entry in a column received the same number. Columns that were never imputed, such as Age, are unchanged.
Plotting the null count analysis plot:
p = msno.bar(diabetes_df)
Output:
Inference: The graph above confirms that diabetes_df contains no null values. Its missing measurements are still stored as zeros, which this plot does not count as missing.
Now, let's check how well balanced our Outcome column is.
color_wheel = {1: "#0392cf", 2: "#7bc043"}
colors = diabetes_df["Outcome"].map(lambda x: color_wheel.get(x + 1))
print(diabetes_df.Outcome.value_counts())
p = diabetes_df.Outcome.value_counts().plot(kind="bar")
Output:
0    500
1    268
Name: Outcome, dtype: int64
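To put the counts above in perspective, this short sketch computes the share of diabetic patients and the accuracy a trivial "always predict 0" classifier would reach. The variable names are illustrative. Any model's accuracy on this data is worth comparing against that baseline.

```python
# Class counts as printed above.
counts = {0: 500, 1: 268}

total = sum(counts.values())
share_diabetic = counts[1] / total

# A "model" that always predicts the majority class gets this accuracy.
majority_baseline = max(counts.values()) / total

print(f"diabetic share:    {share_diabetic:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
# prints: diabetic share:    0.349
#         majority baseline: 0.651
```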
Inference: The visualization above shows that our dataset is imbalanced: there are 268 diabetic patients against 500 non-diabetic ones, roughly half as many.
plt.figure(figsize=(16, 5))
plt.subplot(121)
sns.histplot(diabetes_df['Insulin'], kde=True)  # replaces distplot, which is deprecated in recent seaborn
ax = plt.subplot(122)
diabetes_df['Insulin'].plot.box(ax=ax)
plt.show()
Output:
Inference: The distribution plot shows how the values of a column are spread, while the box-and-whisker plot highlights its outliers along with the median and quartiles.
Correlation between all the features before cleaning:
plt.figure(figsize=(12, 10))
# seaborn has an easy method to showcase a heatmap
p = sns.heatmap(diabetes_df.corr(), annot=True, cmap='RdYlGn')
Output:
Before scaling the data, let's have a look at it:
diabetes_df_copy.head()
Output:
After standard scaling:
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(diabetes_df_copy.drop(["Outcome"], axis=1)), columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])
X.head()
Output:
This is what the dataset looks like after standardization: every feature now has zero mean and unit variance, so all values are on the same scale. That mainly helps scale-sensitive models such as KNN and SVM; tree-based models are largely unaffected by it.
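Standard scaling is simply the z-score, z = (x - mean) / std, applied column by column. The sketch below uses hypothetical glucose and BMI readings to check that a manual computation matches StandardScaler, which uses the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical glucose and BMI readings, for illustration only.
raw = np.array([[148.0, 33.6],
                [85.0, 26.6],
                [183.0, 23.3],
                [89.0, 28.1]])

# z = (x - column mean) / column standard deviation (population std, ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

scaled = StandardScaler().fit_transform(raw)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```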
Let's explore our target column:
y = diabetes_df_copy.Outcome
y
Output:
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64
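Splitting the dataset. Note that X is reassigned here from the original diabetes_df, so the models below are trained on the raw features (zeros included) rather than on the imputed and scaled X built earlier. The scores reported in this article correspond to this raw-feature setup.
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']
Now we will split the data into training and testing sets using the train_test_split function. With test_size=0.33, 254 of the 768 rows are held out for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)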
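One way to make sure the imputed and scaled features actually reach the model is to bundle the preprocessing with the classifier in a scikit-learn Pipeline. The following is a sketch on synthetic stand-in data, not the article's method: the demo frame, its values, and the choice of SVC are assumptions for illustration. The point is that the imputer and scaler are fitted on the training split only and then reused on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical values, not the Pima dataset).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 7, n),
})
demo["Outcome"] = (demo["Glucose"] + rng.normal(0, 25, n) > 125).astype(int)
# Knock out 20 glucose readings to simulate missing measurements.
demo.loc[rng.choice(n, 20, replace=False), "Glucose"] = np.nan

X_demo = demo.drop("Outcome", axis=1)
y_demo = demo["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=7, stratify=y_demo
)

# Imputer and scaler are fitted on the training split only, then reused on test.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", SVC()),
])
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

The stratify argument keeps the class ratio similar in both splits, which is useful when the classes are imbalanced.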
Building the model using RandomForest:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
Now that the model is built, let's check its accuracy on the training dataset.
rfc_train = rfc.predict(X_train)
from sklearn import metrics
print("Accuracy_Score =", format(metrics.accuracy_score(y_train, rfc_train)))
Output:
Accuracy_Score = 1.0
A perfect score on the training dataset suggests that the model has memorized the training set, i.e. it is overfitted. Accuracy on the held-out test set is the more meaningful number.
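To see how far training accuracy can drift from a realistic estimate, the sketch below compares training accuracy, test accuracy, and 5-fold cross-validation. It runs on synthetic data from make_classification (an assumption, so the numbers will not match the article's), and the names forest, train_acc, test_acc, and cv_scores are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data: 8 features like the article, but made-up values.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=7
)

forest = RandomForestClassifier(n_estimators=100, random_state=7)
forest.fit(X_tr, y_tr)

train_acc = forest.score(X_tr, y_tr)
test_acc = forest.score(X_te, y_te)

# 5-fold cross-validation on the training split gives a less optimistic estimate.
cv_scores = cross_val_score(forest, X_tr, y_tr, cv=5)

print(f"train={train_acc:.3f} test={test_acc:.3f} cv mean={cv_scores.mean():.3f}")
```

If the cross-validated mean sits well below the training score, limiting tree depth (for example via max_depth or min_samples_leaf) is one common way to reduce the gap.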
Getting the accuracy score for Random Forest on the testing data:
from sklearn import metrics
predictions = rfc.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Accuracy_Score = 0.7677165354330708
Classification report and confusion matrix of the random forest model:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
Output:
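The precision, recall, and F1-score in a classification report are all derived from the four cells of the confusion matrix. As a worked example, the counts below are hypothetical, chosen only to keep the arithmetic easy, and are not the article's results. The layout follows scikit-learn's convention of rows for actual classes and columns for predicted classes.

```python
# Hypothetical confusion-matrix counts, chosen only to keep the arithmetic easy.
tn, fp = 140, 20   # actual 0: predicted 0 / predicted 1
fn, tp = 30, 60    # actual 1: predicted 0 / predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted diabetics, how many really are
recall = tp / (tp + fn)      # of the real diabetics, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.800 precision=0.750 recall=0.667 f1=0.706
```

For a screening task like diabetes prediction, recall on the diabetic class is often the number to watch, since a missed case is usually costlier than a false alarm.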
Building the model using DecisionTree:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
From here on we evaluate directly on the testing data, since that is the score that matters.
Getting the accuracy score for Decision Tree:
from sklearn import metrics
predictions = dtree.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Accuracy Score = 0.7322834645669292
Classification report and confusion matrix of the decision tree model:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
Output:
Building the model using XGBoost:
from xgboost import XGBClassifier
xgb_model = XGBClassifier(gamma=0)
xgb_model.fit(X_train, y_train)
Output:
As before, we evaluate directly on the testing data.
Getting the accuracy score for the XGBoost classifier:
from sklearn import metrics
xgb_pred = xgb_model.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, xgb_pred)))
Output:
Accuracy Score = 0.7401574803149606
Classification report and confusion matrix of the XGBoost classifier:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, xgb_pred))
print(classification_report(y_test, xgb_pred))
Output:
Building the model using Support Vector Machine (SVM):
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train)
Prediction from the support vector machine model on the testing data:
svc_pred = svc_model.predict(X_test)
Accuracy score for SVM:
from sklearn import metrics
print("Accuracy Score =", format(metrics.accuracy_score(y_test, svc_pred)))
Output:
Accuracy Score = 0.7401574803149606
Classification report and confusion matrix of the SVM classifier:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, svc_pred))
print(classification_report(y_test, svc_pred))
Output:
Random forest is therefore the best of the four models on this test split, with an accuracy_score of about 0.77, compared with 0.74 for XGBoost and SVM and 0.73 for the decision tree.
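It helps to translate these accuracies back into numbers of test rows. The sketch below assumes the 254-row test set implied by test_size=0.33 on 768 rows and uses the scores exactly as printed above. The standard error is a rough binomial approximation, not something the article computes.

```python
import math

n_test = 254  # ceil(0.33 * 768), the test size implied by test_size=0.33

# Accuracy scores exactly as printed in the article.
scores = {
    "RandomForest": 0.7677165354330708,
    "DecisionTree": 0.7322834645669292,
    "XGBoost": 0.7401574803149606,
    "SVM": 0.7401574803149606,
}

correct_counts = {name: round(acc * n_test) for name, acc in scores.items()}

for name, acc in scores.items():
    # Rough binomial standard error of an accuracy measured on n_test rows.
    std_err = math.sqrt(acc * (1 - acc) / n_test)
    print(f"{name:12s} correct={correct_counts[name]}/{n_test} std_err={std_err:.3f}")
```

Random forest gets 195 rows right against 188 for XGBoost and SVM and 186 for the decision tree. A gap of 7 to 9 rows is of the same order as the standard error, so the ranking may well change with a different random split.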
Feature importance is worth examining because it shows how much weight each feature carries in the model's predictions.
Getting feature importances:
rfc.feature_importances_
Output:
array([0.07684946, 0.25643635, 0.08952599, 0.08437176, 0.08552636, 0.14911634, 0.11751284, 0.1406609 ])
The raw array does not make it obvious which feature matters most, so we will now visualize it against the feature names.
Plotting feature importances:
pd.Series(rfc.feature_importances_, index=X.columns).plot(kind='barh')
Output:
Here from the above graph, it is clearly visible that Glucose as a feature is the most important in this dataset.
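The same ranking can be read off without a plot by pairing each importance with its column name and sorting. This sketch uses the importances printed above and the column order of X; the names feature_names, importances, and ranked are illustrative.

```python
# Column order of X and the importances printed in the article.
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
importances = [0.07684946, 0.25643635, 0.08952599, 0.08437176,
               0.08552636, 0.14911634, 0.11751284, 0.1406609]

# Sort (name, importance) pairs from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)

for name, score in ranked:
    print(f"{name:26s} {score:.3f}")
```

Glucose comes first at about 0.256, followed by BMI and Age. The importances sum to 1, so each value can be read as a share of the total.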
import pickle
# First we serialize the model to a bytes object using pickle.dumps()
saved_model = pickle.dumps(rfc)
# Then we load the model back from those bytes
rfc_from_pickle = pickle.loads(saved_model)
# Lastly, we use the loaded model to make predictions
rfc_from_pickle.predict(X_test)
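The snippet above keeps the pickled model in memory. To reuse a model in a later session it has to be written to disk, which can be sketched with pickle.dump() and pickle.load(). The tiny training set, the decision tree, and the file name diabetes_model.pkl below are hypothetical stand-ins for the article's rfc.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set: [glucose, bmi] -> outcome.
X_small = [[85, 22.0], [90, 24.5], [150, 35.1], [170, 38.4]]
y_small = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "diabetes_model.pkl")

# Write the fitted model to disk...
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# ...and read it back, as a later session would.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored.predict([[160, 36.0]]))
```

Only unpickle files from sources you trust, and load them with the same scikit-learn version that saved them where possible.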
Output:
Finally, let's look at the head and tail of the dataset so that we can pick a row from each and check whether our model gives the right prediction for it.
diabetes_df.head()
Output:
diabetes_df.tail()
Output:
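Passing a data point to the model returns either 1 (diabetic) or 0 (non-diabetic). Keep in mind that a row taken from the dataset may have been part of the training split, in which case a correct prediction says little about how well the model generalizes.
rfc.predict([[0,137,40,35,168,43.1,2.228,33]])  # patient at row index 4
Output:
array([1], dtype=int64)
Another one:
rfc.predict([[10,101,76,48,180,32.9,0.171,63]])  # patient at row index 763
Output:
array([0], dtype=int64)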
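A hard 0 or 1 hides how confident the model is. As a sketch, predict_proba() returns the estimated class probabilities, and passing a DataFrame with named columns lets scikit-learn check the input against the feature names seen during fitting. The training rows below are made up and use only two of the article's features.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training rows with two of the article's feature names.
train = pd.DataFrame({
    "Glucose": [85, 89, 95, 110, 137, 150, 168, 183],
    "BMI":     [26.6, 28.1, 24.0, 30.5, 43.1, 35.0, 38.0, 33.3],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})
features = ["Glucose", "BMI"]
clf = RandomForestClassifier(n_estimators=50, random_state=7)
clf.fit(train[features], train["Outcome"])

# A one-row DataFrame with the same column names and order used for fitting.
patient = pd.DataFrame([{"Glucose": 140, "BMI": 36.0}])[features]

label = clf.predict(patient)[0]
proba = clf.predict_proba(patient)   # columns follow clf.classes_, here [0, 1]
prob_diabetic = proba[0, 1]
print(label, round(prob_diabetic, 2))
```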
Data gathering: Collect a thorough dataset with details on individuals' health records, daily routines, and physical measurements related to diabetes.
Data preprocessing: Clean the data by eliminating inconsistencies and errors, so that the dataset is suitable for training machine learning algorithms to detect diabetes.
Feature selection: Recognize and choose important attributes such as blood sugar levels, BMI, family history, and age. These characteristics are essential for predicting diabetes accurately.
Model training: Train machine learning algorithms such as random forest or neural networks on the prepared dataset. During training, the models learn to identify patterns that suggest the presence of diabetes by studying labelled examples.
Model evaluation: Assess the trained models using metrics such as accuracy, precision, recall, and F1-score to check how reliable their predictions are.
Prediction: Use the trained models to estimate the probability that an individual has diabetes from the data they provide.
Continuous monitoring: Set up a mechanism to monitor the models and retrain them as new data arrives. This helps the models stay accurate and applicable in real-world use.
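The training and evaluation steps above can be sketched as a single loop that scores several candidate models under the same cross-validation scheme. This runs on synthetic data from make_classification; the candidate set, the F1 scoring choice, and the names candidates, results, and best are assumptions for illustration rather than the article's procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a cleaned dataset (made-up values, 8 features).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=7
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=7),
    "decision_tree": DecisionTreeClassifier(random_state=7),
    "svm": make_pipeline(StandardScaler(), SVC()),  # scaling lives in the pipeline
}

results = {}
for name, model in candidates.items():
    # Train and evaluate on 5 different splits, then average the F1 scores.
    results[name] = cross_val_score(model, X_syn, y_syn, cv=5, scoring="f1").mean()

best = max(results, key=results.get)
print(results, "best:", best)
```

Averaging over several splits makes the comparison less dependent on one lucky or unlucky train/test partition.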
After analyzing all these patient records, we built several machine learning models, of which random forest performed best, reaching a test accuracy of about 0.77 at predicting whether individuals in the dataset have diabetes. Alongside this, we gained valuable insights from the data through analysis and visualization, which help in predicting diabetes with machine learning techniques.
Here’s the repo link to this article.
Machine learning techniques such as decision trees, logistic regression, neural networks, and random forests are commonly used for diabetes prediction. These algorithms examine data such as blood sugar levels and lifestyle choices to estimate the probability of developing diabetes.
Support Vector Machines (SVM) are chosen for predicting diabetes because of their ability to handle complex, high-dimensional datasets. An SVM separates individuals into diabetic and non-diabetic groups based on the input features, which helps in building accurate diabetes prediction models.
Yes. Artificial intelligence, particularly machine learning, can help detect diabetes by analyzing patients' medical records and physical symptoms. This enables early identification of at-risk individuals and supports preventive measures and personalized healthcare plans.
The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.
I went through this Diabetes Prediction Using Machine Learning article and noticed something in the code. I am not sure I observed it correctly. Before model building, you used StandardScaler to scale the input features 'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age' and assigned the result to the X variable. And from the diabetes_df_copy dataframe you took the Outcome feature into y. But when you started model building and split the data with train test split, you used the diabetes_df dataframe (which is not the imputed version of the dataframe) instead of the already scaled data in X. Also, before imputing the data, you copied it from diabetes_df into the diabetes_df_copy dataframe. I think you were supposed to use the scaled data in X for all the ML model builds. Can you please correct me if I observed incorrectly?
Good demonstration of a real-world machine learning process. As a clinician I have concerns about using this dataset without some medical expertise. For example, you can't have triceps thickness or insulin levels of zero. This means the test was not done. Is imputation legitimate in this situation? A pregnancy level of zero could mean it was not asked or it could mean no pregnancies. We don't know which. The column on pedigree is complicated and probably should be deleted and not used
Sir any other projects available sir?