Predicting if Patient has Diabetes: Classification
According to the National Diabetes Statistics Report (2020), 34.2 million people in the United States have diabetes (10.5% of the population), and a further 88 million people aged 18 years or older have prediabetes. Diabetes is a disease in which blood glucose (sugar) levels are too high, and it is the seventh leading cause of death in the United States. It is therefore important to understand the risk factors that lead to a diagnosis of this fast-growing disease.
Diabetes Dataset
The dataset was imported from Kaggle's Machine Learning and Data Science Community and originates from the National Institute of Diabetes and Digestive and Kidney Diseases.
Here are the first five observations of the dataset, produced by df.head():
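A minimal sketch of loading the data and inspecting the first rows. The file name `diabetes.csv` is an assumption about how the Kaggle download was saved; the rows below are the well-known first five observations of this dataset, used here as a stand-in so the snippet runs without the file.

```python
import pandas as pd

# In practice: df = pd.read_csv("diabetes.csv")
# A stand-in frame with the dataset's columns and first five rows:
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

print(df.head())  # first 5 observations
```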
To check the dataset for missing values and other potential issues, df.info() and a ProfileReport were run:
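The checks above can be sketched as follows. The profiling library is assumed to be ydata-profiling (the successor to pandas-profiling, which provides the `ProfileReport` class mentioned above); a tiny stand-in frame is used so the snippet is self-contained.

```python
import pandas as pd

# Stand-in frame; in practice df = pd.read_csv("diabetes.csv")
df = pd.DataFrame({"Glucose": [148, 85, 183], "Outcome": [1, 0, 1]})

# dtypes, memory usage, and per-column non-null counts
df.info()

# Confirm there are no missing values anywhere in the frame
missing = int(df.isnull().sum().sum())
print(f"Missing values: {missing}")

# The HTML profiling report would be generated with (assumed library):
# from ydata_profiling import ProfileReport
# ProfileReport(df).to_file("diabetes_report.html")
```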
We can see there are no missing values. The dataset consists of 8 features to help us predict whether a person is diabetic: number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, body mass index (BMI), a diabetes pedigree function based on family history, and age. Our target variable is Outcome, which is binary: 1 means the person has diabetes and 0 means they do not. This is a classification problem.
Here are the correlation coefficients between each feature and our target; a heatmap is also attached to help us visualize them. Brighter colors (values closer to 1) indicate stronger correlation, and we can see that glucose, BMI, and age correlate most strongly with the outcome.
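A sketch of how the correlation table and heatmap can be produced. Random stand-in data is used here (the real values come from the full dataset), and seaborn is an assumption about the plotting library behind the heatmap.

```python
import numpy as np
import pandas as pd

# Stand-in data; in practice df holds the full diabetes dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(100, 3)),
                  columns=["Glucose", "BMI", "Outcome"])

# Pairwise Pearson correlations; the "Outcome" column shows each
# feature's correlation with the target.
corr = df.corr()
print(corr["Outcome"].sort_values(ascending=False))

# Heatmap visualization (seaborn assumed):
# import seaborn as sns
# sns.heatmap(corr, annot=True)
```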
Baseline
The baseline was established at 65%, the accuracy of always predicting the majority class (no diabetes). We can compare our models against it and try to improve on it.
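A majority-class baseline can be computed directly from the class balance. The 65/35 label split below is a stand-in mirroring the dataset's approximate class balance; in practice `y = df["Outcome"]`.

```python
import pandas as pd

# Stand-in labels roughly matching the dataset's class balance;
# in practice: y = df["Outcome"]
y = pd.Series([0] * 65 + [1] * 35)

# Accuracy of always predicting the most frequent class
baseline_accuracy = y.value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline_accuracy:.0%}")  # → 65%
```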
A linear model (Logistic Regression) and a tree-based model (Random Forest Classifier) were trained on this dataset.
Logistic Regression Model:
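A sketch of the logistic regression fit with scikit-learn. Synthetic data stands in for the real features and target (in practice `X = df.drop(columns="Outcome")`, `y = df["Outcome"]`), and the 80/20 split and random seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 8 diabetes features and binary target
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print(f"Logistic regression accuracy: {log_reg.score(X_test, y_test):.3f}")
```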
Random Forest Classification Model:
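The random forest can be trained the same way. Again synthetic data stands in for the real features and target, and the number of trees and seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in features/target (see note above)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"Random forest accuracy: {rf.score(X_test, y_test):.3f}")
```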
The Random Forest Classifier gives slightly more accurate diabetes predictions.
Feature Importances:
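Feature importances come directly from the fitted random forest's `feature_importances_` attribute. The feature names below are the dataset's columns; the data itself is a synthetic stand-in.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Synthetic stand-in for the real feature matrix and target
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Mean decrease in impurity per feature; values sum to 1
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```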
Grid search hyperparameter tuning was also used to optimize the random forest model, and the best parameters were found to be:
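The tuning step can be sketched with scikit-learn's GridSearchCV. The parameter grid below is illustrative only (the grid actually searched in the project is not shown in the text), and synthetic data again stands in for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Illustrative (assumed) grid of candidate hyperparameters
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
}

# 5-fold cross-validated search over all grid combinations
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
```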