Predicting if Patient has Diabetes: Classification
According to the National Diabetes Statistics Report (2020), 34.2 million people in the United States have diabetes (10.5% of the population), and a further 88 million people aged 18 years or older have prediabetes. Diabetes is a disease in which blood glucose (sugar) levels are too high, and it is the seventh leading cause of death in the United States. It is therefore important to understand the risk factors that lead to a diagnosis of this fast-growing disease.
Diabetes Dataset
The dataset was imported from Kaggle's Machine Learning and Data Science Community and originates from the National Institute of Diabetes and Digestive and Kidney Diseases.
Here are the first five observations of the dataset, produced by df.head():
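A minimal sketch of loading the data and inspecting the first rows. The file name `diabetes.csv` is an assumption about how the Kaggle download was saved; the rows below are the well-known first five observations of this dataset, used here as a stand-in so the snippet runs without the file.

```python
import pandas as pd

# In practice: df = pd.read_csv("diabetes.csv")
# A stand-in frame with the dataset's columns and first five rows:
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

print(df.head())  # first 5 observations
```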
To check the dataset for missing values and other potential issues, df.info() and a ProfileReport were run:
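The checks above can be sketched as follows. The profiling library is assumed to be ydata-profiling (the successor to pandas-profiling, which provides the `ProfileReport` class mentioned above); a tiny stand-in frame is used so the snippet is self-contained.

```python
import pandas as pd

# Stand-in frame; in practice df = pd.read_csv("diabetes.csv")
df = pd.DataFrame({"Glucose": [148, 85, 183], "Outcome": [1, 0, 1]})

# dtypes, memory usage, and per-column non-null counts
df.info()

# Confirm there are no missing values anywhere in the frame
missing = int(df.isnull().sum().sum())
print(f"Missing values: {missing}")

# The HTML profiling report would be generated with (assumed library):
# from ydata_profiling import ProfileReport
# ProfileReport(df).to_file("diabetes_report.html")
```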
We can see there are no missing values. The dataset consists of 8 features to help us predict whether a person is diabetic: number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, body mass index (BMI), a diabetes pedigree function based on family history, and age. Our target variable is Outcome, which is binary: 1 means the person has diabetes and 0 means they do not. This is a classification problem.
Here are the correlation coefficients between each feature and our target; a heatmap is also attached to help us visualize them. Brighter colors (values closer to 1) indicate stronger correlation, and we can see that glucose, BMI, and age correlate most strongly with the outcome.
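A sketch of how the correlation table and heatmap can be produced. Random stand-in data is used here (the real values come from the full dataset), and seaborn is an assumption about the plotting library behind the heatmap.

```python
import numpy as np
import pandas as pd

# Stand-in data; in practice df holds the full diabetes dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(100, 3)),
                  columns=["Glucose", "BMI", "Outcome"])

# Pairwise Pearson correlations; the "Outcome" column shows each
# feature's correlation with the target.
corr = df.corr()
print(corr["Outcome"].sort_values(ascending=False))

# Heatmap visualization (seaborn assumed):
# import seaborn as sns
# sns.heatmap(corr, annot=True)
```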
Baseline
The baseline was established at 65%, the accuracy of always predicting the majority class (no diabetes). We can compare our models against it and try to improve on it.
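A majority-class baseline can be computed directly from the class balance. The 65/35 label split below is a stand-in mirroring the dataset's approximate class balance; in practice `y = df["Outcome"]`.

```python
import pandas as pd

# Stand-in labels roughly matching the dataset's class balance;
# in practice: y = df["Outcome"]
y = pd.Series([0] * 65 + [1] * 35)

# Accuracy of always predicting the most frequent class
baseline_accuracy = y.value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline_accuracy:.0%}")  # → 65%
```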
A linear model (Logistic Regression) and a tree-based model (Random Forest Classifier) were trained on this dataset.
Logistic Regression Model:
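A sketch of the logistic regression fit with scikit-learn. Synthetic data stands in for the real features and target (in practice `X = df.drop(columns="Outcome")`, `y = df["Outcome"]`), and the 80/20 split and random seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 8 diabetes features and binary target
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print(f"Logistic regression accuracy: {log_reg.score(X_test, y_test):.3f}")
```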
Random Forest Classification Model:
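The random forest can be trained the same way. Again synthetic data stands in for the real features and target, and the number of trees and seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in features/target (see note above)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"Random forest accuracy: {rf.score(X_test, y_test):.3f}")
```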
The Random Forest Classifier gives slightly more accurate diabetes predictions.
Feature Importances:
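Feature importances come directly from the fitted random forest's `feature_importances_` attribute. The feature names below are the dataset's columns; the data itself is a synthetic stand-in.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Synthetic stand-in for the real feature matrix and target
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Mean decrease in impurity per feature; values sum to 1
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```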
Grid search hyperparameter tuning was also used to optimize the random forest model, and the best parameters were found to be:
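The tuning step can be sketched with scikit-learn's GridSearchCV. The parameter grid below is illustrative only (the grid actually searched in the project is not shown in the text), and synthetic data again stands in for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Illustrative (assumed) grid of candidate hyperparameters
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
}

# 5-fold cross-validated search over all grid combinations
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
```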