Machine Learning Reflective Report

BAA1027 Data Analytics: Machine Learning and Advanced Python

Sadhana R Sambandam

About the Data

This dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases and contains information on 768 women from a population near Phoenix, Arizona, USA. The outcome tested was diabetes: 268 women tested positive and 500 tested negative. (Found on Kaggle.)

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Information on Data Set

- Pregnancies: Number of pregnancies the patient has had to date.

- Glucose: Plasma glucose concentration measured two hours after consuming 75 g of glucose in an oral glucose tolerance test (mg/dL); higher values indicate elevated blood sugar.

- BloodPressure: Diastolic blood pressure (mmHg).

- SkinThickness: Skin thickness of patient, most likely triceps skin fold thickness in mm.

- Insulin: Insulin level of patient (2 h serum insulin in mu U/ml).

- BMI: Body Mass Index (kg/m2), a measure of weight relative to height commonly used to classify individuals as underweight, normal weight, overweight, or obese (see the short sketch after this list).

- DiabetesPedigreeFunction: A function that summarises the history of diabetes mellitus in the patient's relatives and the genetic relationship of those relatives to the patient, giving a rough measure of the hereditary risk of developing diabetes.

- Age: Age of patient. (years)

- Outcome: 0 indicates the patient does not have diabetes; 1 indicates the patient has diabetes.
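
To make the BMI categories concrete, here is a small illustrative sketch (not part of the original notebook) that buckets the BMI column using the standard WHO cut-offs; note that the raw file still contains some invalid zero BMI values at this point, which are cleaned later.

# Illustrative only: bucket the BMI column using the standard WHO cut-offs
# (underweight < 18.5, normal 18.5-24.9, overweight 25-29.9, obese >= 30).
import pandas as pd

df = pd.read_csv('diabetes.csv')
bmi_category = pd.cut(
    df['BMI'],
    bins=[0, 18.5, 25, 30, float('inf')],
    labels=['Underweight', 'Normal weight', 'Overweight', 'Obese'],
    right=False   # left-closed bins, so a BMI of exactly 25.0 falls in 'Overweight'
)
print(bmi_category.value_counts())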

Who are Pima Indians?

The Pima people, originally from the Sonoran Desert and Sierra Madre regions, migrated to the Gila River Valley around 300 B.C. This area, once part of Mexico, became U.S. territory in 1853. A Pima reservation was established in Arizona in 1859, where they practiced subsistence farming. However, by 1900, white settlers diverted water, disrupting their agriculture and lifestyle. With reduced physical labor and food scarcity, their diet became high in fat, leading to a sedentary lifestyle and a rise in type 2 diabetes (T2DM). The Arizona Pimas had the highest recorded prevalence of T2DM, prompting studies by the National Institute of Diabetes and Digestive and Kidney Diseases since 1965. The Pima Indians Diabetes Dataset captures genetic and environmental factors related to diabetes onset and complications.

Introduction to the Project

In this coding project, my goal was to build a machine learning model to predict whether a person has diabetes based on various medical features. Can machine learning help accurately predict if someone has diabetes?

To answer this, I started by loading the diabetes dataset into a Python environment using Google Colab. Then I cleaned the data, handled missing values, and split the dataset into training and testing sets.

Why is it important to preprocess data before training machine learning models? Proper preprocessing ensures the models learn effectively from the data. After preprocessing, I selected three machine learning models:

  • Logistic Regression
  • Random Forest
  • Support Vector Machine (SVM)

and trained each on the training set. Next, I evaluated their performance using metrics such as accuracy, precision, recall, and F1-score. Which machine learning model provides the best predictions for diabetes?

To further improve the predictions, I performed hyperparameter tuning. Finally, I visualized relationships between different features, helping to better understand factors contributing to diabetes. This project provided valuable insights into the complete process of creating and evaluating machine learning models.
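
As a compact preview of that workflow, the sketch below (an illustration added here, not the code I actually ran) chains scaling and Logistic Regression in a single scikit-learn Pipeline and estimates accuracy with 5-fold cross-validation; it assumes the same diabetes.csv file and skips the zero-value cleaning described later.

# Minimal end-to-end sketch of the workflow described above (illustration only).
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv('diabetes.csv')
features, target = data.drop('Outcome', axis=1), data['Outcome']

# Chaining the scaler and the classifier means the scaler is re-fit inside every CV fold,
# which avoids leaking test-fold statistics into training.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
scores = cross_val_score(pipeline, features, target, cv=5, scoring='accuracy')
print("Cross-validated accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))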

Data Loading, Exploration, and Preprocessing

In this first step, I imported essential Python libraries like pandas, NumPy, seaborn, matplotlib, and tools from scikit-learn. I then loaded the diabetes dataset into a pandas DataFrame using Google Colab. After loading the dataset, I explored it by checking its shape, previewing its content, and calculating descriptive statistics.

To better understand the relationships between features, I visualized the correlations using a correlation matrix and created pairwise scatter plots. Next, I cleaned the dataset by replacing unrealistic zero values in columns like Glucose, BloodPressure, SkinThickness, Insulin, and BMI with the median of each respective column.

After cleaning, I separated the features from the target column (“Outcome”), divided the dataset into training and testing sets, and applied feature scaling using StandardScaler to improve model performance. This careful preprocessing ensured the data was clean, well-understood, and ready for accurate modeling.

# Import Necessary Libraries
import pandas as pd # For data handling and manipulation
import numpy as np  # For numerical operations
from sklearn.model_selection import train_test_split  # For splitting data into training and test sets
from sklearn.preprocessing import StandardScaler      # For scaling feature values

Understanding the Data

# This loads the CSV file 'diabetes.csv' into a Pandas DataFrame.
df = pd.read_csv('diabetes.csv')
print("Dataset Shape:", df.shape)  # Display the dimensions of the dataset
print(df.head())                   # Display the first few rows to see what the data looks like
Dataset Shape: (768, 9)
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
import plotly.express as px

def plot_column_distribution(df, column_name):
    fig = px.histogram(df, x=column_name)
    fig.update_layout(
        xaxis_title="Value",
        yaxis_title="Frequency",
        title=column_name,
        showlegend=False,
        font=dict(size=14)
    )
    fig.show()

for col in df.columns[1:-2]:
    plot_column_distribution(df, col)
[Interactive Plotly histograms of each feature's distribution were produced here; they do not render in this text export.]
# Understand general aspects on the data
df.describe()
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  \
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000

       DiabetesPedigreeFunction         Age     Outcome
count                768.000000  768.000000  768.000000
mean                   0.471876   33.240885    0.348958
std                    0.331329   11.760232    0.476951
min                    0.078000   21.000000    0.000000
25%                    0.243750   24.000000    0.000000
50%                    0.372500   29.000000    0.000000
75%                    0.626250   41.000000    1.000000
max                    2.420000   81.000000    1.000000
# Calculate the correlation matrix for all columns in the dataset
corr = df.corr()
corr
                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548  0.292695

                          DiabetesPedigreeFunction       Age   Outcome
Pregnancies                              -0.033523  0.544341  0.221898
Glucose                                   0.137337  0.263514  0.466581
BloodPressure                             0.041265  0.239528  0.065068
SkinThickness                             0.183928 -0.113970  0.074752
Insulin                                   0.185071 -0.042163  0.130548
BMI                                       0.140647  0.036242  0.292695
DiabetesPedigreeFunction                  1.000000  0.033561  0.173844
Age                                       0.033561  1.000000  0.238356
Outcome                                   0.173844  0.238356  1.000000
# Show correlation matrix with background colors
corr.style.background_gradient(cmap = 'coolwarm')
[The same correlation matrix rendered with a 'coolwarm' background gradient; the colour styling does not reproduce in this export, and the values match the table above.]
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('diabetes.csv')
print("Dataset loaded. Shape:", df.shape)

# Set the visual style for the pairplot
sns.set(style="whitegrid", context="notebook")

# Create a pairplot for the dataset
# 'hue' colors points according to the 'Outcome' variable
# 'diag_kind' set to 'kde' will plot kernel density estimates on the diagonal
pair_plot = sns.pairplot(df, hue='Outcome', diag_kind='kde', corner=True)

# Adjust the overall plot title and layout
pair_plot.fig.suptitle("Pairwise Scatter Plots for Every Feature", y=1.02)

# Display the plot
plt.show()
Dataset loaded. Shape: (768, 9)

Cleaning the Data

I cleaned the data by identifying any missing or unrealistic values, particularly zeros in columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI. I replaced these zeros with NaN (Not a Number) and then filled them with each column’s median value to avoid distorting the overall distribution. This step was crucial to ensure the models trained on accurate, representative data rather than invalid entries.

# Data Cleaning: Handling Missing Values
# Some features in the diabetes dataset are recorded as 0, which is not a realistic value.
# Here, we define a list of columns where a value of 0 likely represents missing data.
columns_with_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for column in columns_with_zero:
    # Replace 0 with NaN (Not a Number) to indicate missing values
    df[column] = df[column].replace(0, np.nan)
    # Calculate the median of the column (ignoring NaN values)
    median_val = df[column].median()
    # Fill missing values (NaN) with the column's median value
    df[column] = df[column].fillna(median_val)
    print(f"Replaced missing values in {column} with median value {median_val}")
Replaced missing values in Glucose with median value 117.0
Replaced missing values in BloodPressure with median value 72.0
Replaced missing values in SkinThickness with median value 29.0
Replaced missing values in Insulin with median value 125.0
Replaced missing values in BMI with median value 32.3

Feature Scaling

I applied feature scaling using StandardScaler to ensure that all features operated on a similar scale. This process subtracts the mean and divides by the standard deviation for each feature, preventing features with large numeric ranges from overshadowing those with smaller ranges. As a result, the training process becomes more stable and efficient for algorithms that are sensitive to differences in feature magnitudes.

# Feature and Target Separation
# 'Outcome' is the target column (1 indicates diabetic, 0 indicates non-diabetic)
# All other columns are considered as features.
X = df.drop('Outcome', axis=1) # Features: Drop the target column from the DataFrame
y = df['Outcome']              # Target: Keep the 'Outcome' column
# Feature Scaling (applied after the split below)
# Scaling helps many machine learning algorithms perform better.
# StandardScaler transforms features by removing the mean and scaling to unit variance.
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

# Feature Scaling using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit and transform the training data
X_test = scaler.transform(X_test)        # Transform the test data using the same scaler

print("Feature scaling complete!")
Training set shape: (614, 8)
Test set shape: (154, 8)
Feature scaling complete!
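
As a quick sanity check (an addition for illustration), the scaled training features should now have a mean of roughly 0 and a standard deviation of roughly 1 in every column:

# Verify the effect of StandardScaler on the training set
print("Per-feature means (train):", np.round(X_train.mean(axis=0), 3))
print("Per-feature standard deviations (train):", np.round(X_train.std(axis=0), 3))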

Model Selection and Training

After preparing the data, I moved on to selecting and training three different machine learning models:

  • Logistic Regression
  • Random Forest
  • Support Vector Machine (SVM)

First, I set up each model with a random seed for consistent results. Then, I trained them on the cleaned training data by calling fit, so they could learn patterns in the features.

Finally, I generated predictions on the unseen test data and evaluated each model’s performance by calculating metrics such as accuracy and examining detailed classification reports. This process let me compare how well each model could predict diabetes and identify any strengths or weaknesses in their classifications.

# Import necessary libraries for models and evaluation metrics
from sklearn.linear_model import LogisticRegression         # For Logistic Regression model
from sklearn.ensemble import RandomForestClassifier         # For Random Forest model
from sklearn.svm import SVC                                 # For Support Vector Machine model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Logistic Regression

Logistic Regression is a simple yet powerful linear model that estimates the probability of a binary outcome (in this case, diabetes). I chose it because it is straightforward to implement, highly interpretable, and tends to perform well on smaller, relatively clean datasets.

# Train Logistic Regression Model
# Create an instance of the Logistic Regression classifier
log_reg = LogisticRegression(random_state=42)
# Train (fit) the model on the training data
log_reg.fit(X_train, y_train)
LogisticRegression(random_state=42)
# Make predictions on the test set
y_pred_log = log_reg.predict(X_test)
# Evaluate the model's performance using accuracy score and a detailed classification report
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print("Classification Report for Logistic Regression:")
print(classification_report(y_test, y_pred_log))
Logistic Regression Accuracy: 0.7532467532467533
Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.80      0.83      0.81        99
           1       0.67      0.62      0.64        55

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.73       154
weighted avg       0.75      0.75      0.75       154

Random Forest Classifier

Random Forest is an ensemble of decision trees that reduces overfitting by averaging the outputs of many trees. I chose it because it can capture complex, non-linear relationships in the data and also reports feature importances, making it easier to see which predictors have the biggest impact on diabetes risk.

# Train Random Forest Classifier
# Create an instance of the Random Forest classifier
rf_clf = RandomForestClassifier(random_state=42)
# Train the model on the training data
rf_clf.fit(X_train, y_train)
RandomForestClassifier(random_state=42)
# Make predictions on the test set
y_pred_rf = rf_clf.predict(X_test)

# Evaluate performance of the Random Forest model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report for Random Forest:")
print(classification_report(y_test, y_pred_rf))
Random Forest Accuracy: 0.7337662337662337
Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.80      0.79      0.79        99
           1       0.62      0.64      0.63        55

    accuracy                           0.73       154
   macro avg       0.71      0.71      0.71       154
weighted avg       0.73      0.73      0.73       154

Support Vector Machine (SVM)

SVM attempts to find the optimal boundary (or hyperplane) that separates the classes. I included it because it performs well in high-dimensional spaces and, with an appropriate kernel, can handle both linear and non-linear decision boundaries, making it a versatile choice for a classification task like diabetes prediction.

# Train Support Vector Machine (SVM) Model
# Create an instance of the SVM classifier
svm_clf = SVC(random_state=42)
# Train the SVM on the training data
svm_clf.fit(X_train, y_train)
SVC(random_state=42)
# Make predictions on the test set
y_pred_svm = svm_clf.predict(X_test)

Evaluate SVM

To evaluate my Support Vector Machine model, I computed its accuracy on the test set and viewed a detailed classification report. The accuracy metric showed the percentage of correct predictions, while the classification report provided precision, recall, and F1-score for both classes. This detailed breakdown helped me understand how well the SVM predicted diabetic versus non-diabetic outcomes.

# Evaluate performance of the SVM model
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Classification Report for SVM:")
print(classification_report(y_test, y_pred_svm))
SVM Accuracy: 0.7467532467532467
Classification Report for SVM:
              precision    recall  f1-score   support

           0       0.78      0.84      0.81        99
           1       0.67      0.58      0.62        55

    accuracy                           0.75       154
   macro avg       0.72      0.71      0.72       154
weighted avg       0.74      0.75      0.74       154

Confusion Matrix

A confusion matrix is a table that displays the performance of a classification model by showing the number of correct and incorrect predictions. It organizes data into four key categories: true positives, false positives, true negatives, and false negatives, providing a clear visual summary of how many instances were classified correctly and where the misclassifications occurred.

I used the confusion matrix because it offers a deeper understanding of the model’s performance beyond simple accuracy. In scenarios such as diabetes prediction, it’s important to know not just how many predictions are right, but also the types of errors made. This detailed insight helps in assessing the impact of misclassifications, which is especially critical when false negatives or false positives could have serious consequences in a medical diagnosis.

# Import necessary libraries for evaluation and visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, classification_report, accuracy_score
# The following variables are assumed to be already defined:
# - y_test: True labels for the test set
# - y_pred_log: Predictions from the Logistic Regression model
# - y_pred_rf: Predictions from the Random Forest classifier
# - y_pred_svm: Predictions from the SVM model

# Create a dictionary to hold model names and their predictions for easy iteration
evaluation_models = {
    "Logistic Regression": y_pred_log,
    "Random Forest": y_pred_rf,
    "SVM": y_pred_svm
}
# Plot a confusion matrix for each model
for model_name, predictions in evaluation_models.items():
    cm = confusion_matrix(y_test, predictions)
    plt.figure()
    sns.heatmap(cm, annot=True, fmt="d", cmap="viridis")
    plt.title(f"{model_name} Confusion Matrix")
    plt.xlabel("Predicted Labels")
    plt.ylabel("Actual Labels")
    plt.show()
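
Beyond the heatmaps, the four cells can also be read off directly. The short sketch below (added for illustration) unpacks the Logistic Regression confusion matrix into the true negative, false positive, false negative, and true positive counts discussed above.

# Unpack the Logistic Regression confusion matrix into its four cells.
# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_log).ravel()
print(f"True negatives: {tn}   False positives: {fp}")
print(f"False negatives: {fn}   True positives: {tp}")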

Additional Evaluation Metrics

This section prints detailed classification reports for the Logistic Regression, Random Forest, and SVM models. These reports display key performance metrics such as precision, recall, F1-score, and support for each class. This information gives a deeper insight into how each model performs on both diabetic and non-diabetic cases, offering more context than overall accuracy alone.

# Additional Evaluation Metrics
# Print classification reports for a detailed performance summary
print("Classification Report for Logistic Regression:")
print(classification_report(y_test, y_pred_log))
print("Classification Report for Random Forest:")
print(classification_report(y_test, y_pred_rf))
print("Classification Report for SVM:")
print(classification_report(y_test, y_pred_svm))
Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.80      0.83      0.81        99
           1       0.67      0.62      0.64        55

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.73       154
weighted avg       0.75      0.75      0.75       154

Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.80      0.79      0.79        99
           1       0.62      0.64      0.63        55

    accuracy                           0.73       154
   macro avg       0.71      0.71      0.71       154
weighted avg       0.73      0.73      0.73       154

Classification Report for SVM:
              precision    recall  f1-score   support

           0       0.78      0.84      0.81        99
           1       0.67      0.58      0.62        55

    accuracy                           0.75       154
   macro avg       0.72      0.71      0.72       154
weighted avg       0.74      0.75      0.74       154
# ROC Curve and ROC AUC
# Calculate the predicted probabilities for the positive class.
# (Logistic Regression inherently supports probability estimates.)
y_prob_log = log_reg.predict_proba(X_test)[:, 1]
# Compute the ROC AUC score
roc_auc_log = roc_auc_score(y_test, y_prob_log)
print("Logistic Regression ROC AUC:", roc_auc_log)
Logistic Regression ROC AUC: 0.82277318640955

ROC Curve Graph for Logistic Regression

This graph shows the ROC (Receiver Operating Characteristic) curve for my Logistic Regression model. The curve helps visualize how well the model can separate diabetic and non-diabetic cases. The blue line represents the model's performance, while the red dashed line shows what random guessing would look like. The closer the blue line is to the top-left corner, the better the model is at making correct predictions. The AUC (Area Under the Curve) is 0.82, which means that if one diabetic and one non-diabetic patient are picked at random, the model assigns the diabetic patient a higher predicted risk about 82% of the time. This indicates that the model separates the two classes well for this classification task.

# Generate the ROC curve data: false positive rates and true positive rates
fpr, tpr, thresholds = roc_curve(y_test, y_prob_log)
# Plot the ROC curve for Logistic Regression
plt.figure()
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {roc_auc_log:.2f})')
plt.plot([0, 1], [0, 1], 'r--')  # Dashed diagonal line as a reference (random classifier)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression')
plt.legend(loc="lower right")
plt.show()
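
The same comparison can be extended to the other two models. The sketch below (added for illustration, so no output is shown) uses the Random Forest's predicted probabilities and the SVM's decision_function scores, since the initial SVM was not trained with probability=True.

# Compare ROC AUC across all three models (illustrative extension).
y_prob_rf = rf_clf.predict_proba(X_test)[:, 1]   # Random Forest probability of the diabetic class
svm_scores = svm_clf.decision_function(X_test)   # SVM decision scores also work with roc_auc_score

print("Logistic Regression ROC AUC:", roc_auc_score(y_test, y_prob_log))
print("Random Forest ROC AUC:", roc_auc_score(y_test, y_prob_rf))
print("SVM ROC AUC:", roc_auc_score(y_test, svm_scores))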

Hyperparameters

Hyperparameters are settings that control how a machine learning model learns from the data. Unlike regular model parameters, they are not learned during training but are set before the model is run. I used hyperparameter tuning to improve the performance of my models by finding the best combination of settings, such as regularization strength or the number of decision trees. This helps the models generalize better to new, unseen data and reduces the chances of overfitting or underfitting. Using tools like GridSearchCV allowed me to test different values systematically and choose the ones that gave the highest accuracy.
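
To make the distinction concrete, here is a tiny illustrative example (not from the original notebook): C is a hyperparameter we set before training, whereas the coefficients are parameters the model learns during fit.

# Hyperparameter vs. learned parameter (illustration only).
example_model = LogisticRegression(C=0.1, random_state=42)  # C is chosen by us, before training
example_model.fit(X_train, y_train)
print("Hyperparameter C:", example_model.C)          # still the value we set
print("Learned coefficients:", example_model.coef_)  # estimated from the data during fit()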

# Import necessary tools for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

Hyperparameter Tuning for Logistic Regression

To improve the performance of my Logistic Regression model, I used hyperparameter tuning with GridSearchCV. I tested different values for the regularization strength parameter C, which controls how much the model tries to avoid overfitting. I used only the ‘l2’ penalty, which is commonly used with Logistic Regression. GridSearchCV ran multiple combinations using 5-fold cross-validation to find the best settings. The model with the best parameters was then evaluated on the test set, and I used its accuracy score to see how well the tuned model performed compared to the original. This process helped me get the most out of the Logistic Regression model.

# Hyperparameter Tuning for Logistic Regression

# Define a parameter grid for Logistic Regression:
# - 'C' is the inverse of regularization strength.
# - We use only the 'l2' penalty (note: if you use 'l1', be sure to choose a compatible solver).
param_grid_log = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l2']
}
# Instantiate GridSearchCV for Logistic Regression.
# Use solver 'liblinear', which works well with small datasets and supports l2 regularization.
log_grid = GridSearchCV(
    LogisticRegression(random_state=42, solver='liblinear'),
    param_grid_log,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy'   # Use accuracy as the performance metric
)
# Fit on the training data
log_grid.fit(X_train, y_train)
GridSearchCV(cv=5,
             estimator=LogisticRegression(random_state=42, solver='liblinear'),
             param_grid={'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l2']},
             scoring='accuracy')
# Display the best parameters and best cross-validation score
print("Logistic Regression best parameters:", log_grid.best_params_)
print("Logistic Regression best CV score:", log_grid.best_score_)
Logistic Regression best parameters: {'C': 1, 'penalty': 'l2'}
Logistic Regression best CV score: 0.7687724910035986
# Retrieve the best Logistic Regression model and evaluate on the test set
best_log = log_grid.best_estimator_
y_pred_best_log = best_log.predict(X_test)
print("Test set accuracy for best Logistic Regression:", accuracy_score(y_test, y_pred_best_log))
Test set accuracy for best Logistic Regression: 0.7532467532467533

Hyperparameter Tuning for Random Forest

For the Random Forest model, I used GridSearchCV to tune important hyperparameters that control how the model builds and combines decision trees. I tested different values for n_estimators (the number of trees), max_depth (how deep each tree can go), and min_samples_split (the minimum number of samples needed to split a node). By using 5-fold cross-validation, GridSearchCV helped me find the combination that gave the best accuracy. This tuning process allowed the Random Forest model to perform better on the test data by finding the right balance between complexity and generalization.

# Define a parameter grid for Random Forest:
param_grid_rf = {
    'n_estimators': [50, 100, 200],       # Number of trees in the forest
    'max_depth': [None, 5, 10],             # Maximum depth of the trees
    'min_samples_split': [2, 5]             # Minimum number of samples required to split an internal node
}
# Instantiate GridSearchCV for Random Forest
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_rf,
    cv=5,
    scoring='accuracy'
)
# Fit on the training data
rf_grid.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [None, 5, 10],
                         'min_samples_split': [2, 5],
                         'n_estimators': [50, 100, 200]},
             scoring='accuracy')
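For parity with the other two models, the tuned Random Forest's chosen settings can be displayed in the same way (a small addition; output not shown here):

# Display the best parameters and best cross-validation score for Random Forest
print("Random Forest best parameters:", rf_grid.best_params_)
print("Random Forest best CV score:", rf_grid.best_score_)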
# Retrieve the best Random Forest model and evaluate on the test set
best_rf = rf_grid.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)
print("Test set accuracy for best Random Forest:", accuracy_score(y_test, y_pred_best_rf))
Test set accuracy for best Random Forest: 0.7337662337662337

Hyperparameter Tuning for SVM

For the SVM model, I used GridSearchCV to explore different combinations of key hyperparameters. I tested values for C (which controls regularization), different kernel types (linear and rbf), and gamma (which affects how far the influence of a single training example reaches). I also set probability=True so I could later use the model to generate probabilities for ROC curve evaluation. Using 5-fold cross-validation, GridSearchCV helped identify the best settings that gave the highest accuracy. This tuning improved the SVM’s ability to separate the classes effectively on the test data.

# For SVM, need to set probability=True if you plan to use predict_proba later (e.g., for ROC curves).
param_grid_svm = {
    'C': [0.1, 1, 10],                # Regularization parameter
    'kernel': ['linear', 'rbf'],      # Try a linear kernel and a radial basis function (RBF)
    'gamma': ['scale', 'auto']        # Kernel coefficient options for 'rbf'
}
# Instantiate GridSearchCV for SVM
svm_grid = GridSearchCV(
    SVC(probability=True, random_state=42),
    param_grid_svm,
    cv=5,
    scoring='accuracy'
)
# Fit on the training data
svm_grid.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=SVC(probability=True, random_state=42),
             param_grid={'C': [0.1, 1, 10], 'gamma': ['scale', 'auto'],
                         'kernel': ['linear', 'rbf']},
             scoring='accuracy')
# Display the best parameters and best cross-validation score
print("\nSVM best parameters:", svm_grid.best_params_)
print("SVM best CV score:", svm_grid.best_score_)

SVM best parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
SVM best CV score: 0.765480474476876
# Retrieve the best SVM model and evaluate on the test set
best_svm = svm_grid.best_estimator_
y_pred_best_svm = best_svm.predict(X_test)
print("Test set accuracy for best SVM:", accuracy_score(y_test, y_pred_best_svm))
Test set accuracy for best SVM: 0.7532467532467533

Saving the Model and Comparing Performance

In this section, I used the pickle library to save the model I selected for reuse, the Random Forest model (which achieved the best recall for the diabetic class), to a file named best_model.pkl. Saving the model is useful because it allows me to reuse it later without retraining. I can load it back into memory with pickle.load() whenever needed.

After saving the model, I calculated key evaluation metrics (accuracy, precision, recall, and F1-score) for each of the three models: Logistic Regression, Random Forest, and SVM. These metrics give a detailed overview of how well each model performed on the test data, and I organized the results into a DataFrame for easy comparison.

Finally, I created a simple bar chart using matplotlib to visually compare the accuracy of each model. This plot helps quickly identify which model achieved the highest performance, making it easier to choose the most effective one for the diabetes prediction task.

import pickle

# Save the best Random Forest model
with open('best_model.pkl', 'wb') as file:
    pickle.dump(best_rf, file)
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Calculate evaluation metrics for each model (using the best models or initial ones)
metrics = {
    'Model': ['Logistic Regression', 'Random Forest', 'SVM'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_best_log),
        accuracy_score(y_test, y_pred_best_rf),
        accuracy_score(y_test, y_pred_best_svm)
    ],
    'Precision': [
        precision_score(y_test, y_pred_best_log),
        precision_score(y_test, y_pred_best_rf),
        precision_score(y_test, y_pred_best_svm)
    ],
    'Recall': [
        recall_score(y_test, y_pred_best_log),
        recall_score(y_test, y_pred_best_rf),
        recall_score(y_test, y_pred_best_svm)
    ],
    'F1 Score': [
        f1_score(y_test, y_pred_best_log),
        f1_score(y_test, y_pred_best_rf),
        f1_score(y_test, y_pred_best_svm)
    ]
}

# Create a DataFrame and display it
summary_df = pd.DataFrame(metrics)
print(summary_df)
                 Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.753247   0.666667  0.618182  0.641509
1        Random Forest  0.733766   0.625000  0.636364  0.630631
2                  SVM  0.753247   0.666667  0.618182  0.641509
# Plotting model accuracies as an example
plt.figure()
plt.bar(summary_df['Model'], summary_df['Accuracy'], color='skyblue')
plt.ylabel('Accuracy')
plt.title('Comparison of Model Accuracies')
plt.show()

Evaluation Metrics

Here, four bar charts show how Logistic Regression, Random Forest, and SVM perform on four evaluation metrics: Accuracy, Precision, Recall, and F1 Score.

From these bars, we can see that Logistic Regression and SVM both achieve the highest accuracy at about 0.753, with Random Forest slightly behind at 0.734. In terms of precision, Logistic Regression and SVM again lead at roughly 0.667 versus Random Forest’s 0.625, while Random Forest edges out the other two on recall (≈ 0.636 vs. ≈ 0.618). F1 scores mirror precision: Logistic Regression and SVM sit at approximately 0.642, and Random Forest at 0.631.

Overall, although Logistic Regression and SVM tie for top performance on most measures and Random Forest offers a modest recall boost, none of the models exceeds 0.80 on any metric, suggesting that further hyperparameter tuning or more advanced feature engineering may be needed to improve results.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# If your prediction arrays and y_test are not already defined, the following block simulates them.
# Remove or comment this block when you use your actual model predictions.
try:
    y_pred_best_log
except NameError:
    np.random.seed(42)
    # Simulate predictions for demonstration (replace size with your test set length)
    y_pred_best_log = np.random.choice([0, 1], size=154)
    y_pred_best_rf = np.random.choice([0, 1], size=154)
    y_pred_best_svm = np.random.choice([0, 1], size=154)

try:
    y_test
except NameError:
    y_test = np.random.choice([0, 1], size=154)

# Calculate evaluation metrics for each model dynamically from the predictions.
accuracy_values = [
    accuracy_score(y_test, y_pred_best_log),
    accuracy_score(y_test, y_pred_best_rf),
    accuracy_score(y_test, y_pred_best_svm)
]

precision_values = [
    precision_score(y_test, y_pred_best_log, zero_division=0),
    precision_score(y_test, y_pred_best_rf, zero_division=0),
    precision_score(y_test, y_pred_best_svm, zero_division=0)
]

recall_values = [
    recall_score(y_test, y_pred_best_log, zero_division=0),
    recall_score(y_test, y_pred_best_rf, zero_division=0),
    recall_score(y_test, y_pred_best_svm, zero_division=0)
]

f1_values = [
    f1_score(y_test, y_pred_best_log, zero_division=0),
    f1_score(y_test, y_pred_best_rf, zero_division=0),
    f1_score(y_test, y_pred_best_svm, zero_division=0)
]

# Create the DataFrame using the computed metric values.
models = ['Logistic Regression', 'Random Forest', 'SVM']
df_metrics = pd.DataFrame({
    'Model': models,
    'Accuracy': accuracy_values,
    'Precision': precision_values,
    'Recall': recall_values,
    'F1 Score': f1_values
})

print("Evaluation Metrics for Each Model:")
print(df_metrics)
Evaluation Metrics for Each Model:
                 Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.753247   0.666667  0.618182  0.641509
1        Random Forest  0.733766   0.625000  0.636364  0.630631
2                  SVM  0.753247   0.666667  0.618182  0.641509
# Example DataFrame with evaluation metrics for demonstration
# Replace these sample values with the actual metrics derived from your model predictions.
df_metrics = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'SVM'],
    'Accuracy': [0.753247, 0.733766, 0.753247],
    'Precision': [ 0.666667, 0.625000, 0.666667],
    'Recall': [0.618182, 0.636364, 0.618182],
    'F1 Score': [0.641509, 0.630631, 0.641509]
})

# Set 'Model' as the index for better plotting
df_metrics.set_index('Model', inplace=True)

# Create subplots for each metric
fig, axs = plt.subplots(2, 2, figsize=(12, 10))

# Plot Accuracy
df_metrics['Accuracy'].plot(kind='bar', ax=axs[0, 0], color='skyblue')
axs[0, 0].set_title('Model Accuracy')
axs[0, 0].set_ylim(0, 1)
axs[0, 0].grid(axis='y')

# Plot Precision
df_metrics['Precision'].plot(kind='bar', ax=axs[0, 1], color='lightgreen')
axs[0, 1].set_title('Model Precision')
axs[0, 1].set_ylim(0, 1)
axs[0, 1].grid(axis='y')

# Plot Recall
df_metrics['Recall'].plot(kind='bar', ax=axs[1, 0], color='salmon')
axs[1, 0].set_title('Model Recall')
axs[1, 0].set_ylim(0, 1)
axs[1, 0].grid(axis='y')

# Plot F1 Score
df_metrics['F1 Score'].plot(kind='bar', ax=axs[1, 1], color='orange')
axs[1, 1].set_title('Model F1 Score')
axs[1, 1].set_ylim(0, 1)
axs[1, 1].grid(axis='y')

# Adjust the layout for a neat appearance
plt.tight_layout()
plt.show()

Testing the ML Model

Using made-up figures for a non-existent person, I test whether the saved model classifies them as Diabetic or Non-Diabetic.

# Replace 'best_rf' with your actual trained model variable
with open('best_model.pkl', 'wb') as file:
    pickle.dump(best_rf, file)
from sklearn.preprocessing import StandardScaler

# Load model
with open('best_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Example fake person data
new_person = np.array([[2, 130, 80, 25, 100, 28.5, 0.5, 32]])

# Scale the new data before prediction.
# Note: as a shortcut, this refits a StandardScaler on the full feature matrix X instead of
# reusing the scaler fitted on X_train, and passing a plain NumPy array (with no column names)
# triggers the feature-name warning shown below.
scaler = StandardScaler().fit(X)  # X holds the original (unscaled) feature columns
new_person_scaled = scaler.transform(new_person)

# Prediction
prediction = loaded_model.predict(new_person_scaled)

# Result
print("Diabetic" if prediction[0] == 1 else "Non-Diabetic")
Diabetic
/usr/local/lib/python3.11/dist-packages/sklearn/utils/validation.py:2739: UserWarning:

X does not have valid feature names, but StandardScaler was fitted with feature names

Conclusion

In this project, I successfully built and evaluated three machine learning models (Logistic Regression, Random Forest, and Support Vector Machine) to predict diabetes using medical features.

Through careful data cleaning, feature scaling, and detailed analysis, the three models reached comparable performance: Logistic Regression and SVM achieved the highest test accuracy (about 0.75), while Random Forest, the model I saved for reuse, offered slightly better recall for the diabetic class. The ROC AUC of 0.82 obtained for Logistic Regression further demonstrated that the approach can reliably distinguish between diabetic and non-diabetic individuals, highlighting the practical value of this workflow.

Additionally, a quick test on a made-up patient record showed that the saved model can be reloaded and applied to new, unseen data. While there remains potential for future improvement through advanced techniques and further data collection, this project achieved its primary objective: developing an effective, interpretable, and reliable model for diabetes prediction.

References:

Machine Learning in Healthcare

  1. Deo, R.C., 2015. Machine learning in medicine. Circulation, 132(20), pp.1920–1930. https://pubmed.ncbi.nlm.nih.gov/26572668/
  2. Jiang, F., Jiang, Y., Zhi, H. et al., 2017. Artificial intelligence in healthcare: Past, present and future. Stroke and Vascular Neurology, 2(4), pp.230–243. https://svn.bmj.com/content/2/4/230
  3. Beam, A.L. and Kohane, I.S., 2018. Big data and machine learning in health care. JAMA, 319(13), pp.1317–1318. https://pubmed.ncbi.nlm.nih.gov/29532063/
  4. Ravi, D., Wong, C., Deligianni, F. et al., 2017. Deep learning for health informatics. IEEE Journal of Biomedical and Health Informatics, 21(1), pp.4–21. https://pubmed.ncbi.nlm.nih.gov/28055930/

Model Evaluation & Metrics

  1. Saito, T. and Rehmsmeier, M., 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10(3), p.e0118432. https://pubmed.ncbi.nlm.nih.gov/25738806/
  2. Chicco, D. and Jurman, G., 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, p.6. https://pubmed.ncbi.nlm.nih.gov/31898477/
  3. Japkowicz, N. and Stephen, S., 2002. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), pp.429–449. https://dl.acm.org/doi/10.5555/1293951.1293954

Algorithms Used

  1. Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X., 2013. Applied logistic regression. 3rd ed. Hoboken, NJ: Wiley. https://dl.icdst.org/pdfs/files4/7751d268eb7358d3ca5bd88968d9227a.pdf
  2. Breiman, L., 2001. Random forests. Machine Learning, 45, pp.5–32. https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
  3. Cortes, C. and Vapnik, V. (1995) Support-Vector Networks. Machine Learning, 20, 273-297. http://dx.doi.org/10.1007/BF00994018
  4. Pedregosa, F., Varoquaux, G., Gramfort, A. et al., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, pp.2825–2830. https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf

Diabetes-Specific ML Studies

  1. Kavakiotis, I., Tsave, O., Salifoglou, A. et al., 2017. Machine learning and data mining methods in diabetes research. Computational and Structural Biotechnology Journal, 15, pp.104–116. https://doi.org/10.1016/j.csbj.2016.12.005
  2. Sisodia, D. and Sisodia, D.S., 2018. Prediction of diabetes using classification algorithms. Procedia Computer Science, 132, pp.1578–1585. https://doi.org/10.1016/j.procs.2018.05.122
  3. Kaur, H. and Kumari, V. (2022), “Predictive modelling and analytics for diabetes using a machine learning approach”, Applied Computing and Informatics, Vol. 18 No. 1/2, pp. 90-100. https://doi.org/10.1016/j.aci.2018.12.004
  4. Patel, P.D. and Prajapati, H.B., 2019. A comparative study of classification algorithms for prediction of diabetes. International Journal of Computer Applications, 177(38), pp.1–6. https://www.researchgate.net/publication/328020082_A_Comparative_Analysis_on_the_Evaluation_of_Classification_Algorithms_in_the_Prediction_of_Diabetes
  5. Naz, H., Ahuja, S. Deep learning approach for diabetes prediction using PIMA Indian dataset. J Diabetes Metab Disord 19, 391–403 (2020). https://doi.org/10.1007/s40200-020-00520-5

Hyperparameter Tuning

  1. Bergstra, J. and Bengio, Y., 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1), pp.281–305. https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
  2. Claesen, M. and De Moor, B., 2015. Hyperparameter search in machine learning. arXiv preprint arXiv:1502.02127.

Handling Imbalanced Data

  1. H. He and E. A. Garcia, “Learning from Imbalanced Data,” in IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, Sept. 2009, doi: 10.1109/TKDE.2008.239.
  2. Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P., 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, pp.321–357. https://doi.org/10.1613/jair.953

Model Explainability

  1. Ribeiro, M.T., Singh, S. and Guestrin, C., 2016. “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA: ACM, pp.1135–1144. https://dl.acm.org/doi/10.1145/2939672.2939778
  2. Lundberg, S.M. and Lee, S.-I., 2017. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems (NeurIPS), pp.4765–4774. https://papers.nips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf

Ethics & AI in Healthcare

  1. Mittelstadt, B. D., Allo, P., Taddeo, M., Wachter, S., & Floridi, L. (2016). The ethics of algorithms: Mapping the debate. Big Data & Society, 3(2). https://doi.org/10.1177/2053951716679679 (Original work published 2016)
  2. Price WN 2nd, Cohen IG. Privacy in the age of medical big data. Nat Med. 2019 Jan;25(1):37-43. doi: 10.1038/s41591-018-0272-7. Epub 2019 Jan 7. PMID: 30617331; PMCID: PMC6376961.
  3. Vayena E, Blasimme A, Cohen IG (2018) Machine learning in medicine: Addressing ethical challenges. PLoS Med 15(11): e1002689. https://doi.org/10.1371/journal.pmed.1002689

General ML Resource

  1. Géron, A., 2019. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. 2nd ed. Sebastopol, CA: O’Reilly Media. https://shashwatwork.github.io/assets/files/ml_ebook.pdf
  2. Nelson RG, Knowler WC, Kretzler M, Lemley KV, Looker HC, Mauer M, Mitch WE, Najafian B, Bennett PH. Pima Indian Contributions to Our Understanding of Diabetic Kidney Disease. Diabetes. 2021 Aug;70(8):1603-1616. doi: 10.2337/dbi20-0043. Epub 2021 Jul 20. PMID: 34285119; PMCID: PMC8385607.

Data Set: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database