Machine learning can feel intimidating, but the core mechanics are surprisingly approachable. In this guide, I'll walk you through training a disease prediction model step by step using Python's Scikit-learn library — the same workflow I used in my NephroSense project.
Setting Up Your Environment
We'll need Python 3.10+ and a few libraries:
pip install scikit-learn pandas numpy matplotlib seaborn jupyter
Create a virtual environment first:
python -m venv .venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
Loading and Exploring the Dataset
We'll use the Pima Indians Diabetes Dataset — a classic for binary classification. It contains 8 medical features and a target column (1 = diabetic, 0 = not).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('diabetes.csv')
print(df.shape) # (768, 9)
print(df.head())
print(df.describe())
print(df['Outcome'].value_counts())
Always run df.describe() and df.info() to understand distributions, missing values, and data types before doing anything else.
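One wrinkle with this particular dataset: missing values are encoded as zeros, so df.isnull().sum() reports nothing. A quick zero-count per column surfaces them instead. Here is a minimal sketch using a tiny stand-in frame in place of the full diabetes.csv:

```python
import pandas as pd

# Tiny stand-in for the DataFrame loaded from diabetes.csv.
df = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BloodPressure': [72, 66, 0],
    'Outcome': [1, 0, 1],
})

# Count zero entries per medical column -- these are hidden missing values,
# not real measurements.
zero_counts = (df[['Glucose', 'BloodPressure']] == 0).sum()
print(zero_counts)
```

Running the same check on the real dataset shows hundreds of zero entries in Insulin and SkinThickness, which motivates the imputation step below.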
Data Preprocessing
Raw data is never clean. Zero values in medical features like BloodPressure or
BMI are physiologically impossible — they represent missing data. We replace them
with the median:
from sklearn.impute import SimpleImputer
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)
imputer = SimpleImputer(strategy='median')
df[cols_with_zeros] = imputer.fit_transform(df[cols_with_zeros])
# Split features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# Train/test split (80/20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
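The split above doesn't stratify. For an imbalanced medical dataset it's often worth passing stratify=y so both splits preserve the class ratio; otherwise a random split can leave the test set with noticeably more or fewer positives. A minimal sketch, using synthetic data to stand in for the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples with ~35% positives, roughly the
# class balance of the Pima dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
y = np.array([1] * 35 + [0] * 65)

# stratify=y keeps the positive rate identical in train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())  # both 0.35
```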
Feature Scaling
Many ML algorithms are sensitive to feature magnitude. We standardize with
StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Only transform, never fit on test!
Note the distinction: we never call fit_transform on the test data. That would cause data leakage, because the model would "see" test statistics during training.
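If you want this discipline enforced automatically, scikit-learn's Pipeline chains the scaler and the model so the scaler is only ever fit on training data. A minimal sketch with synthetic arrays standing in for the real splits:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real train/test splits.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 8))
y_train = rng.integers(0, 2, size=80)
X_test = rng.normal(size=(20, 8))

# fit() fits the scaler on training data only; predict() applies
# transform() to the test data, never fit().
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
print(preds[:5])
```

A pipeline also travels well: you can save the whole thing with joblib and skip juggling a separate scaler file.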
Training Multiple Models
A good ML workflow tries several algorithms and picks the best:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True)
}
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, preds)
    results[name] = acc
    print(f"{name}: {acc:.4f}")
# Gradient Boosting: 0.8052 (usually wins here)
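One caveat: a single 80/20 hold-out score is noisy, so "the winner" can change with a different random_state. Cross-validation averages accuracy over several splits and gives a more stable comparison. A minimal sketch, using make_classification as a stand-in for the scaled training data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for X_train_scaled, y_train.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# 5-fold CV: train on 4 folds, validate on the 5th, rotate, and average.
scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=5)
print(f"{scores.mean():.4f} +/- {scores.std():.4f}")
```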
Evaluating Properly
Accuracy alone isn't enough for imbalanced medical datasets. Always check precision, recall, and the confusion matrix:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
best_model = models['Gradient Boosting']
preds = best_model.predict(X_test_scaled)
print(classification_report(y_test, preds))
cm = confusion_matrix(y_test, preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
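ROC AUC is another metric worth reporting for a screening model: it ranks the predicted probabilities rather than thresholding them at 0.5, so it captures how well the model separates the classes regardless of cutoff. A minimal sketch with toy labels and probabilities standing in for best_model.predict_proba(X_test_scaled)[:, 1]:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# AUC = fraction of (positive, negative) pairs ranked correctly.
auc = roc_auc_score(y_true, probs)
print(f"ROC AUC: {auc:.3f}")  # 8 of 9 pairs ranked correctly -> 0.889
```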
Saving and Loading the Model
import joblib
# Save
joblib.dump(best_model, 'diabetes_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
# Load later
model = joblib.load('diabetes_model.pkl')
scaler = joblib.load('scaler.pkl')
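At inference time, the loaded scaler must be applied to new patient features before calling predict, with the columns in the same order as training. The sketch below trains and saves a tiny stand-in model first so the round trip is runnable without the real pickle files; the patient values are hypothetical:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Train and save a small stand-in model and scaler, in place of the
# real artifacts produced by the tutorial's training run.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = rng.integers(0, 2, size=50)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)
joblib.dump(model, 'diabetes_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Inference: load both artifacts, scale with the SAME saved scaler,
# then predict a probability for one new patient (hypothetical values).
model = joblib.load('diabetes_model.pkl')
scaler = joblib.load('scaler.pkl')
patient = np.array([[2, 120, 70, 20, 80, 25.0, 0.5, 33]])
prob = model.predict_proba(scaler.transform(patient))[0, 1]
print(f"Predicted diabetes probability: {prob:.2f}")
```

Forgetting the scaling step is a classic deployment bug: the model silently receives raw feature magnitudes it never saw during training.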