Machine learning can feel intimidating, but the core mechanics are surprisingly approachable. In this guide, I'll walk you through training a disease prediction model step by step using Python's Scikit-learn library — the same workflow I used in my NephroSense project.
Setting Up Your Environment
We'll need Python 3.10+ and a few libraries:
pip install scikit-learn pandas numpy matplotlib seaborn jupyter
Create a virtual environment first:
python -m venv .venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
Loading and Exploring the Dataset
We'll use the Pima Indians Diabetes Dataset — a classic for binary classification. It contains 8 medical features and a target column (1 = diabetic, 0 = not).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('diabetes.csv')
print(df.shape) # (768, 9)
print(df.head())
print(df.describe())
print(df['Outcome'].value_counts())
Always run df.describe() and df.info() to understand distributions, missing values, and data types before doing anything else.
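One wrinkle with this particular dataset: missing values are encoded as zeros, so df.isnull().sum() reports nothing. A quick zero-count per column surfaces them instead. Here is a minimal sketch using a tiny stand-in frame in place of the full diabetes.csv:

```python
import pandas as pd

# Tiny stand-in for the DataFrame loaded from diabetes.csv.
df = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BloodPressure': [72, 66, 0],
    'Outcome': [1, 0, 1],
})

# Count zero entries per medical column -- these are hidden missing values,
# not real measurements.
zero_counts = (df[['Glucose', 'BloodPressure']] == 0).sum()
print(zero_counts)
```

Running the same check on the real dataset shows hundreds of zero entries in Insulin and SkinThickness, which motivates the imputation step below.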
Data Preprocessing
Raw data is never clean. Zero values in medical features like BloodPressure or
BMI are physiologically impossible — they represent missing data. We replace them
with the median:
from sklearn.impute import SimpleImputer
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)
imputer = SimpleImputer(strategy='median')
df[cols_with_zeros] = imputer.fit_transform(df[cols_with_zeros])
# Split features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# Train/test split (80/20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
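The split above doesn't stratify. For an imbalanced medical dataset it's often worth passing stratify=y so both splits preserve the class ratio; otherwise a random split can leave the test set with noticeably more or fewer positives. A minimal sketch, using synthetic data to stand in for the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples with ~35% positives, roughly the
# class balance of the Pima dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
y = np.array([1] * 35 + [0] * 65)

# stratify=y keeps the positive rate identical in train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())  # both 0.35
```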
Feature Scaling
Many ML algorithms are sensitive to feature magnitude. We standardize with
StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Only transform, never fit on test!
Note the distinction: we never call fit_transform on the test data. That would cause data leakage, because the model would "see" test statistics during training.
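If you want this discipline enforced automatically, scikit-learn's Pipeline chains the scaler and the model so the scaler is only ever fit on training data. A minimal sketch with synthetic arrays standing in for the real splits:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real train/test splits.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 8))
y_train = rng.integers(0, 2, size=80)
X_test = rng.normal(size=(20, 8))

# fit() fits the scaler on training data only; predict() applies
# transform() to the test data, never fit().
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
print(preds[:5])
```

A pipeline also travels well: you can save the whole thing with joblib and skip juggling a separate scaler file.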
Training Multiple Models
A good ML workflow tries several algorithms and picks the best:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True)
}
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, preds)
    results[name] = acc
    print(f"{name}: {acc:.4f}")
# Gradient Boosting: 0.8052 (usually wins here)
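One caveat: a single 80/20 hold-out score is noisy, so "the winner" can change with a different random_state. Cross-validation averages accuracy over several splits and gives a more stable comparison. A minimal sketch, using make_classification as a stand-in for the scaled training data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for X_train_scaled, y_train.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# 5-fold CV: train on 4 folds, validate on the 5th, rotate, and average.
scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=5)
print(f"{scores.mean():.4f} +/- {scores.std():.4f}")
```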
Evaluating Properly
Accuracy alone isn't enough for imbalanced medical datasets. Always check precision, recall, and the confusion matrix:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
best_model = models['Gradient Boosting']
preds = best_model.predict(X_test_scaled)
print(classification_report(y_test, preds))
cm = confusion_matrix(y_test, preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
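ROC AUC is another metric worth reporting for a screening model: it ranks the predicted probabilities rather than thresholding them at 0.5, so it captures how well the model separates the classes regardless of cutoff. A minimal sketch with toy labels and probabilities standing in for best_model.predict_proba(X_test_scaled)[:, 1]:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# AUC = fraction of (positive, negative) pairs ranked correctly.
auc = roc_auc_score(y_true, probs)
print(f"ROC AUC: {auc:.3f}")  # 8 of 9 pairs ranked correctly -> 0.889
```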
Saving and Loading the Model
import joblib
# Save
joblib.dump(best_model, 'diabetes_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
# Load later
model = joblib.load('diabetes_model.pkl')
scaler = joblib.load('scaler.pkl')
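At inference time, the loaded scaler must be applied to new patient features before calling predict, with the columns in the same order as training. The sketch below trains and saves a tiny stand-in model first so the round trip is runnable without the real pickle files; the patient values are hypothetical:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Train and save a small stand-in model and scaler, in place of the
# real artifacts produced by the tutorial's training run.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = rng.integers(0, 2, size=50)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)
joblib.dump(model, 'diabetes_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Inference: load both artifacts, scale with the SAME saved scaler,
# then predict a probability for one new patient (hypothetical values).
model = joblib.load('diabetes_model.pkl')
scaler = joblib.load('scaler.pkl')
patient = np.array([[2, 120, 70, 20, 80, 25.0, 0.5, 33]])
prob = model.predict_proba(scaler.transform(patient))[0, 1]
print(f"Predicted diabetes probability: {prob:.2f}")
```

Forgetting the scaling step is a classic deployment bug: the model silently receives raw feature magnitudes it never saw during training.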