Lecture 26: Logistic Regression

1. What & Why

Logistic Regression is a parametric model for binary classification (can be extended to multinomial). It models the probability that the target belongs to class 1 given input features. It's simple, fast, and interpretable — a frequent first-choice baseline for classification tasks.

2. Mathematical formulation

Start with a linear combination (score):

\( z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \)

Pass the score through the sigmoid (logistic) function to obtain a probability:

\( \sigma(z) = \dfrac{1}{1 + e^{-z}} \), so \( P(Y=1 \mid X) = \sigma(z) \).
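Inverting the sigmoid shows why this is called the logit model: the log-odds of class 1 are linear in the features,

\( \log \dfrac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)} = z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \)

which is the identity behind reading each coefficient as a change in log-odds.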

Decision rule (default threshold 0.5):

\( \hat{y} = \begin{cases} 1 & \text{if } \sigma(z) \ge 0.5 \\ 0 & \text{otherwise} \end{cases} \)

Log-loss (cross-entropy) objective:

\( J(\beta) = -\dfrac{1}{m}\sum_{i=1}^m \big[ y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}) \big] \)

We minimize \(J(\beta)\) — usually via numerical optimization (Gradient Descent, L-BFGS, or other solvers). Regularization is commonly added to discourage large coefficients: \( J_{reg} = J + \lambda \|\beta\|_1 \) for L1, or \( J_{reg} = J + \lambda \|\beta\|_2^2 \) for L2.
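To make the objective and its minimization concrete, here is a minimal gradient-descent sketch on synthetic data (NumPy only; the data, learning rate, and iteration count are illustrative assumptions; in practice use scikit-learn's solvers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic data: 200 samples, 2 features, labels drawn from a known model
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_beta = np.array([2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_beta)).astype(float)

Xb = np.hstack([np.ones((200, 1)), X])   # prepend an intercept column
beta = np.zeros(3)
lr = 0.1
for _ in range(2000):
    p = sigmoid(Xb @ beta)
    grad = Xb.T @ (p - y) / len(y)       # gradient of the average log-loss
    beta -= lr * grad

print('estimated beta:', beta)           # roughly recovers true_beta, intercept near 0
```

The gradient \( \nabla J = \frac{1}{m} X^\top(\sigma(X\beta) - y) \) used in the loop follows directly from differentiating the log-loss above.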

3. Interpretation
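Each coefficient \( \beta_j \) is the change in the log-odds of class 1 per one-unit increase in \( x_j \), holding the other features fixed; exponentiating turns it into an odds ratio. A minimal sketch on synthetic data (make_classification is used only so the snippet is self-contained):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# exp(beta_j): multiplicative change in the odds per unit increase in feature j
odds_ratios = np.exp(clf.coef_[0])
for j, oratio in enumerate(odds_ratios):
    print(f'feature {j}: odds ratio {oratio:.2f}')
```

An odds ratio above 1 means the feature pushes toward class 1; below 1, toward class 0.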

4. Decision boundary & visualization

For two features, logistic regression produces a linear decision boundary (a line); with more features, the boundary is a hyperplane. Adding polynomial or interaction features yields boundaries that are nonlinear in the original feature space.

Plot idea (Python): Fit model on 2 features and plot probability contour with scatter of classes.
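One way to realize that plot idea, on synthetic two-feature data (the Agg backend and output filename are assumptions so the script runs headless; drop them in a notebook):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=1)
clf = LogisticRegression().fit(X, y)

# evaluate P(Y=1) on a grid covering the data
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
proba = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, proba, levels=20, cmap='RdBu_r', alpha=0.6)
plt.contour(xx, yy, proba, levels=[0.5], colors='k')  # the linear 0.5 boundary
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
plt.xlabel('x1'); plt.ylabel('x2'); plt.title('P(Y=1) and the 0.5 boundary')
plt.savefig('logreg_boundary.png')
```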

5. Practical considerations

Scale features before fitting, and fit the scaler on the training split only. Use stratified splits to preserve the class ratio. For imbalanced classes, try class_weight='balanced' or tune the decision threshold, and judge the model by ROC-AUC or the precision-recall curve rather than accuracy alone. If the solver warns about non-convergence, raise max_iter.

6. Hands-on Example A — Diabetes classification (Pima dataset)

Steps below include data preparation, training, evaluation, calibration, and interpretation. Copy/paste and run in a Python environment (Jupyter/Colab).

# 1) Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix, classification_report, precision_recall_curve)
import matplotlib.pyplot as plt

# 2) Load data (example: Pima Indians Diabetes CSV)
data = pd.read_csv('diabetes.csv')   # columns include: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, Age, Outcome

X = data.drop('Outcome', axis=1)
y = data['Outcome']

# 3) Split (stratified to preserve class ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# 4) Preprocessing pipeline: scale features (fit on train only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# 5) Baseline model
clf = LogisticRegression(max_iter=1000, solver='liblinear')   # liblinear works well for small problems
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)
y_proba = clf.predict_proba(X_test_scaled)[:,1]

# 6) Evaluation
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1:', f1_score(y_test, y_pred))
print('ROC-AUC:', roc_auc_score(y_test, y_proba))
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# 7) Precision-Recall curve
prec, rec, thr = precision_recall_curve(y_test, y_proba)
plt.plot(rec, prec)
plt.xlabel('Recall'); plt.ylabel('Precision'); plt.title('Precision-Recall Curve'); plt.grid(True)
plt.show()

# 8) Coefficients (interpretable)
coef_df = pd.DataFrame({'feature': X.columns, 'coef': clf.coef_[0]})
coef_df.sort_values('coef', ascending=False, inplace=True)
print(coef_df)
    

Notes & tips: liblinear is a solid default solver for small datasets; remember the scaler must be fit on the training split only; if the classes are imbalanced, refit with class_weight='balanced' and compare precision-recall curves rather than accuracy.

7. Hands-on Example B — Sales: Predicting High vs Low sales

Use logistic regression to predict if next-month sales will be high (1) or low (0) based on marketing & seasonal features.

# sample synthetic example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2],
    'Radio': [37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6],
    'Newspaper': [69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6],
    'Season': ['Festive','Festive','Off','Off','Festive','Off','Off','Festive'],
    'HighSales': [1,1,0,0,1,0,0,1]
})

X = data.drop('HighSales', axis=1)
y = data['HighSales']

# Preprocessing + model pipeline
numeric_features = ['TV','Radio','Newspaper']
categorical_features = ['Season']

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),  # sparse_output replaces sparse in scikit-learn >= 1.2
     categorical_features)
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=500))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
pipe.fit(X_train, y_train)
print('Test accuracy:', pipe.score(X_test, y_test))
    

Extend: Use probabilities to rank customers likely to buy, or set threshold to balance precision vs recall according to business needs.
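Moving the threshold to trade precision against recall can be sketched like this (synthetic imbalanced data stands in for real sales features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# 20% positives to mimic a rarer "high sales" class
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

results = []
for thr in [0.3, 0.5, 0.7]:
    pred = (proba >= thr).astype(int)
    results.append((thr,
                    precision_score(y_te, pred, zero_division=0),
                    recall_score(y_te, pred)))
    print(f'threshold {thr}: precision {results[-1][1]:.2f}, recall {results[-1][2]:.2f}')
```

Lower thresholds catch more positives (higher recall) at the cost of precision; pick the point that matches the business cost of each error type.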

8. Regularization & Hyperparameter tuning

Regularization controls overfitting. In scikit-learn's LogisticRegression, C is the inverse of regularization strength (smaller C means stronger regularization) and penalty selects the norm:

# Grid search example
from sklearn.model_selection import GridSearchCV

param_grid = {
  'clf__C': [0.01, 0.1, 1, 10],
  'clf__penalty': ['l2'],
  'clf__solver': ['liblinear']
}

gs = GridSearchCV(pipe, param_grid, cv=3, scoring='roc_auc')   # cv=3 because the toy dataset is tiny; prefer cv=5 on real data
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
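To see what the penalty choice actually does, compare L1 and L2 at the same strength on synthetic data with mostly uninformative features (a sketch; the C value is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features but only 4 carry signal
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

print('exact-zero coefficients, L1:', int(np.sum(l1.coef_ == 0)))
print('exact-zero coefficients, L2:', int(np.sum(l2.coef_ == 0)))
```

L1 drives many coefficients exactly to zero (built-in feature selection), while L2 only shrinks them toward zero.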
    

9. Limitations & when to use

Logistic regression assumes the log-odds are linear in the features, so it underfits strongly nonlinear problems unless you engineer features (e.g. polynomial terms). Heavy multicollinearity makes coefficients unstable and hard to interpret. Use it when you want speed, probability outputs, and interpretable coefficients, and as the baseline to beat before trying more complex models.

10. Exercises

  1. Run the Pima experiment. Compare performance with & without scaling, and with class weighting.
  2. Calibrate probabilities and compare Brier score before/after calibration.
  3. Introduce polynomial features (degree 2) for two informative columns and observe decision boundary changes.
  4. Report feature coefficients and compute odds ratios; interpret top-3 features for diabetes risk.
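For exercise 2, a starting sketch (synthetic data; CalibratedClassifierCV with Platt-style sigmoid calibration is one option):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                             method='sigmoid', cv=5).fit(X_tr, y_tr)

b_raw = brier_score_loss(y_te, base.predict_proba(X_te)[:, 1])
b_cal = brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1])
print(f'Brier raw: {b_raw:.4f}  calibrated: {b_cal:.4f}')
```

Lower Brier is better; note that logistic regression is often already reasonably calibrated, so the gap may be small.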