Logistic regression models the probability of class membership using a sigmoid (binary) or softmax (multiclass) link. Training minimizes the cross-entropy loss. Optimization is commonly done via (stochastic) gradient descent, or via Newton-type methods such as IRLS when the dataset is small enough for matrix solves to be affordable.
Model: for input vector x (including intercept), define linear score z = wᵀx. Probability of class 1:
p(y=1|x) = σ(z) = 1 / (1 + e^{-z})
Cross-entropy loss (negative log-likelihood) for dataset {(xᵢ, yᵢ)}:
L(w) = - Σ [ yᵢ log σ(wᵀxᵢ) + (1-yᵢ) log(1 - σ(wᵀxᵢ)) ]
Gradient:
∇L(w) = Σ (σ(wᵀxᵢ) - yᵢ) xᵢ
Use this gradient in GD/SGD updates: w ← w - η ∇L(w).
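A quick way to validate the gradient formula is a central finite-difference check. The following is a self-contained sketch with made-up data; loss, grad, and the random inputs are illustrative names, not part of the later demo:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # intercept + 2 features
y = np.array([0, 1, 1, 0, 1], dtype=float)
w = rng.normal(size=3)

# central finite differences, one coordinate at a time
eps = 1e-6
num = np.array([
    (loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(num - grad(w, X, y))))  # max discrepancy should be tiny
```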
For K classes, param matrix W (K×p) gives scores z_k = w_kᵀ x. Softmax:
p(y=k|x) = exp(z_k) / Σ_j exp(z_j)
Loss (cross-entropy):
L(W) = - Σ_i Σ_k 1{yᵢ=k} log p(yᵢ=k | xᵢ)
Gradient per class: ∇_{w_k} = Σ_i (p(y=k|xᵢ) - 1{yᵢ=k}) xᵢ
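The softmax probabilities and per-class gradient above can be sketched in NumPy. Variable names here are illustrative; labels are assumed one-hot rows of Y:

```python
import numpy as np

def softmax(Z):
    # subtract the row max for numerical stability (does not change the result)
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_grad(W, X, Y):
    # W: K x p, X: n x p, Y: n x K one-hot labels; returns the K x p gradient
    P = softmax(X @ W.T)   # n x K class probabilities
    return (P - Y).T @ X   # row k is the per-class formula above

# tiny example: 3 classes, intercept column + 1 feature
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]])
Y = np.eye(3)              # each row is a one-hot label
W = np.zeros((3, 2))
G = softmax_grad(W, X, Y)
print(G.shape)             # (3, 2)
```

Note that the class gradients sum to zero across classes, since the rows of P - Y each sum to zero.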
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# X: n x p (include a column of ones for the intercept), y: n (0/1)
def logistic_gd(X, y, lr=0.1, iters=1000):
    n, p = X.shape
    w = np.zeros(p)
    for t in range(iters):
        z = X @ w
        preds = sigmoid(z)
        grad = X.T @ (preds - y)  # shape (p,)
        w -= lr * grad / n        # average-gradient step
    return w
from sklearn.linear_model import LogisticRegression

# L2-regularized logistic regression; lbfgs supports the l2 penalty
# (liblinear/saga are alternative solvers, e.g. saga for l1)
clf = LogisticRegression(penalty='l2', solver='lbfgs', C=1.0, max_iter=1000)
clf.fit(X_train, y_train)  # X_train, y_train: your training data
print(clf.coef_, clf.intercept_)
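A runnable version of the sklearn snippet, assuming scikit-learn is installed, on a small one-feature dataset. sklearn fits the intercept itself, so no column of ones is needed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [2.5], [1.0], [3.0]])  # one feature per row
y = np.array([0, 1, 0, 1])

clf = LogisticRegression(penalty='l2', solver='lbfgs', C=1.0, max_iter=1000)
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # P(y=1 | x) for each row
print(clf.coef_, clf.intercept_, probs.round(3))
```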
Newton update: w ← w - H^{-1} g, where g is gradient and H is Hessian. For logistic regression the Hessian is:
H = Xᵀ R X, where R is diagonal matrix with rᵢ = σ(zᵢ)(1-σ(zᵢ))
IRLS solves at each step (Xᵀ R X) Δw = Xᵀ (y - p) and updates w ← w + Δw; this is exactly the Newton step, since the gradient is Xᵀ(p - y). It converges in few iterations, but each iteration forms Xᵀ R X in O(np²) and solves a p×p system in O(p³), which becomes expensive for large p.
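A compact IRLS sketch. This is illustrative, not a production solver: a tiny ridge term keeps Xᵀ R X invertible, and the data is chosen to be non-separable so the unregularized MLE stays finite (on separable data the weights diverge):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_irls(X, y, iters=25, ridge=1e-8):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        prob = sigmoid(X @ w)
        r = prob * (1 - prob)                     # diagonal of R
        H = X.T @ (r[:, None] * X) + ridge * np.eye(p)
        step = np.linalg.solve(H, X.T @ (y - prob))
        w += step
        if np.max(np.abs(step)) < 1e-10:          # converged
            break
    return w

# non-separable 1D data: the classes overlap around x = 1-2
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
w = logistic_irls(X, y)
print(w)  # finite weights, positive slope
```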
Toy dataset:
X = np.array([[1, 0.5], [1, 2.5], [1, 1.0], [1, 3.0]])  # first column: intercept
y = np.array([0, 1, 0, 1])
Run logistic_gd and inspect w (intercept & slope). Predict probabilities via sigmoid(X @ w).
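Put together as one runnable script (repeating the sigmoid and logistic_gd definitions from above so it is self-contained; lr and iters here are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, lr=0.1, iters=1000):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y)  # cross-entropy gradient
        w -= lr * grad / n                 # average-gradient step
    return w

X = np.array([[1.0, 0.5], [1.0, 2.5], [1.0, 1.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])

w = logistic_gd(X, y, lr=0.5, iters=5000)
probs = sigmoid(X @ w)
print(w, probs.round(3))  # probabilities should come out low, high, low, high
```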
Demo: enter a small one-feature toy dataset, one example per line as feature_value,label (0 or 1). The demo fits w0 + w1 * x using batch gradient descent and prints progress over iterations.