Lecture 16 — Logistic Regression & Examples

Binary & Multiclass logistic regression, loss, optimization (GD, SGD, Newton/IRLS), regularization, evaluation & examples

Overview

Logistic regression models the probability of class membership using a sigmoid (binary) or softmax (multiclass). Training minimizes the cross-entropy loss. Optimization is commonly done via (stochastic) gradient descent, or via Newton-type methods (IRLS) when the dataset is small enough that the required matrix solves are affordable.

1. Binary Logistic Regression — Theory

Model: for input vector x (including intercept), define linear score z = wᵀx. Probability of class 1:

p(y=1|x) = σ(z) = 1 / (1 + e^{-z})

Cross-entropy loss (negative log-likelihood) for dataset {(xᵢ, yᵢ)}:

L(w) = - Σ [ yᵢ log σ(wᵀxᵢ) + (1-yᵢ) log(1 - σ(wᵀxᵢ)) ]
      

Gradient:

∇L(w) = Σ (σ(wᵀxᵢ) - yᵢ) xᵢ
      

Use this gradient in GD/SGD updates: w ← w - η ∇L(w).
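The stochastic variant applies the same per-example gradient one sample at a time. A minimal sketch (the data values below are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, xi, yi, lr=0.1):
    """One SGD update from a single example (xi, yi)."""
    grad_i = (sigmoid(xi @ w) - yi) * xi   # per-example gradient (σ(wᵀxᵢ) - yᵢ) xᵢ
    return w - lr * grad_i

# one pass over a tiny toy batch (first column is the intercept)
X = np.array([[1.0, 0.5], [1.0, 2.5]])
y = np.array([0.0, 1.0])
w = np.zeros(2)
for xi, yi in zip(X, y):
    w = sgd_step(w, xi, yi)
```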

2. Multiclass: Softmax & Cross-Entropy

For K classes, a parameter matrix W (K×p) gives per-class scores z_k = w_kᵀx. Softmax:

p(y=k|x) = exp(z_k) / Σ_j exp(z_j)
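Computing this naively overflows once the scores are large, because exp(z_k) exceeds floating-point range; the standard fix is to subtract the row maximum before exponentiating, which leaves the result unchanged. A sketch:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)   # softmax is invariant to this shift
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# the second row would overflow np.exp without the shift
P = softmax(np.array([[1.0, 2.0, 3.0], [1000.0, 1001.0, 1002.0]]))
```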
      

Loss (cross-entropy):

L(W) = - Σ_i Σ_k 1{yᵢ=k} log p(yᵢ=k | xᵢ)
      

Gradient per class: ∇_{w_k} = Σ_i (p(y=k|xᵢ) - 1{yᵢ=k}) xᵢ
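A batch gradient step built directly from this per-class formula, with W stored K×p as above (the three-point, three-class dataset is a hypothetical illustration):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_gd_step(W, X, Y, lr=0.1):
    """One batch GD step. W: K x p, X: n x p, Y: n x K one-hot labels."""
    P = softmax(X @ W.T)          # n x K class probabilities
    grad = (P - Y).T @ X          # K x p; row k is Σ_i (p(y=k|xᵢ) - 1{yᵢ=k}) xᵢ
    return W - lr * grad

# hypothetical toy batch: one example per class, intercept in column 0
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.eye(3)
W = np.zeros((3, 2))
for _ in range(200):
    W = multiclass_gd_step(W, X, Y)
```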

3. Optimization Methods

4. Numerical Recipes & Practical Tips

5. Code snippets (NumPy & scikit-learn)

NumPy — simple batch gradient descent (binary)

import numpy as np

def sigmoid(z):
    # np.exp(-z) can overflow for very negative z (warning only, result is 0);
    # scipy.special.expit is the numerically stable alternative
    return 1 / (1 + np.exp(-z))

# X: n x p (include column of ones for intercept), y: n (0/1)
def logistic_gd(X, y, lr=0.1, iters=1000):
    n, p = X.shape
    w = np.zeros(p)
    for t in range(iters):
        z = X @ w
        preds = sigmoid(z)
        grad = X.T @ (preds - y)    # shape (p,)
        w -= lr * grad / n
    return w
      

scikit-learn — quick fit

from sklearn.linear_model import LogisticRegression

# L2-regularized logistic regression; C is the inverse regularization strength
clf = LogisticRegression(penalty='l2', solver='lbfgs', C=1.0, max_iter=1000)
clf.fit(X_train, y_train)
print(clf.coef_, clf.intercept_)
      

6. Newton-Raphson / IRLS (brief)

Newton update: w ← w - H^{-1} g, where g is gradient and H is Hessian. For logistic regression the Hessian is:

H = Xᵀ R X, where R is diagonal matrix with rᵢ = σ(zᵢ)(1-σ(zᵢ))
      

IRLS solves at each step (Xᵀ R X) Δw = Xᵀ (y - p), where Xᵀ (y - p) is the gradient of the log-likelihood, and updates w ← w + Δw. It typically converges in a handful of iterations, but each iteration requires solving a p×p linear system, which becomes expensive when p (or n) is large.
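A compact NumPy sketch of the IRLS loop above; the small ridge term and the non-separable toy data are assumptions added here so the example has a finite MLE and the linear solves stay well-conditioned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_irls(X, y, iters=10, ridge=1e-6):
    """Newton-Raphson / IRLS for binary logistic regression.
    ridge adds a tiny diagonal so Xᵀ R X stays invertible."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        prob = sigmoid(X @ w)
        r = prob * (1.0 - prob)                          # diagonal of R: rᵢ = σ(zᵢ)(1-σ(zᵢ))
        H = X.T @ (X * r[:, None]) + ridge * np.eye(p)   # Hessian Xᵀ R X (+ ridge)
        g = X.T @ (y - prob)                             # gradient of the log-likelihood
        w += np.linalg.solve(H, g)                       # Δw solves H Δw = g
    return w

# non-separable toy data (hypothetical) so the MLE is finite
X = np.array([[1, 0.5], [1, 2.5], [1, 1.0], [1, 3.0], [1, 2.0], [1, 1.5]])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
w = logistic_irls(X, y)
```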

7. Worked numeric example (small) — classification boundary

Toy dataset:

X = np.array([[1, 0.5], [1, 2.5], [1, 1.0], [1, 3.0]])  # first column = intercept
y = np.array([0, 1, 0, 1])

Run logistic_gd(X, y) and inspect w (intercept & slope). Predict probabilities via sigmoid(X @ w).
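Putting this together with logistic_gd from section 5 (the step size and iteration count below are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, lr=0.1, iters=1000):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y)
        w -= lr * grad / n
    return w

X = np.array([[1, 0.5], [1, 2.5], [1, 1.0], [1, 3.0]])  # first column = intercept
y = np.array([0.0, 1.0, 0.0, 1.0])
w = logistic_gd(X, y, lr=0.5, iters=5000)
probs = sigmoid(X @ w)
print(w, probs)
```

The data are linearly separable, so the weights keep growing with more iterations; the predicted class (probability above or below 0.5) stabilizes long before that.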
      

8. Interactive Playground — Binary Logistic Regression (GD)

Enter a small one-feature toy dataset, one example per line in the form feature_value,label (label 0 or 1). The demo fits w0 + w1·x using batch gradient descent and prints progress across iterations.
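The interactive widget itself does not survive in text form; a minimal offline sketch of what it computes, parsing feature_value,label lines and fitting w0 + w1·x by batch GD (the learning rate and iteration count are illustrative), might look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_from_text(text, lr=0.5, iters=200):
    """Parse 'feature_value,label' lines and fit w0 + w1*x by batch GD."""
    rows = [line.split(',') for line in text.strip().splitlines()]
    x = np.array([float(r[0]) for r in rows])
    y = np.array([float(r[1]) for r in rows])
    X = np.column_stack([np.ones_like(x), x])   # prepend intercept column
    w = np.zeros(2)
    for t in range(iters):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
        if t % 50 == 0:                          # print progress, as the demo does
            loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
            print(f"iter {t}: loss {loss:.4f}")
    return w

w = fit_from_text("0.5,0\n2.5,1\n1.0,0\n3.0,1")
```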

9. Other Examples & Extensions

10. Exercises

  1. Implement Newton-Raphson / IRLS for the toy dataset and compare convergence to GD.
  2. Train logistic regression on a real dataset (e.g., Iris binary problem) and compute ROC AUC.
  3. Compare L2 vs L1 regularization; show effect on coefficients for correlated features.