Logistic regression models the probability of class membership using a sigmoid (binary) or softmax (multiclass) link. Training minimizes the cross-entropy loss. Optimization is commonly done via (stochastic) gradient descent, or via Newton-type methods such as IRLS when the dataset is small enough for matrix solves to be affordable.
Model: for input vector x (including intercept), define linear score z = wᵀx. Probability of class 1:
p(y=1|x) = σ(z) = 1 / (1 + e^{-z})
Cross-entropy loss (negative log-likelihood) for dataset {(xᵢ, yᵢ)}:
L(w) = - Σ [ yᵢ log σ(wᵀxᵢ) + (1-yᵢ) log(1 - σ(wᵀxᵢ)) ]
Gradient:
∇L(w) = Σ (σ(wᵀxᵢ) - yᵢ) xᵢ
Use this gradient in GD/SGD updates: w ← w - η ∇L(w).
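A quick way to validate the gradient formula is a central finite-difference check. The following is a self-contained sketch with made-up data; loss, grad, and the random inputs are illustrative names, not part of the later demo:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # intercept + 2 features
y = np.array([0, 1, 1, 0, 1], dtype=float)
w = rng.normal(size=3)

# central finite differences, one coordinate at a time
eps = 1e-6
num = np.array([
    (loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(num - grad(w, X, y))))  # max discrepancy should be tiny
```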
For K classes, param matrix W (K×p) gives scores z_k = w_kᵀ x. Softmax:
p(y=k|x) = exp(z_k) / Σ_j exp(z_j)
Loss (cross-entropy):
L(W) = - Σ_i Σ_k 1{yᵢ=k} log p(yᵢ=k | xᵢ)
Gradient per class: ∇_{w_k} = Σ_i (p(y=k|xᵢ) - 1{yᵢ=k}) xᵢ
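The softmax probabilities and per-class gradient above can be sketched in NumPy. Variable names here are illustrative; labels are assumed one-hot rows of Y:

```python
import numpy as np

def softmax(Z):
    # subtract the row max for numerical stability (does not change the result)
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_grad(W, X, Y):
    # W: K x p, X: n x p, Y: n x K one-hot labels; returns the K x p gradient
    P = softmax(X @ W.T)   # n x K class probabilities
    return (P - Y).T @ X   # row k is the per-class formula above

# tiny example: 3 classes, intercept column + 1 feature
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]])
Y = np.eye(3)              # each row is a one-hot label
W = np.zeros((3, 2))
G = softmax_grad(W, X, Y)
print(G.shape)             # (3, 2)
```

Note that the class gradients sum to zero across classes, since the rows of P - Y each sum to zero.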
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# X: n x p (include a column of ones for the intercept), y: n (0/1)
def logistic_gd(X, y, lr=0.1, iters=1000):
    n, p = X.shape
    w = np.zeros(p)
    for t in range(iters):
        z = X @ w
        preds = sigmoid(z)
        grad = X.T @ (preds - y)  # shape (p,)
        w -= lr * grad / n        # average-gradient step
    return w
from sklearn.linear_model import LogisticRegression

# L2-regularized logistic regression; lbfgs supports the l2 penalty
# (liblinear/saga are alternative solvers, e.g. saga for l1)
clf = LogisticRegression(penalty='l2', solver='lbfgs', C=1.0, max_iter=1000)
clf.fit(X_train, y_train)  # X_train, y_train: your training data
print(clf.coef_, clf.intercept_)
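A runnable version of the sklearn snippet, assuming scikit-learn is installed, on a small one-feature dataset. sklearn fits the intercept itself, so no column of ones is needed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [2.5], [1.0], [3.0]])  # one feature per row
y = np.array([0, 1, 0, 1])

clf = LogisticRegression(penalty='l2', solver='lbfgs', C=1.0, max_iter=1000)
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # P(y=1 | x) for each row
print(clf.coef_, clf.intercept_, probs.round(3))
```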
Newton update: w ← w - H^{-1} g, where g is gradient and H is Hessian. For logistic regression the Hessian is:
H = Xᵀ R X, where R is diagonal matrix with rᵢ = σ(zᵢ)(1-σ(zᵢ))
IRLS solves at each step (Xᵀ R X) Δw = Xᵀ (y - p) and updates w ← w + Δw; this is exactly the Newton step, since the gradient is Xᵀ(p - y). It converges in few iterations, but each iteration forms Xᵀ R X in O(np²) and solves a p×p system in O(p³), which becomes expensive for large p.
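A compact IRLS sketch. This is illustrative, not a production solver: a tiny ridge term keeps Xᵀ R X invertible, and the data is chosen to be non-separable so the unregularized MLE stays finite (on separable data the weights diverge):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_irls(X, y, iters=25, ridge=1e-8):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        prob = sigmoid(X @ w)
        r = prob * (1 - prob)                     # diagonal of R
        H = X.T @ (r[:, None] * X) + ridge * np.eye(p)
        step = np.linalg.solve(H, X.T @ (y - prob))
        w += step
        if np.max(np.abs(step)) < 1e-10:          # converged
            break
    return w

# non-separable 1D data: the classes overlap around x = 1-2
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
w = logistic_irls(X, y)
print(w)  # finite weights, positive slope
```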
Toy dataset:
X = np.array([[1, 0.5], [1, 2.5], [1, 1.0], [1, 3.0]])  # first column: intercept
y = np.array([0, 1, 0, 1])
Run logistic_gd and inspect w (intercept & slope). Predict probabilities via sigmoid(X @ w).
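Put together as one runnable script (repeating the sigmoid and logistic_gd definitions from above so it is self-contained; lr and iters here are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, lr=0.1, iters=1000):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y)  # cross-entropy gradient
        w -= lr * grad / n                 # average-gradient step
    return w

X = np.array([[1.0, 0.5], [1.0, 2.5], [1.0, 1.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])

w = logistic_gd(X, y, lr=0.5, iters=5000)
probs = sigmoid(X @ w)
print(w, probs.round(3))  # probabilities should come out low, high, low, high
```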
Demo: enter a small one-feature toy dataset, one example per line as feature_value,label (0 or 1). The demo fits w0 + w1 * x using batch gradient descent and prints progress over iterations.