Model & Loss
For inputs \(\mathbf{x}_i\in\mathbb{R}^d\), labels \(y_i\in\{0,1\}\), parameters \(\mathbf{w}, b\):
\[ \hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b), \quad \sigma(z)=\frac{1}{1+e^{-z}}. \]
Negative log-likelihood (binary cross-entropy):
\[ \mathcal{L}(\mathbf{w},b) = -\sum_{i=1}^n\big[ y_i\log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\big]. \]
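The loss above can be evaluated directly. This is a minimal sketch (the name `bce_loss` and the clipping constant `eps` are choices made here; clipping keeps the logs finite and is a numerical safeguard, not part of the math):

```python
import numpy as np

def bce_loss(w, b, X, y, eps=1e-12):
    """Binary cross-entropy, summed over examples, per the formula above."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    p = np.clip(p, eps, 1.0 - eps)           # avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

With \(\mathbf{w}=\mathbf{0}, b=0\) every \(\hat p_i = 0.5\), so the loss equals \(n\log 2\), a useful baseline check.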
Gradients
Let \(\mathbf{X}\in\mathbb{R}^{n\times d}\) and \(\hat{\mathbf{p}}=\sigma(\mathbf{X}\mathbf{w}+b\mathbf{1})\). Then
\[ \nabla_{\mathbf{w}}\mathcal{L} = \mathbf{X}^\top(\hat{\mathbf{p}}-\mathbf{y}), \qquad \partial_b\mathcal{L}= \mathbf{1}^\top(\hat{\mathbf{p}}-\mathbf{y}). \]
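The closed-form gradients above can be sanity-checked against central finite differences. This is an illustrative sketch on arbitrary random data (the function name `loss_and_grads` is a choice made here):

```python
import numpy as np

def loss_and_grads(w, b, X, y):
    """Loss plus the closed-form gradients derived above."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, X.T @ (p - y), np.sum(p - y)

# Small random problem; the data itself is arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = (rng.random(5) > 0.5).astype(float)
w = rng.normal(size=3); b = 0.1

_, gw, gb = loss_and_grads(w, b, X, y)
h = 1e-5
num_gw = np.array([
    (loss_and_grads(w + h * np.eye(3)[j], b, X, y)[0]
     - loss_and_grads(w - h * np.eye(3)[j], b, X, y)[0]) / (2 * h)
    for j in range(3)
])
num_gb = (loss_and_grads(w, b + h, X, y)[0]
          - loss_and_grads(w, b - h, X, y)[0]) / (2 * h)
print(np.max(np.abs(num_gw - gw)), abs(num_gb - gb))  # both should be tiny
```

Central differences have \(O(h^2)\) truncation error, so agreement to several decimal places is expected if the analytic gradients are correct.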
Gradient descent update with step-size \(\eta\):
\[ \mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla_{\mathbf{w}}\mathcal{L}, \qquad b \leftarrow b - \eta\, \partial_b\mathcal{L}. \]
Worked Example (Tiny Dataset)
```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])   # AND gate: only (1,1) is positive
w = np.zeros(2); b = 0.0
eta = 0.5                        # step size

for t in range(10):
    z = X @ w + b                # logits
    p = 1 / (1 + np.exp(-z))     # predicted probabilities
    grad_w = X.T @ (p - y)       # gradient w.r.t. w
    grad_b = np.sum(p - y)       # gradient w.r.t. b
    w -= eta * grad_w
    b -= eta * grad_b
    print(t + 1, w, b)
```
After a few iterations the weights turn positive and the bias negative, moving the decision boundary to separate the positive example \((1,1)\) from the other three points.
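Running the same loop for more iterations shows the learned boundary classifying all four points correctly. This is a sketch continuing the example above (1000 steps is an arbitrary choice; the data is separable, so the weights keep growing and never converge exactly):

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])   # AND gate
w = np.zeros(2); b = 0.0
eta = 0.5

# Same updates as before, just run longer so the boundary settles.
for t in range(1000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= eta * (X.T @ (p - y))
    b -= eta * np.sum(p - y)

preds = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(float)
print(preds)   # matches the AND labels
print(w, b)    # both weights positive, bias negative
```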
Regularization (Optional)
Add an L2 penalty: \(\mathcal{L}_\text{reg} = \mathcal{L} + \tfrac{\lambda}{2}\lVert\mathbf{w}\rVert^2\), so \(\nabla_{\mathbf{w}}\mathcal{L}\) gains an extra \(\lambda\mathbf{w}\) term (the bias is usually left unpenalized). This discourages large weights, which helps generalization and improves the conditioning of the optimization.
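A single update step with the L2 penalty folded in might look like this. This is a sketch: the name `reg_step`, the default \(\lambda\), and leaving the bias unpenalized (a common convention) are choices made here.

```python
import numpy as np

def reg_step(w, b, X, y, eta=0.5, lam=0.1):
    """One gradient step on the L2-regularized loss.

    Only w is penalized; the bias b is left unregularized.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad_w = X.T @ (p - y) + lam * w   # extra lambda*w term from the penalty
    grad_b = np.sum(p - y)
    return w - eta * grad_w, b - eta * grad_b
```

At \(\mathbf{w}=\mathbf{0}\) the penalty's gradient vanishes, so the first step matches the unregularized one; the shrinkage effect appears only once the weights move away from zero.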