Lecture 27 — Bayesian Learning

Principles • Naïve Bayes variants • Bayesian networks • Bayesian regression • MAP vs MLE • Practical demo

1. Bayesian Learning — Big Picture

Bayesian learning treats model parameters as random variables and uses probability to represent uncertainty. Learning updates a prior belief about parameters θ to a posterior using observed data D via Bayes' rule:

p(θ | D) = p(D | θ) p(θ) / p(D)

- p(θ): prior (what you believed before seeing data).
- p(D|θ): likelihood (how probable the observed data is under θ).
- p(θ|D): posterior (updated belief).
- p(D): evidence (normalizing constant).

Example (coin toss): prior Beta(α,β) over coin bias θ. Observing heads/tails updates α,β; posterior is Beta(α+heads, β+tails).
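The conjugate update above is just counting, so it fits in a few lines of plain Python (a minimal sketch; the prior parameters and counts below are made up for illustration):

```python
# Conjugate Beta-Bernoulli update: prior Beta(alpha, beta), observe coin flips.
alpha_prior, beta_prior = 2, 2   # hypothetical prior belief
heads, tails = 7, 3              # hypothetical observed data

# Posterior is Beta(alpha + heads, beta + tails).
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails

# Posterior mean of the coin bias theta.
posterior_mean = alpha_post / (alpha_post + beta_post)
print(posterior_mean)  # 9/14 ≈ 0.643
```

Note how the prior acts like "pseudo-counts": Beta(2, 2) behaves as if we had already seen one head and one tail plus one of each.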

2. Six Topics / Models Covered

1) Naïve Bayes (general)

Assumes features are conditionally independent given class: p(y|x) ∝ p(y) ∏ p(xᵢ|y). Fast, works well for text.

2) Multinomial Naïve Bayes

Used for count data (bag-of-words). Likelihood from word counts per class (Laplace smoothing often applied).

3) Bernoulli Naïve Bayes

Binary features (word present/absent). Useful when only occurrence matters.

4) Gaussian Naïve Bayes

Continuous features modeled as Gaussians per class: p(xᵢ|y=c) = N(μ_{c,i}, σ_{c,i}²).

5) Bayesian Networks (Directed Acyclic Graphs)

A DAG represents the conditional-independence structure; the joint distribution factorizes as ∏ p(Xᵢ | Parents(Xᵢ)). BNs support structured reasoning and, with care, causal modeling.

6) Bayesian Linear Regression & MAP

Place a prior on the weights (e.g., Gaussian). With a Gaussian likelihood, the posterior over weights is also Gaussian (conjugacy). The MAP estimate blends prior and likelihood and is equivalent to ridge (L2-regularized) regression.

3. MLE vs MAP vs Full Bayesian

  • MLE (Maximum Likelihood): choose θ that maximizes p(D|θ). No prior used.
  • MAP (Maximum A Posteriori): choose θ that maximizes p(θ|D) ∝ p(D|θ)p(θ). Prior acts as regularizer.
  • Full Bayesian: keep the entire posterior distribution p(θ|D) — enables uncertainty quantification and predictive distribution by integrating over θ.
Example: Gaussian likelihood + Gaussian prior → closed-form posterior (conjugacy). MAP with Gaussian prior = ridge regression.
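The MAP = ridge equivalence can be checked numerically: with a zero-mean Gaussian prior on the weights, the MAP solution is (XᵀX + λI)⁻¹Xᵀy, the ridge formula. A minimal sketch (synthetic data, hypothetical λ):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (hypothetical ground-truth weights).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1.0  # lambda = sigma^2 / tau^2 for a N(0, tau^2 I) weight prior

# MAP estimate under the Gaussian prior: (X^T X + lam I)^{-1} X^T y.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Ridge with the same penalty gives the same weights.
w_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_map, w_ridge))  # True
```

The ratio σ²/τ² makes the regularization strength interpretable: a tighter prior (small τ) or noisier data (large σ) pulls the MAP estimate harder toward zero.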

4. Bayesian Networks (BNs)

BNs encode conditional independencies with a DAG. They are powerful for modeling structured domains (medical diagnosis, fault trees). Inference can be done via variable elimination, belief propagation, or sampling (MCMC).
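The factorization makes exact inference by enumeration straightforward on small networks. A minimal sketch with a hypothetical rain/sprinkler/wet-grass DAG (all CPT numbers invented for illustration):

```python
# CPTs for a tiny DAG: Rain -> WetGrass <- Sprinkler (numbers are hypothetical).
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.1, False: 0.9}
p_wet_given = {  # p(WetGrass=True | Rain, Sprinkler)
    (True, True): 0.99, (True, False): 0.90,
    (False, True): 0.85, (False, False): 0.05,
}

def joint(r, s, w):
    """Joint probability via the BN factorization p(R) p(S) p(W | R, S)."""
    pw = p_wet_given[(r, s)]
    return p_rain[r] * p_sprinkler[s] * (pw if w else 1 - pw)

# Infer p(Rain=True | WetGrass=True) by summing out Sprinkler.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(num / den)  # ≈ 0.636: wet grass raises the probability of rain from 0.2
```

Enumeration is exponential in the number of variables; variable elimination and belief propagation exploit the same factorization to do this far more efficiently.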

5. Applications

  • Spam detection (Naïve Bayes) — simple and effective for text.
  • Medical diagnosis (Bayesian networks capture symptom-disease relations).
  • Probabilistic calibration and uncertainty-aware predictions (Bayesian regression).
  • Hyperparameter tuning via Bayesian optimization.

6. Short Python Examples (sketch)

# Gaussian Naive Bayes (scikit-learn); assumes X_train, X_test, y_train exist
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Multinomial Naive Bayes for text; assumes documents and labels exist
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
X = CountVectorizer().fit_transform(documents)  # bag-of-words counts
clf = MultinomialNB(alpha=1.0)  # alpha=1.0 gives Laplace smoothing
clf.fit(X, labels)

Next — interactive Naïve Bayes posterior calculator: enter classes, priors, and likelihoods (categorical) and compute posteriors.
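As a preview, the core of such a calculator is a one-step application of Bayes' rule over categorical classes (a minimal sketch; the class names and numbers below are hypothetical):

```python
def posteriors(priors, likelihoods):
    """Bayes' rule for categorical classes.

    priors:      {class: p(class)}
    likelihoods: {class: p(evidence | class)}
    Returns      {class: p(class | evidence)}.
    """
    unnorm = {c: priors[c] * likelihoods[c] for c in priors}
    z = sum(unnorm.values())  # evidence term p(D)
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical spam example: the word "offer" appears in the message.
result = posteriors({"spam": 0.4, "ham": 0.6}, {"spam": 0.30, "ham": 0.05})
print(result)  # {'spam': 0.8, 'ham': 0.2}
```

With conditionally independent features, the likelihood entry for each class is just the product of the per-feature likelihoods, which is exactly the Naïve Bayes assumption from Section 2.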