Linear Regression is one of the most fundamental and widely used algorithms in machine learning and statistics. It models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a straight line (or hyperplane) through the data.
y = w0 + w1x1 + w2x2 + ... + wnxn + ε
The goal is to estimate the coefficients w that minimize the mean squared error:

J(w) = (1/m) ∑ (yᵢ - ŷᵢ)²
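For ordinary least squares this minimization has a closed-form solution, w = (XᵀX)⁻¹Xᵀy. A minimal NumPy sketch on made-up data (the coefficients 2 and 3 are arbitrary, chosen just so we can check the recovery):

```python
import numpy as np

# Toy data: y = 2 + 3*x plus Gaussian noise (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)

# Design matrix with a column of ones for the intercept w0
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares via a numerically stable solver
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept ~", w[0], " slope ~", w[1])
```

The fitted intercept and slope land close to the true values 2 and 3; the leftover gap comes from the injected noise.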
| Assumption | Description |
|---|---|
| Linearity | Relationship between predictors and target is linear. |
| Independence | Observations are independent of each other. |
| Homoscedasticity | Constant variance of errors across values of predictors. |
| No multicollinearity | Predictors should not be highly correlated with each other. |
| Normality of errors | Residuals are normally distributed. |
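Some of these assumptions can be checked numerically. A sketch on made-up data that inspects the pairwise predictor correlation (for multicollinearity) and the residual mean; the data and the 0.9 mixing weight are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up predictors; x2 is deliberately built to correlate with x1
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
y = 1 + 2 * x1 + rng.normal(0, 0.3, size=200)

# Multicollinearity check: a high off-diagonal correlation is a warning sign
corr = np.corrcoef(x1, x2)[0, 1]
print("corr(x1, x2) =", round(corr, 2))

# Fit and inspect residuals: with an intercept, OLS residuals average to ~0;
# plotting them against predictions would reveal heteroscedasticity
X = np.column_stack([np.ones_like(x1), x1, x2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w
print("residual mean ~", residuals.mean())
```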
Geometrically, simple linear regression fits a straight line to data points in 2D (X vs. Y); with multiple features, it fits a hyperplane in higher dimensions.
Although diagnosing diabetes is a classification task, scikit-learn's diabetes dataset has a continuous target (a measure of disease progression one year after baseline), so linear regression fits it directly; its continuous predictions can also be thresholded to serve as a crude classification baseline.
```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, preds))
```
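As noted above, the continuous predictions can be thresholded into a crude binary baseline. A sketch that treats above-median disease progression as the positive class; the median cutoff is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

# Threshold at the training-set median (an illustrative cutoff, not a clinical one)
cutoff = np.median(y_train)
pred_labels = preds > cutoff
true_labels = y_test > cutoff
accuracy = (pred_labels == true_labels).mean()
print("baseline accuracy:", round(accuracy, 2))
```

A dedicated classifier (e.g. logistic regression) would normally replace this baseline once it is established.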
We predict monthly sales using advertising spend on TV, radio, and newspaper.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Example sales data (toy-sized; real use needs far more rows)
sales = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

X = sales[["TV", "Radio", "Newspaper"]]
y = sales["Sales"]
# With only 5 rows, keep at least 2 test samples: R² is undefined on a single sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print("R² Score:", r2_score(y_test, preds))
```
By interpreting the coefficients (ideally after standardizing the features so they are on comparable scales), we can see which medium (TV, Radio, or Newspaper) contributes most to sales.
- **Ridge** penalizes large coefficients via the L2 norm: J = RSS + λ ∑ w²
- **Lasso** penalizes the absolute values of the coefficients (L1 norm), which can zero some out, giving feature selection: J = RSS + λ ∑ |w|
- **Elastic Net** combines the Ridge and Lasso penalties: J = RSS + λ₁ ∑ w² + λ₂ ∑ |w|
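Lasso's feature-selection effect is easy to see on synthetic data. In this sketch the coefficients (3 and 2), the number of irrelevant features, and alpha=0.5 are all made-up choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# 100 samples, 5 features, but only the first two actually drive y
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
print("coefficients:", np.round(lasso.coef_, 2))
# The L1 penalty drives the three irrelevant coefficients to exactly 0,
# while shrinking (but keeping) the two informative ones
```

Ridge, by contrast, shrinks all coefficients toward zero but rarely makes any of them exactly zero.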
```python
# Pipeline with scaling + regularization
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])
pipe.fit(X_train, y_train)
```
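The penalty strength λ (called `alpha` in scikit-learn) is usually chosen by cross-validation rather than fixed by hand. A sketch using `RidgeCV` inside the same kind of pipeline; the synthetic data and the candidate alpha grid are assumptions for illustration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
# Synthetic regression problem with known coefficients
X = rng.normal(size=(80, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.2, size=80)

# RidgeCV evaluates each candidate alpha with built-in cross-validation
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])),
])
pipe.fit(X, y)
print("chosen alpha:", pipe.named_steps["ridge"].alpha_)
```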