Lecture 22: Data Preparation & Feature Engineering

High-quality data beats clever modeling. Today we systematize data cleaning, splitting, encoding, scaling, handling imbalance/missingness, and building robust feature pipelines.

1) Data Quality Dimensions

| Dimension    | Questions                            | Remedies                                                  |
|--------------|--------------------------------------|-----------------------------------------------------------|
| Completeness | Missingness pattern: MCAR/MAR/MNAR?  | Impute (mean/median/mode), model-based, indicator flags   |
| Consistency  | Consistent units, duplicated IDs?    | Standardize units, de-duplicate, canonicalize categories  |
| Validity     | Schema ranges respected?             | Clamp, winsorize, domain rules                            |
| Timeliness   | Stale or future values?              | Cut by time, roll-forward features                        |
| Bias         | Sampling/label bias?                 | Audit distributions, stratify, fairness metrics           |
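A minimal sketch of a quick quality audit along these dimensions, assuming pandas; df and the column names (id, age, event_time) are illustrative:

# Quick data-quality audit (toy column names)
import pandas as pd

missing_rate = df.isna().mean().sort_values(ascending=False)   # completeness: per-column missingness
dup_ids = df["id"].duplicated().sum()                           # consistency: duplicated ids
out_of_range = (~df["age"].between(0, 120)).sum()               # validity: domain range rule
future_rows = (df["event_time"] > pd.Timestamp.now()).sum()     # timeliness: values from the future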

2) Splits & Leakage (Patterns)

Random / Stratified Split – IID data, keep label proportions.
Group Split – keep all samples of an entity together (patients/users).
Time-based Split – train on past, test on future.
Leakage Watchlist – e.g., fitting scalers/encoders/imputers on the full dataset before splitting, features derived from the target, the same entity appearing in both train and test, and future information reaching past rows in time-based features.
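A minimal sketch of group- and time-based splits, assuming scikit-learn and a pandas DataFrame df with illustrative columns user_id and event_time:

# Group split: all rows of an entity stay on one side; time split: train on past, test on future
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["user_id"]))
train_g, test_g = df.iloc[train_idx], df.iloc[test_idx]

cutoff = df["event_time"].quantile(0.8)            # earliest 80% of time goes to train
train_t, test_t = df[df["event_time"] <= cutoff], df[df["event_time"] > cutoff]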

3) Encoding Categorical Variables
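A minimal sketch of the two most common encoders, assuming scikit-learn; the toy columns color and size are illustrative. One-hot suits low-cardinality nominal features; ordinal encoding only fits categories with a natural order:

# One-hot vs. ordinal encoding (toy data)
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})
ohe = OneHotEncoder(handle_unknown="ignore")             # one column per category; unknowns map to all zeros
color_ohe = ohe.fit_transform(X[["color"]])
ord_enc = OrdinalEncoder(categories=[["S", "M", "L"]])   # explicit order: S < M < L
size_ord = ord_enc.fit_transform(X[["size"]])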

4) Scaling & Transformations
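A minimal sketch, assuming scikit-learn and NumPy; fit the scaler on train only and reuse it on validation/test:

# Standardization plus a log transform for skewed positive values (toy data)
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 800.0], [3.0, 50.0]])
X_test = np.array([[1.5, 400.0]])
scaler = StandardScaler().fit(X_train)                    # mean/std learned from train only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
X_train_log = np.log1p(X_train)                           # compresses heavy right tails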

5) Missing Values & Outliers

Imputation
Outliers
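A minimal sketch of median imputation with missingness flags plus IQR-based clipping, assuming scikit-learn and pandas; the toy columns age and income are illustrative:

# Imputation with indicator columns, then winsorize-style clipping of outliers
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({"age": [25, np.nan, 40, 95], "income": [30_000, 52_000, np.nan, 1_000_000]})
imp = SimpleImputer(strategy="median", add_indicator=True)   # appends 0/1 was-missing columns
X_imp = imp.fit_transform(X)

q1, q3 = X["income"].quantile([0.25, 0.75])
iqr = q3 - q1
X["income_clipped"] = X["income"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)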

6) Class Imbalance
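Two common remedies, reweighting and resampling, sketched below with scikit-learn and pandas; df_train and the label column are illustrative names, and any resampling should touch the training split only:

# Option 1: class weights in the loss; Option 2: naive random oversampling of the minority class
import pandas as pd
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight="balanced", max_iter=1000)

minority = df_train[df_train["label"] == 1]
n_extra = (df_train["label"] == 0).sum() - len(minority)
train_balanced = pd.concat([df_train, minority.sample(n_extra, replace=True, random_state=0)])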

7) Feature Engineering Patterns

Numeric
Categorical
Datetime
Text (NLP)
Images
Time-series
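A minimal sketch of a few of these patterns (numeric ratio, datetime parts, per-entity lags, TF-IDF), assuming pandas and scikit-learn; df and all column names are illustrative, with event_time already a datetime column:

# Numeric, datetime, time-series, and text features on a toy DataFrame
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df["debt_to_income"] = df["debt"] / (df["income"] + 1e-9)          # numeric: ratio
df["dow"] = df["event_time"].dt.dayofweek                           # datetime: calendar parts
df["hour"] = df["event_time"].dt.hour
df = df.sort_values("event_time")
df["sales_lag_7"] = df.groupby("store_id")["sales"].shift(7)        # time-series: lag per entity
tfidf = TfidfVectorizer(max_features=5000).fit_transform(df["review_text"].fillna(""))  # text: TF-IDF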

8) Safe Pipelines

Principle: Put all preprocessing inside a pipeline so that fitting uses train-only statistics and transforms are applied identically to validation/test.
# sklearn pipeline: numeric and categorical branches, fit on train only, reused on val/test
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

preprocess = make_column_transformer(
    (Pipeline([("impute", SimpleImputer(strategy="median")),
               ("scale", StandardScaler())]), make_column_selector(dtype_include="number")),
    (OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_include=object)),
)
clf = Pipeline([("prep", preprocess),
                ("model", LogisticRegression(class_weight="balanced", max_iter=1000))])
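Fitting the whole pipeline inside cross-validation keeps every preprocessing statistic train-only in each fold; a sketch with illustrative X, y:

# Evaluate the full pipeline with 5-fold CV
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")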
Try This: Build two versions of your dataset: (A) raw; (B) with engineered features (ratios, lags, TF-IDF). Compare ROC-AUC via 5-fold CV. Report which features moved the needle most.