Lecture 22: Data Preparation & Feature Engineering
High-quality data beats clever modeling. Today we systematize data cleaning, splitting, encoding, scaling,
handling imbalance/missingness, and building robust feature pipelines.
Splitting Strategies
Random / Stratified Split – for IID data; keeps label proportions across splits.
Group Split – keeps all samples of an entity together (patients, users).
Time-based Split – train on the past, test on the future.
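The three strategies above can be sketched with scikit-learn's splitters; this is a minimal sketch on synthetic data, and the array sizes and group IDs are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit, TimeSeriesSplit

# Toy data: 100 samples, ~30% positive labels, 10 entities (e.g., patients)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.3).astype(int)
groups = rng.integers(0, 10, size=100)

# Random / stratified split: label proportions preserved in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Group split: no entity appears in both train and test
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
tr_idx, te_idx = next(gss.split(X, y, groups))
assert set(groups[tr_idx]).isdisjoint(groups[te_idx])

# Time-based split: every fold trains strictly on the past
tscv = TimeSeriesSplit(n_splits=3)
for tr, te in tscv.split(X):
    assert tr.max() < te.min()
```

Note that the group and time splits trade statistical efficiency for honesty: they estimate performance on unseen entities and unseen futures, which is usually the deployment scenario.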
Leakage Watchlist
Scalers/encoders fitted on the full dataset instead of train only.
Features computed using future info.
Duplicates of test rows in train.
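The first watchlist item is worth seeing concretely. A minimal sketch, assuming scikit-learn's StandardScaler: the scaler is fitted on training rows only, so the test point is standardized with train-set statistics rather than statistics that include it:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

# Correct: fit on train only, then apply the same transform to test.
# (Fitting on np.vstack([X_train, X_test]) would leak the test point's
# value into the mean/std used everywhere.)
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)  # uses train mean (2.0) and train std only
```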
Encoding Categorical Variables
One-hot – safe default, but can inflate dimensionality.
Ordinal – only for categories with a true order (small < medium < large).
Target / Mean Encoding – powerful, but fit out-of-fold (CV) to avoid leakage.
Hashing – for high-cardinality features (fixed output width).
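A minimal sketch of the one-hot, ordinal, and hashing encoders with scikit-learn (target encoding is omitted here because it requires out-of-fold fitting); the category values are illustrative:

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.feature_extraction import FeatureHasher

sizes = [["small"], ["large"], ["medium"]]

# One-hot: one binary column per category
ohe = OneHotEncoder(handle_unknown="ignore")
onehot = ohe.fit_transform(sizes).toarray()  # shape (3, 3)

# Ordinal: encode the true order small < medium < large as 0 < 1 < 2
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal = oe.fit_transform(sizes)

# Hashing: output width is fixed at n_features no matter how many
# distinct values appear, so unseen categories need no refitting
fh = FeatureHasher(n_features=8, input_type="string")
hashed = fh.transform([["user_123"], ["user_456"]])  # shape (2, 8)
```

`handle_unknown="ignore"` makes one-hot robust at inference time: a category never seen during fitting maps to an all-zeros row instead of raising an error.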
Principle: Put all preprocessing inside a pipeline so that fitting uses train-only statistics and transforms are applied identically to validation/test.
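A sketch of this principle with a scikit-learn Pipeline and ColumnTransformer; the toy DataFrame and column names are illustrative. Because cross_val_score refits the whole pipeline inside each fold, the scaler and encoder only ever see that fold's training rows:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age":  [25, 32, 47, 51, 38, 29, 44, 60],
    "city": ["NY", "SF", "NY", "LA", "SF", "LA", "NY", "SF"],
    "y":    [0, 1, 0, 1, 1, 0, 1, 1],
})

# Column-wise preprocessing fitted with train-only statistics per fold
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())])

# Each CV fold calls pipe.fit on its training rows, then scores the rest
scores = cross_val_score(pipe, df[["age", "city"]], df["y"], cv=2)
```

Passing the fitted pipeline itself (not pre-transformed arrays) into cross-validation is what enforces the train-only rule automatically.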
Try This: Build two versions of your dataset: (A) raw; (B) with engineered features (ratios, lags, TF-IDF). Compare ROC-AUC via 5-fold CV. Report which features moved the needle most.
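A self-contained sketch of the exercise on synthetic data, using an interaction (product) feature as the engineered feature; the exercise's ratios, lags, and TF-IDF features follow the same raw-vs-engineered comparison pattern:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: the label depends on a * b, which no linear model
# over the raw columns can capture exactly
rng = np.random.default_rng(42)
a = rng.uniform(1, 10, 500)
b = rng.uniform(1, 10, 500)
y = (a * b > 25).astype(int)

X_raw = np.column_stack([a, b])           # version (A): raw features
X_eng = np.column_stack([a, b, a * b])    # version (B): plus engineered interaction

clf = LogisticRegression(max_iter=1000)
auc_raw = cross_val_score(clf, X_raw, y, cv=5, scoring="roc_auc").mean()
auc_eng = cross_val_score(clf, X_eng, y, cv=5, scoring="roc_auc").mean()
```

To see "which features moved the needle", compare the two mean AUCs and inspect the fitted coefficients (or use permutation importance) on the engineered version.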