Data Leakage

What is Data Leakage?

Data leakage occurs when information that would not be available at prediction time, including anything derived from evaluation data, influences the model during training. The model then scores unrealistically well in evaluation but fails in production.

The Core Principle

At evaluation time we are simulating a real prediction scenario. For that simulation to be valid, two things must hold:

The model must not have learned from evaluation data
Every feature must reflect only information that would be available at the moment a real prediction is needed

Every form of leakage violates at least one of these two conditions.

Form 1: Target Leakage

The feature directly encodes the target, or reflects events that happen after the prediction point.

Examples:
- Predicting whether a loan defaults while including a feature like "missed_payment", which is only populated after the loan is issued, once the default is already under way
- Predicting hospital readmission using a feature collected after discharge
- Any field that is a downstream consequence of what you're trying to predict

Why it's subtle: these features have high predictive power, so the model eagerly uses them. The problem only becomes obvious when the model fails on new data where those future values don't exist yet.

Timeline:
  [Prediction point] -----> [Event you're predicting]
         ↑                          ↑
  features must live here     target lives here

  Target leakage = a feature that lives here ----^
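
The timeline check can be automated when you know when each feature value was recorded. The following is a minimal sketch under that assumption: the function `find_leaky_features`, the feature names, and all dates are hypothetical and only illustrate the idea of comparing each feature's timestamp with the prediction point.

```python
from datetime import datetime


def find_leaky_features(feature_timestamps, prediction_time):
    """Return the names of features recorded after the prediction point."""
    return sorted(
        name
        for name, recorded_at in feature_timestamps.items()
        if recorded_at > prediction_time
    )


# Hypothetical loan application: the prediction is made at approval time.
prediction_time = datetime(2023, 1, 15)

# Hypothetical record of when each feature value became known.
feature_timestamps = {
    "income": datetime(2023, 1, 10),
    "credit_score": datetime(2023, 1, 12),
    "missed_payment": datetime(2023, 6, 1),  # known only months after approval
}

leaky = find_leaky_features(feature_timestamps, prediction_time)
print(leaky)  # ['missed_payment']
```

In practice this kind of metadata is often missing, in which case the same check has to be done by reasoning about how each column is produced.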

Form 2: Train-Test Contamination

Test data influences the training process, breaking the independence between splits.

Preprocessing Before Splitting

Applying transformations to the full dataset before splitting leaks test statistics into training:

Fitting a scaler on all data → the mean and std include test rows
Imputing missing values using the full dataset mean
Encoding categories based on frequencies across all rows

Fix: always split first, then fit each transformation on the training data only, and apply the already-fitted transformation to the test data without refitting it.

Wrong:                          Right:
  full data                       full data
      ↓                               ↓
  normalize ← leaks test!         train / test split
      ↓                               ↓           ↓
  train / test split          fit scaler      apply scaler
                              on train only   to test only
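
The "Right" column can be sketched in a few lines. This is a dependency-free illustration, not a reference implementation: `split_rows` and `SimpleScaler` are hypothetical stand-ins for whatever splitting and scaling utilities your library provides, and the data is a toy column.

```python
import random
import statistics


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


class SimpleScaler:
    """Hypothetical minimal scaler: fit() learns mean and std, transform() applies them."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


values = [float(v) for v in range(1, 21)]  # toy feature column

# 1. Split first.
train, test = split_rows(values)

# 2. Fit on the training rows only, so no test statistics are learned.
scaler = SimpleScaler().fit(train)

# 3. Transform both splits with the statistics learned from train.
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```

The scaled test values will generally not have mean 0 and std 1, and that is expected: they are measured against the training distribution, exactly as production data would be. With scikit-learn, placing the scaler inside a `Pipeline` gives the same guarantee during cross-validation, because the whole pipeline is refit on each training fold.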

Other Common Sources

Source	Why it leaks
Splitting time-ordered data randomly	Future rows end up in training set
Tuning hyperparameters on test scores	Test set guides model decisions
Duplicate rows across splits	Model has effectively seen test data
Feature engineering informed by test patterns	Test distribution shapes the features
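
Of the sources in the table, duplicate rows are the easiest to check mechanically. The sketch below assumes rows are hashable tuples and looks for exact matches only; `find_cross_split_duplicates` and the sample rows are hypothetical. Near-duplicates, or several rows belonging to the same customer or patient, need a domain-specific key instead.

```python
def find_cross_split_duplicates(train_rows, test_rows):
    """Return test rows that also appear, value for value, in the training set."""
    seen_in_train = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in seen_in_train]


# Hypothetical rows; one test row is an exact copy of a training row.
train_rows = [(5.1, 3.5, "a"), (4.9, 3.0, "b"), (6.2, 2.9, "c")]
test_rows = [(4.9, 3.0, "b"), (5.9, 3.2, "d")]

duplicates = find_cross_split_duplicates(train_rows, test_rows)
print(duplicates)  # [(4.9, 3.0, 'b')]
```

A simple remedy is to deduplicate the full dataset before splitting, so that no row can land on both sides.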

Detecting Leakage

Model accuracy is suspiciously high, especially when it is near perfect
Performance drops sharply when moving from evaluation to production
A feature has near-perfect correlation with the target
Feature importance ranks highly a variable that shouldn't logically exist at prediction time
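
The correlation check is simple to script as a first screening pass. The sketch below is one possible approach, not a complete detector: the helper names, the toy columns, and the 0.95 threshold are all assumptions chosen for illustration, and Pearson correlation only captures linear relationships. A flagged feature is a prompt to investigate its origin, not proof of leakage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def flag_suspicious_features(columns, target, threshold=0.95):
    """Return {feature name: correlation} for features with |r| >= threshold."""
    flagged = {}
    for name, values in columns.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Hypothetical data: "missed_payment" is a copy of the target.
target = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {
    "age": [25, 52, 41, 33, 47, 29, 60, 38],
    "missed_payment": [0, 0, 1, 1, 0, 1, 0, 1],
}

flagged = flag_suspicious_features(columns, target)
print(sorted(flagged))  # ['missed_payment']
```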

Key Rules

Split before you preprocess: fit all transformers on training data only
Check the timeline: every feature must be available before the prediction moment
Never touch the test set until final evaluation: no hyperparameter tuning against it
Split time-series data chronologically, never randomly
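
The last two rules combine naturally into a three-way chronological split: tune on a validation window and keep the most recent window untouched as the test set. The sketch below assumes each row carries a timestamp; `chronological_split`, the cutoff dates, and the sample rows are hypothetical.

```python
from datetime import date


def chronological_split(rows, time_key, train_end, val_end):
    """Split by time: train < train_end <= validation < val_end <= test."""
    train = [r for r in rows if time_key(r) < train_end]
    val = [r for r in rows if train_end <= time_key(r) < val_end]
    test = [r for r in rows if time_key(r) >= val_end]
    return train, val, test


# Hypothetical daily observations, deliberately stored out of order.
rows = [
    {"day": date(2024, 1, d), "value": d * 10}
    for d in (7, 2, 9, 4, 1, 10, 3, 8, 6, 5)
]

train, val, test = chronological_split(
    rows,
    time_key=lambda r: r["day"],
    train_end=date(2024, 1, 7),  # train on days 1-6
    val_end=date(2024, 1, 9),    # validate on days 7-8, test on days 9-10
)

# Every training row precedes every validation row, which precedes every test row.
print(len(train), len(val), len(test))  # 6 2 2
```

Hyperparameters are compared using `val` only; `test` is scored once, at the end. For repeated evaluation over several windows, scikit-learn's `TimeSeriesSplit` produces folds in which the training indices always precede the test indices.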