Machine Learning Backtesting Frameworks

Overview

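Backtesting ML models for trading requires specialized frameworks that account for non-stationarity, look-ahead bias, and the low signal-to-noise ratio of financial data. Standard ML validation, such as randomly shuffled k-fold cross-validation, is insufficient because it assumes independent, identically distributed samples and lets training folds contain data from after the test period.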

Difficulty: advanced

Key Challenges

1. Look-Ahead Bias

Common sources:
- Using future data in feature calculation (e.g., future returns for normalization)
- Data leakage in cross-validation
- Survivorship bias (using only currently existing stocks)
- Using adjusted prices that incorporate future corporate actions
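As an illustration of the first source, the sketch below normalizes each observation using only the observations strictly before it, so no future value can enter the feature. The function name `causal_zscores`, the window length, and the sample returns are assumptions made for this example, not part of any particular framework.

```python
from statistics import mean, stdev

def causal_zscores(values, window):
    """Z-score each value against the `window` observations strictly before it.

    `window` must be at least 2 so a sample standard deviation exists.
    """
    out = []
    for i, x in enumerate(values):
        past = values[max(0, i - window):i]  # excludes values[i] and everything after it
        if len(past) < window:
            out.append(None)  # not enough history yet
            continue
        mu, sigma = mean(past), stdev(past)
        out.append((x - mu) / sigma if sigma > 0 else 0.0)
    return out

# Hypothetical daily returns, for illustration only.
returns = [0.01, -0.02, 0.015, 0.003, -0.007, 0.02, -0.01]
z = causal_zscores(returns, window=3)
```

A quick sanity check for this kind of feature is to alter the later observations and confirm that earlier feature values do not change. A full-sample z-score would fail that check.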

2. Non-Stationarity

Financial relationships change over time:
- Regime shifts (bull/bear/crash)
- Structural breaks (policy changes, crises)
- Feature drift (predictive power degrades)
- Solution: Rolling training windows, online learning

3. Overfitting

Financial data has low signal-to-noise ratio:
- Many features, few independent observations
- Multiple testing problem (data mining bias)
- Solution: Out-of-sample testing, deflated Sharpe ratio
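The multiple testing problem can be made concrete with a small simulation: generate many strategies whose returns are pure noise and watch the best observed Sharpe climb as more are tried. This is a sketch under assumed parameters (Gaussian daily returns with zero mean and 1% volatility, 252 observations, annualization by the square root of 252); `null_sharpes` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

def null_sharpes(n_trials, n_obs, seed=0):
    """Annualized Sharpe ratios of strategies that have no edge by construction."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_trials):
        r = [rng.gauss(0.0, 0.01) for _ in range(n_obs)]  # zero-mean daily returns
        sharpes.append(mean(r) / stdev(r) * 252 ** 0.5)
    return sharpes

sharpes = null_sharpes(n_trials=200, n_obs=252)
for n in (1, 10, 50, 200):
    # Best Sharpe among the first n trials; non-decreasing in n by construction.
    print(n, round(max(sharpes[:n]), 2))
```

Every strategy here has a true Sharpe of zero, so whatever the best trial shows is selection bias alone.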

Walk-Forward Validation

The gold standard for ML trading backtests: train on [t_0, t_1], predict on (t_1, t_2], then roll the window forward and refit. This preserves temporal order, so the model never sees data from after its training cutoff. Two flavors: anchored (the training window grows) and rolling (a fixed-size training window, which adapts better when regimes shift). Used in the validate-and-deploy phase: every parameter choice gets tested out-of-sample on data the model has not touched.
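The two flavors can be sketched as an index generator. This is a minimal illustration that assumes samples are already sorted by time and that the window advances by one test block per step; `walk_forward_splits` and its parameters are hypothetical names.

```python
def walk_forward_splits(n_samples, train_size, test_size, anchored=False):
    """Yield (train_indices, test_indices) pairs in temporal order.

    anchored=False: rolling window of fixed length `train_size`.
    anchored=True:  training always starts at index 0 and grows each step.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        train_start = 0 if anchored else start
        train = list(range(train_start, train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test
        start += test_size  # advance by one test block and refit

rolling = list(walk_forward_splits(10, train_size=4, test_size=2))
anchored = list(walk_forward_splits(10, train_size=4, test_size=2, anchored=True))
```

In both variants every training index precedes every test index, which is the property that random splits break.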

Purged Cross-Validation with Embargo

Prevents information leakage from overlapping labels: when a label spans [t, t+h] (e.g., a 5-day forward return), any training sample whose label window overlaps a test sample's label window is purged, which in practice means training samples within h bars of the test fold. An additional embargo of e bars after each test fold drops a further buffer of training samples, because serially correlated features can still leak test-period information. Used during model selection: every hyperparameter is scored under purged CV so backtest Sharpes aren't inflated by leakage.
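A minimal sketch of the purging and embargo logic follows. It assumes one sample per bar, a fixed label horizon of `horizon` bars for every sample, and contiguous test folds; `purged_kfold` is a hypothetical helper, not a library function.

```python
def purged_kfold(n_samples, n_folds, horizon, embargo=0):
    """Yield (train, test) index lists for contiguous k-fold with purging and embargo.

    Sample j is assumed to carry a label spanning bars [j, j + horizon].
    """
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        a = k * fold_size
        b = n_samples if k == n_folds - 1 else a + fold_size  # test fold is [a, b)
        test = list(range(a, b))
        train = [
            j for j in range(n_samples)
            # keep j only if its label ends before the test fold starts ...
            if j + horizon < a
            # ... or it starts after the last test label ends, plus the embargo
            or j > (b - 1) + horizon + embargo
        ]
        yield train, test

splits = list(purged_kfold(n_samples=20, n_folds=4, horizon=2, embargo=1))
```

With these settings the second fold tests on bars 5 to 9 and trains on bars 0 to 2 and 13 to 19: bars 3 and 4 are purged because their labels reach into the test fold, and bars 10 to 12 are dropped by the purge and the one-bar embargo.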

Deflated Sharpe Ratio

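Corrects for multiple-testing bias: when you try N strategies and report the best, the expected maximum Sharpe under the null hypothesis of no skill grows with N. The DSR is a probability, not a value on the Sharpe scale: it estimates how likely the true Sharpe is to exceed that expected maximum, using the number of trials, the variance of trial Sharpes, the sample length, and the skewness and kurtosis of returns. A raw Sharpe of 2.0 selected from 1000 trials may have a DSR of only 0.5, i.e., no better than a coin flip that the edge is real. Used at the final selection gate before live deployment.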
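The calculation can be sketched with the standard library, following the formulas in Bailey and Lopez de Prado (2014) as I understand them. The sketch assumes the Sharpe ratio is per period (not annualized), that `kurt` is raw kurtosis (3 for a normal distribution), and that at least two trials were run. The function names and the example inputs are hypothetical.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials, var_trial_sharpes):
    """Approximate expected maximum Sharpe across n_trials (>= 2) zero-skill strategies."""
    z = NormalDist()
    return sqrt(var_trial_sharpes) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * e))
    )

def deflated_sharpe_ratio(sr, n_obs, skew, kurt, n_trials, var_trial_sharpes):
    """Probability that the true Sharpe exceeds the expected maximum under the null."""
    sr0 = expected_max_sharpe(n_trials, var_trial_sharpes)
    # Non-normality adjustment to the standard error of the Sharpe estimate.
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * sqrt(n_obs - 1) / denom)

# Hypothetical inputs: per-period Sharpe 0.1 over 1250 daily observations.
dsr_few = deflated_sharpe_ratio(0.1, 1250, skew=-0.5, kurt=5.0,
                                n_trials=10, var_trial_sharpes=0.0025)
dsr_many = deflated_sharpe_ratio(0.1, 1250, skew=-0.5, kurt=5.0,
                                 n_trials=1000, var_trial_sharpes=0.0025)
```

Holding everything else fixed, the same observed Sharpe earns a lower DSR when it was selected from more trials, because the benchmark it must beat is higher.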

Combinatorial Purged Cross-Validation

Lopez de Prado's method for robust backtesting: split the timeline into N groups, then evaluate each of the C(N, k) ways of choosing k test groups, training on the remaining N - k groups with purging and embargo applied. The test-group predictions recombine into (k/N) * C(N, k) full backtest paths instead of one, producing a distribution of Sharpes from which you can estimate the probability that the strategy is genuinely profitable. The paths share data, so they are not fully independent. Used as the bar for production strategies: judge the median Sharpe across paths, not the headline Sharpe of one path.
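The combinatorics can be sketched at the group level as follows. This illustration enumerates only which groups are used for training and testing; purging and embargo at each train/test boundary are omitted for brevity and would still be required. `cpcv_splits` and `n_backtest_paths` are hypothetical helper names.

```python
from itertools import combinations
from math import comb

def cpcv_splits(n_groups, k_test):
    """Yield (train_groups, test_groups) for every choice of k_test test groups."""
    groups = range(n_groups)
    for test_groups in combinations(groups, k_test):
        train_groups = tuple(g for g in groups if g not in test_groups)
        yield train_groups, test_groups

def n_backtest_paths(n_groups, k_test):
    """Number of full paths: (k/N) * C(N, k), which equals C(N - 1, k - 1)."""
    return comb(n_groups - 1, k_test - 1)

splits = list(cpcv_splits(n_groups=6, k_test=2))
paths = n_backtest_paths(6, 2)
```

With N = 6 and k = 2 this gives 15 train/test splits and 5 backtest paths, since each group appears as a test group in exactly 5 of the splits.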

Best Practices Checklist

[ ] Walk-forward validation used (not random splits)
[ ] Purging and embargo applied to prevent leakage from overlapping labels
[ ] Combinatorial CV for robustness testing
[ ] Deflated Sharpe Ratio calculated
[ ] Feature stability tested across runs
[ ] Out-of-sample period separate from all training
[ ] Transaction costs included in evaluation
[ ] Multiple market regimes in test data
[ ] No look-ahead bias in feature engineering
[ ] Survivorship bias addressed

References

Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
Bailey, D.H. & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management, 40(5), 94-107.
Gu, S., Kelly, B., & Xiu, D. (2020). "Empirical Asset Pricing via Machine Learning." Review of Financial Studies, 33(5), 2223-2273.