Regression Models for Trading¶
Difficulty beginner
Linear Regression¶
Simple Linear Regression¶
Y = β₀ + β₁X + ε
Y = dependent variable (return)
X = independent variable (predictor)
β₀ = intercept
β₁ = slope (coefficient)
ε = error term
The model posits that the response Y is a linear function of one predictor X plus zero-mean noise. It is the simplest predictive model, and the more complex models below generalize this form.
Ordinary Least Squares (OLS)¶
Minimize: Σ(Yᵢ - Ŷᵢ)²
β₁ = Cov(X,Y) / Var(X)
β₀ = Ȳ - β₁X̄
where:
- Ŷᵢ = β₀ + β₁Xᵢ: fitted value
- Yᵢ - Ŷᵢ: residual
- Ȳ, X̄: sample means
- Cov(X,Y), Var(X): sample covariance and variance

OLS picks the slope and intercept that minimize the sum of squared residuals. The slope is the covariance scaled by the predictor variance: the amount Y moves per unit move in X.
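The covariance-over-variance description of the slope can be sketched in a few lines of NumPy. The function name `simple_ols` and the synthetic data are illustrative assumptions, not part of the original material.

```python
import numpy as np

def simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope = sample covariance / sample variance (the 1/(n-1) factors cancel).
    slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # The fitted line always passes through the point of sample means.
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Synthetic, noise-free data (illustrative only): y = 0.5 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 0.5 + 2.0 * x
b0, b1 = simple_ols(x, y)
```

On noise-free data like this the fit should recover the intercept and slope up to floating-point error; on real returns the estimates are only as good as the assumptions in the next section.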
Multiple Regression¶
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
where:
- Y: dependent variable
- X₁ … Xₖ: k predictors
- βⱼ: partial slope for predictor j (the change in Y per unit change in Xⱼ, holding the other predictors fixed)
- ε: zero-mean error

Multiple regression generalizes simple regression to several predictors. Each βⱼ is the partial effect of its predictor, a subtle but consequential difference from running k separate simple regressions.
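To make the partial-versus-simple distinction concrete, here is a minimal sketch on simulated data with two correlated predictors. The helper `fit_ols` and the data-generating numbers are assumptions chosen for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """OLS with an intercept; returns [b0, b1, ..., bk]."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)      # x2 is correlated with x1
y = 1.0 * x1 + 1.0 * x2 + 0.1 * rng.normal(size=n)

joint = fit_ols(np.column_stack([x1, x2]), y)  # partial slopes, both near 1.0
alone = fit_ols(x1, y)                         # simple slope on x1 alone
```

In this setup the simple slope on x1 comes out near 1.8 rather than 1.0, because it absorbs the part of x2's effect that moves with x1. The joint fit separates the two.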
Assumptions and Diagnostics¶
OLS Assumptions¶
| # | Assumption | Test | Consequence of Violation |
|---|---|---|---|
| 1 | Linearity | Residual plot | Biased estimates |
| 2 | Independence | Durbin-Watson | Incorrect standard errors |
| 3 | Homoskedasticity | Breusch-Pagan | Inefficient estimates, incorrect standard errors |
| 4 | Normality of errors | Jarque-Bera | Unreliable inference in small samples |
| 5 | No multicollinearity | VIF | Unstable coefficients |
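Two of the diagnostics in the table are simple enough to compute by hand. The sketch below implements the Durbin-Watson statistic and the variance inflation factor directly in NumPy; the function names are illustrative, and a statistics library would normally be used in practice.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences over sum of squared residuals.
    Values near 2 suggest no first-order autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def vif(X):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing
    column j on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Perfectly alternating residuals: strong negative autocorrelation.
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Two orthogonal, centered predictors: no collinearity at all.
vifs = vif([[1, 1], [-1, 1], [1, -1], [-1, -1]])
```

A Durbin-Watson value well below 2 points to positive autocorrelation and a value well above 2 to negative autocorrelation. A VIF of 1 means a predictor is uncorrelated with the others.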
Robust Regression¶
When OLS Fails¶
Financial data often violates OLS assumptions:
- Fat-tailed errors (outliers)
- Heteroskedasticity (changing variance)
- Autocorrelation (time series)
Solutions¶
Heteroskedasticity- and Autocorrelation-Consistent (HAC) Standard Errors: Newey-West and similar estimators adjust the covariance matrix of the OLS coefficients without changing the point estimates. They allow for both heteroskedasticity and autocorrelation up to a chosen lag, which is the structure financial residuals commonly exhibit. Use them whenever you report t-stats on regressions of returns: when residuals are heteroskedastic or positively autocorrelated, vanilla OLS standard errors are often too small and you will see spurious "significance".
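A minimal NumPy sketch of a Newey-West covariance estimate follows. It assumes Bartlett kernel weights and no small-sample correction; the function name `newey_west_se` and the simulated AR(1) errors are illustrative, and library implementations may differ in scaling details.

```python
import numpy as np

def newey_west_se(X, y, lags):
    """OLS point estimates with Newey-West (Bartlett kernel) standard errors."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    Z = A * resid[:, None]               # row t is x_t * e_t
    S = Z.T @ Z                          # lag-0 term (White / HC0 "meat")
    for lag in range(1, lags + 1):
        w = 1.0 - lag / (lags + 1.0)     # Bartlett weight, declines with lag
        G = Z[lag:].T @ Z[:-lag]         # cross-products at this lag
        S += w * (G + G.T)
    bread = np.linalg.inv(A.T @ A)
    cov = bread @ S @ bread              # sandwich estimator
    return beta, np.sqrt(np.diag(cov))

# Simulated regression with AR(1) errors (illustrative only).
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 0.5 * x + e

beta0, se0 = newey_west_se(x, y, lags=0)   # heteroskedasticity-only
beta5, se5 = newey_west_se(x, y, lags=5)   # allows autocorrelation to lag 5
```

The coefficients are identical across the two calls; only the standard errors change, which is the point made in the paragraph above.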
Robust Regression (Huber, RANSAC): Robust estimators down-weight or ignore observations far from the fit, so a single fat-tailed shock does not dominate the coefficients. Huber regression blends squared loss near zero with absolute loss in the tails; RANSAC repeatedly fits on random subsets, keeps the fit with the largest set of inliers, and discards the remaining points as outliers. Useful when ten observations in a thousand are responsible for half the OLS coefficient, which is common in event-window regressions and small-sample cross-sectionals.
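One way to see the down-weighting in action is Huber regression fitted by iteratively reweighted least squares. This is a sketch under stated assumptions: the tuning constant 1.345, the MAD-based scale estimate, and the fixed iteration count are conventional choices made here, not taken from the original text.

```python
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=50):
    """Huber regression via iteratively reweighted least squares."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)       # start from OLS
    for _ in range(n_iter):
        resid = y - A @ beta
        # Robust scale from the median absolute deviation.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = np.abs(resid) / max(scale, 1e-12)
        # Weight 1 inside the threshold, delta/|u| beyond it.
        w = np.minimum(1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta

# Clean line y = 1 + 2x with five large shocks at the end (illustrative).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=100)
y[-5:] += 50.0

ols_beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(100), x]), y, rcond=None)
huber_beta = huber_irls(x, y)
```

Comparing `ols_beta[1]` with `huber_beta[1]` shows how far five contaminated points can pull the OLS slope while the Huber fit stays close to the underlying value of 2.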
Weighted Least Squares:
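Minimize: ΣWᵢ(Yᵢ - Ŷᵢ)², with exponentially decaying weights such as Wᵢ = λ^(n-i)
where:
- Wᵢ: weight on observation i (i = n is the most recent observation)
- λ: decay constant; values near 1 mean slow decay, values near 0 mean only the most recent points matter

Weighted least squares with decaying weights down-weights stale observations so the regression tracks the current regime. Common in adaptive beta/factor models where recent behaviour dominates.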
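The decay weighting can be sketched as follows, assuming the weight on observation i is λ raised to its age (so the newest row has weight 1). That specific weight formula, the function name, and the two-regime test data are assumptions for illustration.

```python
import numpy as np

def exp_weighted_ols(X, y, lam=0.97):
    """Weighted least squares with weights lam**age; rows are in time order."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), np.asarray(X, dtype=float)])
    w = lam ** np.arange(n - 1, -1, -1)   # oldest row gets lam**(n-1), newest gets 1
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns weighted least squares into plain least squares.
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta

# Slope is 1 for the first 200 observations, then 3 for the last 100.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.where(np.arange(300) < 200, 1.0, 3.0) * x

fast = exp_weighted_ols(x, y, lam=0.9)   # tracks the recent regime
flat = exp_weighted_ols(x, y, lam=1.0)   # equal weights, i.e. plain OLS
```

With λ = 0.9 the fitted slope sits near 3, the current regime, while the equal-weight fit lands somewhere between the two regimes.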
Regularization¶
Ridge Regression (L2)¶
Minimize: Σ(Yᵢ - Ŷᵢ)² + λΣβⱼ²
Shrinks coefficients but doesn't eliminate them
Good when many predictors are relevant
where:
- λ: regularization strength (chosen by cross-validation)
- Σβⱼ²: L2 penalty on coefficient magnitudes; larger λ means more shrinkage toward zero

Ridge adds a quadratic penalty to the OLS objective. It stabilizes estimates under multicollinearity and is the standard fix when correlated factors make plain OLS coefficients explode.
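The ridge objective has the closed-form solution β = (X'X + λI)⁻¹X'y. The sketch below applies it to two nearly collinear predictors; it assumes the columns are already centered so that no intercept is needed, and the data are simulated for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution; assumes centered X and y (no intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Two almost identical predictors (illustrative only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=200)

unpenalized = ridge(X, y, lam=0.0)   # plain OLS, unstable under collinearity
shrunk = ridge(X, y, lam=10.0)       # both coefficients pulled toward ~1
```

With λ = 0 the two coefficients can take large offsetting values; a modest penalty pulls them back toward similar magnitudes at the cost of a small bias.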
Lasso (L1)¶
Minimize: Σ(Yᵢ - Ŷᵢ)² + λΣ|βⱼ|
Can set coefficients to zero (feature selection)
Good when few predictors are relevant
where:
- λ: regularization strength
- Σ|βⱼ|: L1 penalty on absolute coefficient values; the non-differentiable corner at zero is what produces exact sparsity

Lasso simultaneously fits and selects features by zeroing weak ones. Preferred when you suspect only a handful of predictors matter, which is common for factor-zoo screening.
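Exact zeros come from the soft-thresholding step in coordinate descent. The sketch below minimizes the objective as written above, Σ(Yᵢ - Ŷᵢ)² + λΣ|βⱼ|, which puts the threshold at λ/2; libraries that scale the loss differently use a different λ convention. It assumes centered data with no intercept, and the names are illustrative.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for sum of squares + lam * L1 norm."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    beta = np.zeros(k)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(k):
            # Residual with predictor j's current contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Five candidate predictors, only the first two matter (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)
beta = lasso_cd(X, y, lam=50.0)
```

The three irrelevant coefficients end up exactly zero, while the two real ones are kept but shrunk slightly toward zero.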
Elastic Net¶
Minimize: Σ(Yᵢ - Ŷᵢ)² + λ₁Σ|βⱼ| + λ₂Σβⱼ²
where:
- λ₁: L1 weight (sparsity)
- λ₂: L2 weight (shrinkage)
- both penalties are applied simultaneously

Elastic Net blends Lasso's variable selection with Ridge's stability under correlated predictors. Default choice when groups of features are highly correlated and you want sparse-but-stable coefficients.
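The stability under correlated predictors shows up most clearly with a duplicated feature. This sketch uses coordinate descent on sum of squares + λ₁·L1 + λ₂·L2, assuming centered data with no intercept; the function name and penalty values are illustrative.

```python
import numpy as np

def elastic_net_cd(X, y, lam1, lam2, n_iter=500):
    """Coordinate descent for sum of squares + lam1 * L1 + lam2 * L2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            # L1 part: soft-threshold at lam1 / 2.
            shrunk = np.sign(rho) * max(abs(rho) - lam1 / 2.0, 0.0)
            # L2 part: extra lam2 in the denominator.
            beta[j] = shrunk / (col_sq[j] + lam2)
    return beta

# The same predictor entered twice (illustrative worst case).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])
y = 2.0 * x + 0.1 * rng.normal(size=200)

enet = elastic_net_cd(X, y, lam1=1.0, lam2=50.0)
```

With the L2 term present the weight is split evenly across the two copies. A pure L1 penalty has no unique solution here and can place all the weight on whichever copy is updated first.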
Logistic Regression¶
Binary Classification¶
P(Y=1|X) = 1 / (1 + e^(-(β₀ + β₁X₁ + … + βₖXₖ)))
where:
- P(Y=1|X): predicted probability of the positive class
- β₀ + β₁X₁ + … + βₖXₖ: linear score, passed through the sigmoid
- βⱼ: change in log-odds per unit of Xⱼ

Logistic regression is a linear model mapped through a sigmoid to produce probabilities. It is a standard binary classifier for up/down direction, trade/no-trade, default/no-default.
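A minimal sketch of fitting the sigmoid-of-linear-score model by gradient descent on the mean log-loss. The learning rate, iteration count, and simulated up/down labels are assumptions for illustration; production code would normally use a library solver.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Logistic regression with an intercept, fitted by gradient descent."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ beta)             # predicted P(Y=1|X)
        grad = A.T @ (p - y) / len(y)     # gradient of the mean log-loss
        beta -= lr * grad
    return beta

# Simulated labels with true log-odds -0.5 + 1.5 * x (illustrative only).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = (rng.random(n) < sigmoid(-0.5 + 1.5 * x)).astype(float)

beta = fit_logistic(x, y)
prob_up = sigmoid(beta[0] + beta[1] * 1.0)   # predicted probability at x = 1
```

Each unit increase in x adds `beta[1]` to the log-odds, so the fitted slope can be read directly as the log-odds change described above.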
Cross-Validation for Time Series¶
Why Standard CV Fails¶
Random shuffling destroys temporal structure. Shuffled K-fold CV trains on data from the future and tests on data from the past, a leakage severe enough that even a degenerate model can score well. Use expanding-window or rolling-window CV instead: at each step, train on [t₀, t] and test on (t, t+h], never letting any later observation enter the training set. (A rolling window also moves t₀ forward so the training length stays fixed.) The resulting score reflects what the model would actually have produced in walk-forward operation.
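The expanding-window scheme described above can be sketched as a small index generator. The function name and its parameters (`min_train`, `horizon`) are hypothetical; indices are 0-based positions in a time-ordered dataset.

```python
def expanding_window_splits(n_obs, min_train, horizon):
    """Yield (train_idx, test_idx) pairs in which every test index
    comes strictly after every training index."""
    t = min_train
    while t + horizon <= n_obs:
        train_idx = list(range(0, t))            # everything up to time t
        test_idx = list(range(t, t + horizon))   # the next `horizon` points
        yield train_idx, test_idx
        t += horizon                             # grow the training window

splits = list(expanding_window_splits(n_obs=10, min_train=4, horizon=2))
# First split trains on indices 0-3 and tests on 4-5; the last trains on 0-7
# and tests on 8-9.
```

A rolling-window variant would replace `range(0, t)` with a fixed-length range ending at `t`, which trades more data for faster adaptation to regime change.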
Non-Linear Regression¶
Polynomial Regression¶
Y = β₀ + β₁X + β₂X² + ... + βₖXᵏ + ε
Captures non-linear relationships
Risk of overfitting with high degree
where:
- X: single predictor expanded into powers X, X², …, Xᵏ
- k: polynomial degree
- the model is still linear in the coefficients βⱼ (and so still fit by OLS)

Polynomial regression captures curvature via a power basis expansion. Degree > 3 often overfits noisy data; prefer splines or kernel methods for serious nonlinearity.
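Because the model stays linear in the coefficients, fitting it is ordinary least squares on a matrix of powers. The sketch below uses `np.vander` to build that matrix; the helper names and synthetic data are illustrative.

```python
import numpy as np

def fit_poly(x, y, degree):
    """Fit Y = b0 + b1*X + ... + bk*X^k by OLS on the power basis."""
    A = np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return beta

def in_sample_sse(x, y, degree):
    """Sum of squared residuals on the data used for fitting."""
    A = np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)
    return float(np.sum((y - A @ fit_poly(x, y, degree)) ** 2))

# Noise-free quadratic, then the same curve with noise (illustrative).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y_clean = 1.0 - 2.0 * x + 3.0 * x ** 2
y_noisy = y_clean + 0.2 * rng.normal(size=30)

beta = fit_poly(x, y_clean, degree=2)   # recovers [1, -2, 3]
sse_2 = in_sample_sse(x, y_noisy, 2)
sse_8 = in_sample_sse(x, y_noisy, 8)    # never larger than sse_2
```

In-sample error can only fall as the degree rises, because the lower-degree model is nested in the higher one. That is why degree should be judged on held-out data rather than on fit quality.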
Kernel Regression¶
Ŷ(x) = ΣK((x - Xᵢ)/h)Yᵢ / ΣK((x - Xᵢ)/h)
where:
- K: kernel function (Gaussian, Epanechnikov) that sets the weight of each observation
- h: bandwidth; controls the bias-variance tradeoff

The estimate at x is a locally weighted average of the Yᵢ. This is non-parametric smoothing: it lets the data shape the function instead of imposing a global form. Used for nonlinear feature engineering and visual diagnostics of curvature in y vs x.
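A minimal sketch of the locally weighted average, assuming the Nadaraya-Watson form with a Gaussian kernel. The function name and the toy data are illustrative.

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, h):
    """Kernel-weighted average of y_train at each query point."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Scaled distance from every query point to every training point.
    u = (x_query[:, None] - x_train[None, :]) / h
    w = np.exp(-0.5 * u ** 2)               # Gaussian kernel weights
    return (w @ y_train) / w.sum(axis=1)    # normalize weights per query

x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.0, 1.0, 4.0, 9.0]

narrow = kernel_regression(x_train, y_train, [1.0], h=0.01)  # hugs the data
wide = kernel_regression(x_train, y_train, [1.0], h=1e6)     # global average
```

The two extremes show the bias-variance tradeoff: a tiny bandwidth reproduces the nearest observation, while a huge one flattens the estimate to the sample mean of y.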
Practical Applications¶
Key Metrics¶
| Metric | Formula | Interpretation |
|---|---|---|
| R² | 1 - SS_res/SS_tot | Variance explained |
| Adjusted R² | 1 - (1-R²)(n-1)/(n-k-1) | Penalizes extra variables |
| RMSE | √MSE | Prediction error in original units |
| Information Coefficient | Corr(actual, predicted) | Signal quality (-1 to 1) |
| Hit Rate | Correct predictions / Total | Directional accuracy |
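The metrics in the table can be computed together as below. Treating a "correct prediction" as a sign match between actual and predicted values is an assumption made here, as are the function name and the toy return series.

```python
import numpy as np

def regression_metrics(y, y_hat, k):
    """Metrics from the table; k is the number of predictors in the model."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1),
        "rmse": np.sqrt(ss_res / n),
        "ic": np.corrcoef(y, y_hat)[0, 1],
        # Assumption: a prediction is "correct" when its sign matches.
        "hit_rate": np.mean(np.sign(y) == np.sign(y_hat)),
    }

actual = [0.01, -0.02, 0.03, -0.01]
perfect = regression_metrics(actual, actual, k=1)
one_miss = regression_metrics(actual, [0.01, 0.02, 0.03, -0.01], k=1)
```

A model can have a respectable hit rate and still a poor RMSE (or the reverse), so it is worth reporting the directional and magnitude metrics side by side.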
Key Formulas Reference¶
OLS: β = (X'X)⁻¹X'y
R²: 1 - Σ(y-ŷ)²/Σ(y-ȳ)²
Ridge: β = (X'X + λI)⁻¹X'y
Logistic: P = 1/(1+e^(-Xβ))
IC: Corr(y, ŷ)
Next Steps¶
- Monte Carlo Methods — Simulation techniques
- Factor Investing — Regression application
- Machine Learning — Advanced modeling