Regression Models for Trading

Difficulty: beginner

Linear Regression

Simple Linear Regression

Y = β₀ + β₁X + ε

Y = dependent variable (return)
X = independent variable (predictor)
β₀ = intercept
β₁ = slope (coefficient)
ε = error term

does: posits that the response Y is a linear function of one predictor X plus zero-mean noise. It is the simplest predictive model; most of the models below extend or generalize this form.

Ordinary Least Squares (OLS)

Minimize: Σ(Yᵢ - Ŷᵢ)²

Solution:
β₁ = Cov(X,Y) / Var(X)
β₀ = Ȳ - β₁X̄

where: Ŷᵢ = β₀ + β₁Xᵢ fitted value · Yᵢ - Ŷᵢ residual · Ȳ, X̄ sample means · Cov(X,Y), Var(X) sample covariance and variance. does: picks the slope and intercept that minimize the sum of squared residuals. The slope is the covariance scaled by the predictor's variance: the expected change in Y per unit change in X.
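The closed-form solution can be checked in a few lines. This is a minimal sketch using only the Python standard library; the function name `ols_simple` and the return figures are hypothetical, made up for illustration.

```python
from statistics import mean

def ols_simple(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    # The 1/(n-1) factors in Cov and Var cancel, so raw sums suffice.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: market return (x) vs. asset return (y), in percent.
market = [1.0, -0.5, 2.0, 0.3, -1.2]
asset = [1.4, -0.9, 2.9, 0.2, -1.6]

b0, b1 = ols_simple(market, asset)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(market, asset)]
```

With an intercept in the model, the residuals sum to zero up to rounding, which makes a quick sanity check on any hand-rolled fit.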

Multiple Regression

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε

where: Y dependent variable · X₁ … Xₖ k predictors · β_j partial slope for predictor j (the change in Y per unit change in Xⱼ holding the other predictors fixed) · ε zero-mean error. does: generalizes simple regression to multiple predictors. Each β_j is the partial effect of its predictor, which generally differs from the slope of a separate simple regression on Xⱼ whenever the predictors are correlated. The difference is subtle but consequential.
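To see why a partial slope differs from a simple-regression slope, the sketch below solves the two-predictor normal equations directly on centered data. The helper names and the data are hypothetical; the data are constructed so that Y = 1 + 2·X₁ - X₂ exactly, with X₁ and X₂ correlated.

```python
from statistics import mean

def center(v):
    m = mean(v)
    return [vi - m for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def ols_two_predictors(x1, x2, y):
    """Return (b0, b1, b2) for y = b0 + b1*x1 + b2*x2 via the normal equations."""
    c1, c2, cy = center(x1), center(x2), center(y)
    s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
    s1y, s2y = dot(c1, cy), dot(c2, cy)
    det = s11 * s22 - s12 ** 2  # zero when x1 and x2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2

# Hypothetical data: y = 1 + 2*x1 - 1*x2 exactly, with x1 and x2 correlated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 1.5, 3.5, 3.0, 5.5]
y = [1 + 2 * a - b for a, b in zip(x1, x2)]

b0, b1, b2 = ols_two_predictors(x1, x2, y)

# Regressing y on x1 alone absorbs part of x2's (negative) effect.
simple_slope = dot(center(x1), center(y)) / dot(center(x1), center(x1))
```

The joint fit recovers the partial slope of 2 on X₁, while the simple regression reports a much smaller slope because X₂ moves with X₁ and pulls Y the other way.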

Assumptions and Diagnostics

OLS Assumptions

| # | Assumption | Test | Consequence of Violation |
|---|---|---|---|
| 1 | Linearity | Residual plot | Biased estimates |
| 2 | Independence | Durbin-Watson | Incorrect standard errors |
| 3 | Homoskedasticity | Breusch-Pagan | Inefficient estimates, incorrect standard errors |
| 4 | Normality of errors | Jarque-Bera | Invalid inference in small samples |
| 5 | No multicollinearity | VIF | Unstable coefficients |
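As one concrete diagnostic from the table, the Durbin-Watson statistic can be computed directly from the residuals. This is a minimal sketch with hypothetical residual series; values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual series.
trending = [1.0, 0.9, 0.8, 0.7, -0.7, -0.8, -0.9, -1.0]     # positively autocorrelated
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # negatively autocorrelated

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
```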

Robust Regression

When OLS Fails

Financial data often violates OLS assumptions:
- Fat-tailed errors (outliers)
- Heteroskedasticity (changing variance)
- Autocorrelation (time series)

Solutions

Heteroskedasticity- and Autocorrelation-Consistent (HAC) Standard Errors: Newey-West and similar estimators adjust the covariance matrix of the OLS coefficients without changing the point estimates. They allow for both heteroskedasticity and autocorrelation up to a chosen lag, which is exactly the structure financial residuals exhibit. Use them whenever you report t-stats on regressions of returns: with positively autocorrelated residuals, vanilla OLS standard errors are typically too small and you will see spurious "significance".
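The adjustment can be sketched for the slope of a one-predictor regression. This assumes Bartlett weights 1 - l/(L+1) and no small-sample correction, which is one common formulation rather than the only one; the function name and data are hypothetical. With `max_lag=0` it reduces to the heteroskedasticity-only (White) standard error.

```python
from statistics import mean

def newey_west_slope_se(x, y, max_lag):
    """Return (slope, HAC standard error) for y = b0 + b1*x + e.

    Assumes Bartlett weights and no small-sample correction.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    dx = [xi - x_bar for xi in x]
    s_xx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / s_xx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Score contributions u_t = (x_t - x_bar) * e_t.
    u = [d * e for d, e in zip(dx, resid)]
    long_run = sum(ut * ut for ut in u)  # lag 0: the White (HC0) term
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1)
        gamma = sum(u[t] * u[t - lag] for t in range(lag, n))
        long_run += 2.0 * weight * gamma

    # Guard against a tiny negative value caused by rounding.
    variance = max(long_run, 0.0) / s_xx ** 2
    return slope, variance ** 0.5

# Hypothetical predictor and return series.
x = [0.5, 1.2, -0.3, 0.8, -1.1, 0.4, 1.5, -0.7, 0.1, -0.9]
y = [0.7, 1.0, -0.6, 1.1, -0.8, 0.1, 1.9, -0.2, 0.3, -1.2]

slope, se_white = newey_west_slope_se(x, y, max_lag=0)
_, se_hac = newey_west_slope_se(x, y, max_lag=2)
```

The slope is identical in both calls; only the standard error changes with the lag choice.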

Robust Regression (Huber, RANSAC): Robust estimators down-weight or ignore observations far from the fit, so a single fat-tailed shock does not dominate the coefficients. Huber regression blends squared loss near zero with absolute loss in the tails; RANSAC iteratively fits on inlier subsets and discards outliers. Useful when ten observations in a thousand are responsible for half the OLS coefficient — common in event-window regressions and small-sample cross-sectionals.
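A minimal sketch of Huber regression via iteratively reweighted least squares (IRLS) for a single predictor. It assumes the threshold `delta` is given in the units of Y (a production implementation would normally rescale residuals by a robust scale estimate first); the function names and data are hypothetical.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1*x with observation weights w."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def huber_line_fit(x, y, delta=1.0, n_iter=100):
    """Huber regression by IRLS, starting from the OLS fit."""
    b0, b1 = weighted_line_fit(x, y, [1.0] * len(x))
    for _ in range(n_iter):
        resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        # Full weight inside the threshold, weight delta/|r| beyond it.
        w = [1.0 if abs(r) <= delta else delta / abs(r) for r in resid]
        b0, b1 = weighted_line_fit(x, y, w)
    return b0, b1

# Hypothetical data: y = 2x exactly, plus one fat-tailed shock on the last point.
x = [float(i) for i in range(10)]
y = [2.0 * xi for xi in x]
y[-1] += 50.0

ols_b0, ols_b1 = weighted_line_fit(x, y, [1.0] * len(x))
hub_b0, hub_b1 = huber_line_fit(x, y)
```

On this data the single shock drags the OLS slope far above 2, while the Huber slope stays close to it because the outlier's influence is capped at `delta`.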

Weighted Least Squares:

Weight recent observations more heavily
Wᵢ = λⁱ where 0 < λ < 1 and i is the age of the observation (i = 0 for the most recent; exponential decay)

where: Wᵢ weight on the observation that is i periods old · λ decay constant: values near 1 mean slow decay, values near 0 mean only the most recent points matter. does: down-weights stale observations so the regression tracks the current regime. Common in adaptive beta/factor models where recent behaviour dominates.
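A minimal sketch of exponentially weighted least squares for one predictor, assuming the exponent counts the age of each observation (0 for the most recent). The function names and the two-regime data are hypothetical: the true slope is 1 in the first half of the sample and 3 in the second.

```python
def exp_weights(n, lam):
    """Weights lam**age, where age 0 is the most recent of n observations."""
    return [lam ** (n - 1 - t) for t in range(n)]

def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1*x with observation weights w."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical regime change: slope 1 for t < 50, slope 3 afterwards.
n = 100
x = [float(t % 5) for t in range(n)]
y = [(1.0 if t < 50 else 3.0) * x[t] for t in range(n)]

_, slope_equal = weighted_line_fit(x, y, [1.0] * n)             # plain OLS
_, slope_decay = weighted_line_fit(x, y, exp_weights(n, 0.9))   # recent data dominates
```

Equal weighting averages the two regimes, while the decayed fit tracks the recent slope.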

Regularization

Ridge Regression (L2)

Minimize: Σ(Yᵢ - Ŷᵢ)² + λΣβⱼ²

Shrinks coefficients but doesn't eliminate them
Good when many predictors are relevant

where: λ regularization strength (chosen by cross-validation) · Σβⱼ² L2 penalty on coefficient magnitudes · larger λ = more shrinkage toward zero. does: adds a quadratic penalty to the OLS objective. Stabilizes estimates under multicollinearity; a standard fix when correlated factors make plain OLS coefficients explode. In practice the intercept is usually left unpenalized and predictors are standardized so the penalty treats them on a common scale.
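A minimal sketch of the ridge solution for two nearly collinear predictors, solving (X'X + λI)β = X'y on centered data so the intercept is not penalized. Function names and data are hypothetical; for simplicity the predictors are not standardized here.

```python
from statistics import mean

def center(v):
    m = mean(v)
    return [vi - m for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X'X + lam*I) beta = X'y for two centered predictors."""
    c1, c2, cy = center(x1), center(x2), center(y)
    a11 = dot(c1, c1) + lam
    a22 = dot(c2, c2) + lam
    a12 = dot(c1, c2)
    r1, r2 = dot(c1, cy), dot(c2, cy)
    det = a11 * a22 - a12 ** 2
    return (a22 * r1 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det

# Hypothetical, nearly collinear factors: x2 is x1 plus a small wiggle.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.01, 1.98, 3.02, 3.99, 5.01, 5.98]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # roughly x1 + x2

ols = ridge_two_predictors(x1, x2, y, lam=0.0)    # lam = 0 is plain OLS
ridge = ridge_two_predictors(x1, x2, y, lam=1.0)

def norm(b):
    return (b[0] ** 2 + b[1] ** 2) ** 0.5
```

Comparing `norm(ols)` with `norm(ridge)` shows the shrinkage: the ridge coefficient vector is never longer than the OLS one.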

Lasso (L1)

Minimize: Σ(Yᵢ - Ŷᵢ)² + λΣ|βⱼ|

Can set coefficients to zero (feature selection)
Good when few predictors are relevant

where: λ regularization strength · Σ|βⱼ| L1 penalty on absolute coefficient values · the non-differentiable corner at zero is what produces exact sparsity. does: simultaneously fits and selects features by zeroing weak ones. Preferred when you suspect only a handful of predictors matter — common for factor-zoo screening.
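The exact-zero behaviour is easiest to see with a single predictor, where the Lasso objective above has a closed-form solution by soft-thresholding. This is a sketch for that special case only (multi-predictor Lasso is usually solved iteratively, for example by coordinate descent); the function names are hypothetical and the intercept is left unpenalized.

```python
from statistics import mean

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_one_predictor(x, y, lam):
    """Slope minimising sum((y - b0 - b1*x)^2) + lam*|b1| for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    # Setting the derivative to zero gives b1 = S(s_xy, lam/2) / s_xx.
    return soft_threshold(s_xy, lam / 2.0) / s_xx

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # exact line with slope 2

slope_ols = lasso_one_predictor(x, y, lam=0.0)
slope_shrunk = lasso_one_predictor(x, y, lam=4.0)
slope_zeroed = lasso_one_predictor(x, y, lam=30.0)
```

Here the slope goes from 2 to 1.6 under a moderate penalty and to exactly 0 once λ/2 exceeds the covariance sum, which is the feature-selection effect.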

Elastic Net

Minimize: Σ(Yᵢ - Ŷᵢ)² + λ₁Σ|βⱼ| + λ₂Σβⱼ²

Combines L1 and L2 penalties

where: λ₁ L1 weight (sparsity) · λ₂ L2 weight (shrinkage) · both penalties applied simultaneously. does: blends Lasso's variable selection with Ridge's stability under correlated predictors. Default choice when groups of features are highly correlated and you want sparse-but-stable coefficients.

Logistic Regression

Binary Classification

P(Y=1|X) = 1 / (1 + e^(-(β₀ + β₁X₁ + ... + βₖXₖ)))

Output: probability between 0 and 1

where: P(Y=1|X) predicted probability of the positive class · linear score β₀ + β₁X₁ + … passed through the sigmoid · β_j log-odds change per unit of X_j. does: linear model mapped through a sigmoid to produce probabilities. Standard binary classifier for up/down direction, trade/no-trade, default/no-default.
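A minimal sketch of the prediction step and of the log-odds interpretation of a coefficient. The coefficient values and the signal names in the comments are hypothetical; fitting the coefficients (by maximum likelihood) is not shown.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(b0, betas, x):
    """P(Y=1 | x) for intercept b0 and coefficient list betas."""
    score = b0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(score)

def log_odds(p):
    return math.log(p / (1.0 - p))

# Hypothetical coefficients: intercept, momentum signal, volatility signal.
b0, betas = -0.2, [0.8, -0.5]

p_base = predict_proba(b0, betas, [0.0, 1.0])
p_bumped = predict_proba(b0, betas, [1.0, 1.0])  # momentum up by one unit

# The log-odds move by exactly the momentum coefficient (0.8).
delta_log_odds = log_odds(p_bumped) - log_odds(p_base)
```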

Cross-Validation for Time Series

Why Standard CV Fails

Random shuffling destroys temporal structure. Shuffled K-fold CV trains on data from the future and tests on data from the past, a leakage so severe that even a degenerate model can score well. Use expanding-window or rolling-window CV instead: at each step, train on [t₀, t] (expanding) or on the most recent w observations up to t (rolling), and test on (t, t+h], never letting any later observation enter the training set. The resulting score reflects what the model would actually have produced in walk-forward operation.
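A minimal sketch of a walk-forward split generator covering both variants. The function name and its parameters (`min_train`, `horizon`, `window`) are hypothetical, and the step between folds is assumed to equal the test horizon.

```python
def walk_forward_splits(n_obs, min_train, horizon, window=None):
    """Yield (train_indices, test_indices); every test index follows every train index.

    window=None gives an expanding window; an integer gives a rolling
    window of at most that many observations.
    """
    end = min_train
    while end + horizon <= n_obs:
        start = 0 if window is None else max(0, end - window)
        yield list(range(start, end)), list(range(end, end + horizon))
        end += horizon

expanding = list(walk_forward_splits(n_obs=10, min_train=4, horizon=2))
rolling = list(walk_forward_splits(n_obs=10, min_train=4, horizon=2, window=4))

for train_idx, test_idx in expanding:
    print("train", train_idx[0], "to", train_idx[-1], "| test", test_idx)
```

The expanding variant keeps all history in each fold; the rolling variant drops the oldest observations, which can suit regimes that drift.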

Non-Linear Regression

Polynomial Regression

Y = β₀ + β₁X + β₂X² + ... + βₖXᵏ + ε

Captures non-linear relationships
Risk of overfitting with high degree

where: X single predictor expanded into powers X, X², …, Xᵏ · k polynomial degree · still linear in the coefficients β_j (and so still fit by OLS). does: captures curvature via power basis expansion. Degree > 3 typically overfits — prefer splines or kernel methods for serious nonlinearity.

Kernel Regression

Smooth non-parametric estimation
No functional form assumption

where: local-weighted average where weights come from a kernel function (Gaussian, Epanechnikov) over a bandwidth h · bandwidth controls bias-variance tradeoff. does: non-parametric smoothing — lets the data shape the function instead of imposing a global form. Used for nonlinear feature engineering and visual diagnostics of curvature in y vs x.
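A minimal sketch of the locally weighted average (the Nadaraya-Watson form) with a Gaussian kernel. The function names and the quadratic test data are hypothetical; the kernel's normalizing constant is omitted because it cancels in the ratio.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u)

def kernel_regression(x_train, y_train, x0, h):
    """Kernel-weighted average of y_train, evaluated at x0 with bandwidth h."""
    weights = [gaussian_kernel((x0 - xi) / h) for xi in x_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)

# Hypothetical curved relationship: y = x^2 on a grid from -2 to 2.
xs = [i / 10 for i in range(-20, 21)]
ys = [xi ** 2 for xi in xs]

narrow = kernel_regression(xs, ys, x0=0.0, h=0.1)  # follows the local shape
wide = kernel_regression(xs, ys, x0=0.0, h=5.0)    # oversmooths toward the global mean
```

At x = 0 the true value is 0: the narrow bandwidth stays close to it, while the wide bandwidth averages over nearly the whole sample and is badly biased, which is the bias-variance tradeoff the bandwidth controls.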

Practical Applications

Key Metrics

| Metric | Formula | Interpretation |
|---|---|---|
| R² | 1 - SS_res/SS_tot | Variance explained |
| Adjusted R² | 1 - (1-R²)(n-1)/(n-k-1) | Penalizes extra variables |
| RMSE | √MSE | Prediction error in original units |
| Information Coefficient | Corr(actual, predicted) | Signal quality (-1 to 1) |
| Hit Rate | Correct predictions / Total | Directional accuracy |
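The metrics in the table can be sketched with the standard library alone. The function names and the return figures are hypothetical; the Information Coefficient here is the Pearson correlation, as in the table, and a zero return is counted as "not up" in the hit rate.

```python
from statistics import mean

def r_squared(actual, predicted):
    y_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """n observations, k predictors (excluding the intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def rmse(actual, predicted):
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return mse ** 0.5

def information_coefficient(actual, predicted):
    a_bar, p_bar = mean(actual), mean(predicted)
    cov = sum((a - a_bar) * (p - p_bar) for a, p in zip(actual, predicted))
    var_a = sum((a - a_bar) ** 2 for a in actual)
    var_p = sum((p - p_bar) ** 2 for p in predicted)
    return cov / (var_a * var_p) ** 0.5

def hit_rate(actual, predicted):
    # A hit is a matching sign; zero is treated as "not up".
    hits = sum(1 for a, p in zip(actual, predicted) if (a > 0) == (p > 0))
    return hits / len(actual)

# Hypothetical realised and forecast returns, in percent.
actual = [1.0, -2.0, 0.5, 3.0, -1.5]
predicted = [0.5, -1.0, -0.5, 2.0, -1.0]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(actual), k=1)
err = rmse(actual, predicted)
ic = information_coefficient(actual, predicted)
hits = hit_rate(actual, predicted)
```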

Key Formulas Reference

OLS: β = (X'X)⁻¹X'y
R²: 1 - Σ(y-ŷ)²/Σ(y-ȳ)²
Ridge: β = (X'X + λI)⁻¹X'y
Logistic: P = 1/(1+e^(-Xβ))
IC: Corr(y, ŷ)

Next Steps