Probability Theory for Trading

Difficulty beginner

Fundamentals

Basic Definitions

Sample Space (Ω) — Set of all possible outcomes

Coin flip: Ω = {Heads, Tails}
Stock price tomorrow: Ω = [0, ∞)

Event (E) — Subset of sample space

E = {Stock price > $100 tomorrow}

Probability (P) — Measure of likelihood

0 ≤ P(E) ≤ 1
P(Ω) = 1
P(∅) = 0

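where: P(E) probability of event E · Ω sample space (all outcomes) · ∅ empty set (no outcomes). does: states the basic bounds on any probability measure: every probability lies in [0,1], the entire sample space has probability 1, and the impossible event has probability 0. Strictly, Kolmogorov's three axioms are non-negativity, P(Ω) = 1, and countable additivity over disjoint events; the upper bound of 1 and P(∅) = 0 follow from them.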

Probability Rules

Addition Rule:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
If mutually exclusive: P(A ∪ B) = P(A) + P(B)

where: A ∪ B union (A or B or both) · A ∩ B intersection (both A and B) · "mutually exclusive" means A and B cannot occur together. does: computes probability of "either event" — subtracting the overlap prevents double-counting outcomes shared by both events.

Multiplication Rule:

P(A ∩ B) = P(A) × P(B|A)
If independent: P(A ∩ B) = P(A) × P(B)

where: P(B|A) probability of B given A occurred · independence means A's occurrence carries no information about B. does: probability of two events both occurring. Independence collapses the conditional into the marginal. Many simple asset-return models assume it, and in doing so discard real dependence in the data, such as volatility clustering.
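As a quick check of both rules, the sketch below enumerates one roll of a fair six-sided die and works in exact fractions. The die and the events A and B are illustrative assumptions, not part of the text above.

```python
from fractions import Fraction

# Illustrative sample space: one roll of a fair six-sided die.
omega = set(range(1, 7))
A = {x for x in omega if x % 2 == 0}  # "roll is even"    -> {2, 4, 6}
B = {x for x in omega if x > 3}       # "roll is above 3" -> {4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

# Addition rule: subtract the overlap so {4, 6} is not counted twice.
p_union = prob(A) + prob(B) - prob(A & B)

# Multiplication rule: P(A and B) = P(A) * P(B|A).
p_b_given_a = prob(A & B) / prob(A)
p_joint = prob(A) * p_b_given_a

print(p_union, p_b_given_a, p_joint)  # 2/3 2/3 1/3
```

Here P(B|A) = 2/3 differs from P(B) = 1/2, so A and B are not independent and the shortcut P(A) × P(B) = 1/4 would understate the joint probability of 1/3.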

Bayes' Theorem:

P(A|B) = P(B|A) × P(A) / P(B)

Posterior = Likelihood × Prior / Evidence

where: P(A) prior belief about A · P(B|A) likelihood of observing B if A is true · P(B) total probability of B (normalizing constant) · P(A|B) posterior, the updated belief after seeing B. does: the mechanical rule for updating beliefs with new evidence. One of the most useful equations in quant finance for sequential decision-making.
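A minimal sketch of one Bayesian update, assuming a two-hypothesis setting (A true or A false). The function name `bayes_posterior` and all the numbers are hypothetical: a 60% prior that the market is in a bull regime, and a signal that fires 70% of the time in bull regimes and 20% of the time otherwise.

```python
def bayes_posterior(prior, likelihood, likelihood_if_not):
    """P(A|B) from P(A), P(B|A) and P(B|not A)."""
    # Evidence P(B) via the law of total probability.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical inputs: P(bull) = 0.6, P(signal|bull) = 0.7, P(signal|bear) = 0.2.
posterior = bayes_posterior(prior=0.6, likelihood=0.7, likelihood_if_not=0.2)
print(round(posterior, 2))  # 0.84
```

Feeding the posterior back in as the next prior gives the sequential updating described above.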

Conditional Probability

Definition

P(A|B) = P(A ∩ B) / P(B)

Probability of A given that B has occurred

where: P(A|B) conditional probability of A given B · P(A ∩ B) joint probability of both A and B · P(B) marginal probability of B (must be > 0). does: restricts the sample space to outcomes consistent with B, then asks how much of that restricted space is also in A. Backbone of Bayesian inference.

Random Variables

Discrete Random Variables

Probability Mass Function (PMF):

p(x) = P(X = x)

Examples: Number of winning trades in N attempts

where: X discrete random variable · x specific value · p(x) probability X takes that exact value · for discrete RVs, Σ p(x) = 1 over all possible values. does: assigns a probability to every possible outcome of a discrete RV. The discrete analogue of a probability density function.

Binomial Distribution:

P(X = k) = C(n,k) × pᵏ × (1-p)ⁿ⁻ᵏ

n = number of trials
k = number of successes
p = probability of success

Mean: np
Variance: np(1-p)

where: C(n,k) binomial coefficient (n choose k) = n!/(k!(n−k)!) · p^k probability of k specific successes · (1−p)^(n−k) probability of n−k specific failures. does: probability of exactly k wins in n independent trades, each with win probability p. The natural distribution for "in N trades, how often will I win X times?"
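The formula translates directly into code. The sketch below assumes a hypothetical strategy with 20 independent trades and a 55% win probability, and checks the mean and variance against np and np(1-p).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k wins in n independent trades."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical strategy: 20 trades, 55% win probability per trade.
n, p = 20, 0.55
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
p_losing_majority = sum(pmf[:10])  # P(X <= 9): fewer than half the trades win

print(round(mean, 6), round(variance, 6))  # 11.0 4.95
```

`p_losing_majority` shows how often a strategy with a genuine edge still loses more trades than it wins over a short run.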

Continuous Random Variables

Probability Density Function (PDF):

P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx

f(x) ≥ 0
∫₋∞^∞ f(x)dx = 1

where: f(x) density at x (not a probability — densities can exceed 1) · ∫ₐᵇ f(x)dx area under the density between a and b · the two constraints are non-negativity and total area = 1. does: continuous analogue of the PMF — probability is the integral of the density over a range, never the value at a single point. Used for any continuous return / price / vol model.

Cumulative Distribution Function (CDF):

F(x) = P(X ≤ x) = ∫₋∞ˣ f(t)dt

where: F(x) cumulative distribution function · P(X ≤ x) probability X is at or below x · the integral runs from −∞ to x, accumulating density. does: maps any value x to the probability of being at or below it. Inverse CDF gives quantiles — the function behind VaR, percentile stops, and confidence-interval cut-offs.
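To illustrate the CDF and its inverse, the sketch below uses `statistics.NormalDist` from the Python standard library. The normal model and its parameters (0.05% daily mean, 1% daily volatility) are assumptions chosen for the example, not a recommendation for modelling returns.

```python
from statistics import NormalDist

# Hypothetical daily-return model: normal with 0.05% mean and 1% volatility.
daily = NormalDist(mu=0.0005, sigma=0.01)

p_loss_worse_than_2pct = daily.cdf(-0.02)  # F(x): P(return <= -2%)
q05 = daily.inv_cdf(0.05)                  # inverse CDF: 5% quantile of returns
var_95 = -q05                              # 95% one-day VaR, quoted as a positive loss

print(round(q05, 5))  # -0.01595
```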

Common Distributions in Trading

[Figure: normal density (symmetric around μ, thin tails) beside a log-normal density for prices (no negative values, long right tail)]
Asset returns are often modelled as normal, asset prices as log-normal (cannot go negative).

[Figure: fat-tailed density (Student-t, low df) overlaid on a normal density (thin tails), with extreme losses and extreme gains in the tails]
Real return distributions have fatter tails than the normal: 4-sigma "100-year" events happen far more often than the ~0.006% of the time a normal model predicts.
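The tail comparison can be made concrete. The sketch below computes the two-sided probability of a move beyond 4 standard deviations under a normal model and under a Student-t with 3 degrees of freedom. The choice of 3 degrees of freedom is an assumption made because its CDF has a simple closed form; it is not a fitted value.

```python
from math import atan, pi, sqrt
from statistics import NormalDist

def student_t3_cdf(t):
    """Closed-form CDF of Student's t with 3 degrees of freedom."""
    u = t / sqrt(3)
    return 0.5 + (u / (1 + u * u) + atan(u)) / pi

# Two-sided probability of a move beyond 4 standard deviations.
normal_tail = 2 * (1 - NormalDist().cdf(4))

# A t(3) variable has standard deviation sqrt(3), so "4 sigma" is 4 * sqrt(3).
t3_tail = 2 * (1 - student_t3_cdf(4 * sqrt(3)))

print(f"normal: {normal_tail:.4%}   t(3): {t3_tail:.4%}")
```

Under these assumptions the normal model gives roughly 0.006% and the t(3) model roughly 0.6%, about a hundred times more often.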

Expected Value

Definition

E[X] = Σ xᵢ × p(xᵢ)          (discrete)
E[X] = ∫ x × f(x)dx          (continuous)

where: E[X] expected value (mean) of random variable X · xᵢ discrete outcome · p(xᵢ) probability of that outcome · f(x) probability density function (continuous case). does: probability-weighted average of all possible values. The single most useful summary of a distribution and the foundation of every decision-under-uncertainty framework.
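A small worked instance of the discrete formula, assuming a hypothetical trade that wins $200 with probability 0.40 and loses $100 with probability 0.60 (losses are entered as negative numbers).

```python
# Hypothetical trade outcomes as (P&L in dollars, probability) pairs.
outcomes = [(200.0, 0.40), (-100.0, 0.60)]

def expected_value(outcomes):
    """Probability-weighted average of the outcomes."""
    return sum(x * p for x, p in outcomes)

ev = expected_value(outcomes)
print(round(ev, 2))  # 20.0
```

The trade loses more often than it wins, yet its expected value is positive because the average win is twice the average loss.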

Variance and Moments

Variance

Var(X) = E[(X - μ)²] = E[X²] - (E[X])²

Measures spread of distribution

where: Var(X) variance · μ = E[X] mean · the second form (E[X²] − μ²) is the computational shortcut that avoids the centring step. does: expected squared deviation from the mean. Square root = standard deviation, in the same units as X.
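The two forms of the variance can be checked against each other. The sketch below assumes a hypothetical two-outcome trade (win $200 with probability 0.40, lose $100 with probability 0.60).

```python
from math import sqrt

# Hypothetical trade outcomes as (P&L in dollars, probability) pairs.
outcomes = [(200.0, 0.40), (-100.0, 0.60)]

mu = sum(x * p for x, p in outcomes)

# Definition: expected squared deviation from the mean.
var_centred = sum(p * (x - mu) ** 2 for x, p in outcomes)

# Shortcut: E[X^2] - (E[X])^2.
var_shortcut = sum(p * x**2 for x, p in outcomes) - mu**2

std = sqrt(var_centred)  # back in dollars, the same units as X
print(round(var_centred, 6), round(var_shortcut, 6), round(std, 2))  # 21600.0 21600.0 146.97
```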

Higher Moments

Moment | Formula | Meaning
1st | E[X] | Location (mean)
2nd | E[(X-μ)²] | Spread (variance)
3rd | E[(X-μ)³]/σ³ | Asymmetry (skewness)
4th | E[(X-μ)⁴]/σ⁴ | Tail weight (kurtosis)

Portfolio Variance

Var(Rp) = Σᵢ Σⱼ wᵢwⱼCov(Rᵢ,Rⱼ)

For 2 assets:
Var(Rp) = w₁²σ₁² + w₂²σ₂² + 2w₁w₂ρ₁₂σ₁σ₂

where: Rp portfolio return · wᵢ weight on asset i · Cov(Rᵢ,Rⱼ) covariance between assets i and j · σᵢ standard deviation of asset i · ρ₁₂ correlation between assets 1 and 2. does: generalizes asset variance to a portfolio. The cross term — twice the covariance — is the entire mechanism of diversification.
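The sketch below evaluates both the general double sum and the two-asset formula on the same inputs. The weights, volatilities, and correlation are hypothetical values chosen for illustration.

```python
def portfolio_variance(weights, cov):
    """Double sum over i, j of w_i * w_j * Cov(R_i, R_j)."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))

# Hypothetical two-asset portfolio.
w1, w2 = 0.6, 0.4      # weights
s1, s2 = 0.20, 0.10    # standard deviations
rho = 0.3              # correlation between the two assets

cov = [[s1 * s1, rho * s1 * s2],
       [rho * s1 * s2, s2 * s2]]

var_general = portfolio_variance([w1, w2], cov)
var_two_asset = w1**2 * s1**2 + w2**2 * s2**2 + 2 * w1 * w2 * rho * s1 * s2

print(round(var_general, 5), round(var_two_asset**0.5, 4))  # 0.01888 0.1374
```

The portfolio volatility of about 13.7% is below the weighted average of the individual volatilities (16%), which is the diversification effect of a correlation below 1.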

Law of Large Numbers

Statement

As sample size increases, sample mean converges to population mean:

lim (n→∞) x̄ₙ = μ  (almost surely)

where: x̄ₙ mean of n samples · μ true population mean · "almost surely" means convergence happens with probability 1. does: guarantees that a long enough track record reveals true performance, provided the trades are i.i.d. draws from a distribution with a finite mean (in practice, a stable strategy in a stable regime), but says nothing about how long is "enough." In practice, you need many trades before the observed mean is a reliable estimate of the true mean.

Trading Implication

Your observed win rate will converge to your true win rate as trade count increases. Short-term results are noisy.
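A small simulation makes the convergence visible. This is a sketch under stated assumptions: trades are independent, the true win probability is a hypothetical 55%, and the seed is fixed only so the run is repeatable.

```python
import random

def running_win_rate(true_win_prob, n_trades, seed=42):
    """Observed win rate after each trade, for independent simulated trades."""
    rng = random.Random(seed)
    wins = 0
    rates = []
    for i in range(1, n_trades + 1):
        wins += rng.random() < true_win_prob  # True counts as 1
        rates.append(wins / i)
    return rates

rates = running_win_rate(true_win_prob=0.55, n_trades=100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(rates[n - 1], 4))
```

The early values can sit far from 0.55, while the later ones settle close to it.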

Minimum Sample Size

n ≥ (z × σ / E)²

z = z-score for confidence level (1.96 for 95%)
σ = standard deviation
E = margin of error

For win rate estimation:
n ≥ z² × p(1-p) / E²

where: n minimum required sample size · z standard-normal quantile for chosen confidence (1.96 for 95%, 2.576 for 99%) · σ estimated standard deviation · E half-width of the desired confidence interval · p estimated proportion (use 0.5 for the most conservative bound). does: computes the minimum number of trades you need to estimate true win rate or mean return within an error band E at confidence level z. The standard answer to "is my track record long enough yet?"
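Both formulas are easy to wrap in helper functions. The names below are hypothetical, and the calculation relies on the same normal approximation as the formulas above.

```python
from math import ceil
from statistics import NormalDist

def z_score(confidence):
    """Two-sided standard-normal quantile, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def min_trades_for_mean(sigma, margin, confidence=0.95):
    """n >= (z * sigma / E)^2, rounded up to a whole trade."""
    return ceil((z_score(confidence) * sigma / margin) ** 2)

def min_trades_for_win_rate(margin, confidence=0.95, p=0.5):
    """n >= z^2 * p(1-p) / E^2; p = 0.5 gives the most conservative bound."""
    return ceil(z_score(confidence) ** 2 * p * (1 - p) / margin**2)

print(min_trades_for_win_rate(0.05))  # 385
print(min_trades_for_win_rate(0.02))  # 2401
```

Shrinking the margin of error from 5 to 2 percentage points raises the requirement from 385 to 2401 trades, because n scales with 1/E².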

Central Limit Theorem

Statement

Sample mean of i.i.d. random variables converges to normal distribution:

√n(x̄ₙ - μ) → N(0, σ²)  (in distribution)

where: x̄ₙ sample mean of n i.i.d. observations · μ true mean · σ² true variance · N(0, σ²) normal distribution with mean 0, variance σ² · "→ in distribution" means the CDF converges as n→∞. does: says the scaled, centred sample mean has approximately a normal distribution for large n, regardless of the original distribution's shape (as long as variance is finite). The reason confidence intervals work for non-normal returns.

Trading Application

Even if individual trade returns are not normal, the average return over many trades approaches normality. This enables:

  - Confidence intervals for strategy performance
  - Statistical significance testing
  - VaR calculations
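The theorem can be checked by simulation. The sketch below draws sample means from an exponential distribution, which is strongly skewed. That distribution, the sample size of 50, and the seed are illustrative assumptions. It then compares the spread of the means with σ/√n and measures how often a mean falls within 1.96 standard errors of μ.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

rng = random.Random(1)
n, n_samples = 50, 5_000
mu = sigma = 1.0  # an exponential with rate 1 has mean 1 and std 1

# Each entry is the mean of n i.i.d. draws from a skewed distribution.
sample_means = [fmean([rng.expovariate(1.0) for _ in range(n)])
                for _ in range(n_samples)]

std_error = sigma / sqrt(n)  # CLT prediction for the std of the sample mean
coverage = sum(abs(m - mu) <= 1.96 * std_error for m in sample_means) / n_samples

print(round(pstdev(sample_means), 4), round(std_error, 4), round(coverage, 3))
```

If the normal approximation holds, the first two numbers should be close and the coverage should be near 0.95.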

Caveats in Finance

CLT assumptions often violated in markets:

  - Returns not i.i.d. (volatility clustering)
  - Infinite variance possible (power laws)
  - Structural breaks (regime changes)

Markov Chains

Definition

Stochastic process where future depends only on current state:

P(X_{t+1} | X_t, X_{t-1}, ..., X_0) = P(X_{t+1} | X_t)

where: X_t state at time t · the left side conditions on the entire history; the right side conditions only on the current state. does: formalizes "memorylessness" — given today's state, tomorrow is independent of how you got here. Foundation for regime models, hidden Markov models, and discrete dynamic-programming policies.
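A minimal sketch of a two-state regime chain. The states ("calm", "volatile") and the transition probabilities are hypothetical. Row i of `P` holds the probabilities of moving from state i to each state at the next step.

```python
# Hypothetical transition matrix: state 0 = calm, state 1 = volatile.
P = [[0.9, 0.1],   # from calm:     stay calm 90%, turn volatile 10%
     [0.3, 0.7]]   # from volatile: turn calm 30%, stay volatile 70%

def step(dist, P):
    """One step of the chain: next distribution depends only on the current one."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]            # start in the calm state with certainty
one_step = step(dist, P)

for _ in range(200):         # iterate towards the long-run distribution
    dist = step(dist, P)

print(one_step)                      # [0.9, 0.1]
print([round(x, 4) for x in dist])   # [0.75, 0.25]
```

The long-run distribution does not depend on the starting state, which is what lets such a chain describe the typical share of time spent in each regime.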

Monte Carlo Methods

Applications

Application | Purpose
Strategy testing | Assess robustness to randomness
Risk analysis | Estimate tail events
Portfolio planning | Range of possible outcomes
Position sizing | Optimize Kelly criterion
Options pricing | Price complex derivatives
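As one example of the strategy-testing and risk-analysis uses, the sketch below resamples a list of trade results with replacement and records the maximum drawdown of each resampled path. The trade P&L values are made up, and bootstrapping individual trades assumes they are independent, which real trade sequences may not be.

```python
import random

def max_drawdown(pnls):
    """Largest peak-to-trough drop of the cumulative P&L curve."""
    equity = peak = worst = 0.0
    for pnl in pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst

def bootstrap_drawdowns(trade_pnls, n_paths=2_000, seed=0):
    """Sorted max drawdowns of paths resampled with replacement."""
    rng = random.Random(seed)
    n = len(trade_pnls)
    return sorted(max_drawdown(rng.choices(trade_pnls, k=n))
                  for _ in range(n_paths))

# Hypothetical trade history (P&L per trade, in dollars).
trades = [120, -80, 45, -60, 200, -90, 30, -40, 75, -110, 160, -50]

drawdowns = bootstrap_drawdowns(trades)
median_dd = drawdowns[len(drawdowns) // 2]
p95_dd = drawdowns[int(0.95 * len(drawdowns))]
print(median_dd, p95_dd)
```

Comparing the drawdown actually experienced with `p95_dd` gives a rough sense of how much worse the same set of trades could have gone in a different order or mix.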

Random Walk Theory

Definition

Price changes are independent and identically distributed:

P_t = P_{t-1} + ε_t
where ε_t ~ i.i.d. with E[ε] = 0

where: P_t price at time t · ε_t random innovation · i.i.d. = independent and identically distributed · zero-mean innovation means no systematic drift in this simplest form. does: the textbook efficient-market model — tomorrow's price is today's plus pure noise. The benchmark every strategy is implicitly trying to beat. Real markets exhibit some predictability (momentum, mean-reversion, volatility clustering) but the random-walk null is hard to reject decisively.
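The model is straightforward to simulate. The sketch below assumes Gaussian innovations with a hypothetical standard deviation of 0.1 and a fixed seed, then measures the lag-1 autocorrelation of the price changes, which should be close to zero under the random-walk null.

```python
import random

def simulate_random_walk(p0, n_steps, step_std=0.1, seed=0):
    """P_t = P_{t-1} + eps_t with i.i.d. zero-mean Gaussian eps_t."""
    rng = random.Random(seed)
    prices = [p0]
    for _ in range(n_steps):
        prices.append(prices[-1] + rng.gauss(0.0, step_std))
    return prices

def lag1_autocorr(xs):
    """Sample correlation between consecutive elements of xs."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

prices = simulate_random_walk(p0=100.0, n_steps=10_000)
changes = [b - a for a, b in zip(prices, prices[1:])]
print(round(lag1_autocorr(changes), 4))
```

Running the same autocorrelation check on real price changes is one simple way to look for the momentum or mean-reversion effects that depart from this null. Note that this additive form does not stop simulated prices from going negative.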

Implications

If prices follow a pure random walk, then:

  1. Past prices don't predict future prices
  2. Technical analysis is ineffective: apparent patterns are illusory
  3. Active management is futile: no one can consistently beat the market

Evidence Against Pure Random Walk

Markets exhibit:

  - Momentum (short to medium term)
  - Mean reversion (long term)
  - Volatility clustering
  - Fat tails
  - Calendar anomalies

Reality: Markets are not perfectly efficient, but are difficult to beat consistently.

Information Theory

Entropy

H(X) = -Σ p(x) × log₂(p(x))

Measures uncertainty/randomness
Higher entropy = more uncertainty

where: H(X) Shannon entropy of X (bits) · p(x) probability of outcome x · log₂ log base 2 (bits); natural log gives nats; log10 gives hartleys. does: measures average information content in bits. Maximum for a uniform distribution (no predictability), zero for a deterministic one. Used in decision-tree splits, feature selection, and quantifying "how predictable is this market state?"
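A direct implementation of the formula, with zero-probability outcomes skipped because they contribute nothing to the sum. The function name and example distributions are illustrative.

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # 1.0  (fair coin: maximum for two outcomes)
print(entropy_bits([0.25] * 4))    # 2.0  (uniform over four outcomes)
print(round(entropy_bits([0.9, 0.1]), 3))  # 0.469 (a biased, more predictable coin)
```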

Mutual Information

I(X;Y) = Σ Σ p(x,y) × log(p(x,y) / (p(x)p(y)))

Measures information shared between X and Y
Better than correlation for non-linear relationships

where: I(X;Y) mutual information between X and Y · p(x,y) joint probability · p(x), p(y) marginal probabilities · zero when X and Y are independent (joint = product of marginals). does: measures any statistical dependence, including nonlinear ones that correlation misses. Essential for screening features where the relationship isn't linear (e.g., volatility regimes, threshold effects).
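The sketch below computes mutual information in bits (log base 2, an assumption since the formula above leaves the base open) from a joint probability table. The example is a deliberately nonlinear relationship: X is uniform on {-1, 0, 1} and Y = X², so the covariance is zero but Y is fully determined by X.

```python
from collections import defaultdict
from math import log2

def mutual_information_bits(joint):
    """I(X;Y) in bits from a dict mapping (x, y) to joint probability."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p   # marginal of X
        py[y] += p   # marginal of Y
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# X uniform on {-1, 0, 1}, Y = X**2.
joint = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}

mi = mutual_information_bits(joint)
e_x = sum(p * x for (x, _), p in joint.items())
e_y = sum(p * y for (_, y), p in joint.items())
cov = sum(p * x * y for (x, y), p in joint.items()) - e_x * e_y

print(round(mi, 4), cov)
```

Here the covariance is zero while the mutual information is about 0.92 bits, which is the kind of dependence a correlation screen would miss.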

Key Formulas Reference

Bayes: P(A|B) = P(B|A) × P(A) / P(B)
Expected Value: E[X] = Σ xᵢ × p(xᵢ)
Variance: Var(X) = E[X²] - (E[X])²
Portfolio Variance: w'Σw
Binomial: P(X=k) = C(n,k)pᵏ(1-p)ⁿ⁻ᵏ
EV per trade: WR × AvgWin − (1−WR) × AvgLoss (AvgLoss as a positive magnitude; use + instead if losses are recorded as negative numbers)

Next Steps