Probability Theory for Trading¶

Difficulty: beginner

Fundamentals¶

Basic Definitions¶

Sample Space (Ω) — Set of all possible outcomes

Coin flip: Ω = {Heads, Tails}
Stock price tomorrow: Ω = [0, ∞)

Event (E) — Subset of sample space

E = {Stock price > $100 tomorrow}

Probability (P) — Measure of likelihood

0 ≤ P(E) ≤ 1
P(Ω) = 1
P(∅) = 0

where: P(E) probability of event E · Ω sample space (all outcomes) · ∅ empty set (no outcomes). does: states the basic properties of a probability measure. Kolmogorov's three axioms are non-negativity (P(E) ≥ 0), normalization (P(Ω) = 1), and countable additivity for mutually exclusive events; the bounds P(E) ≤ 1 and P(∅) = 0 follow from them.

Probability Rules¶

Addition Rule:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
If mutually exclusive: P(A ∪ B) = P(A) + P(B)

where: A ∪ B union (A or B or both) · A ∩ B intersection (both A and B) · "mutually exclusive" means A and B cannot occur together. does: computes probability of "either event" — subtracting the overlap prevents double-counting outcomes shared by both events.

Multiplication Rule:

P(A ∩ B) = P(A) × P(B|A)
If independent: P(A ∩ B) = P(A) × P(B)

where: P(B|A) probability of B given A occurred · independence means A's occurrence carries no information about B. does: probability of two events both occurring. Under independence the conditional P(B|A) collapses to the marginal P(B). Many asset-return models assume independence for tractability and thereby discard whatever dependence is actually present (e.g., volatility clustering).
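A minimal sketch of both rules, checked by brute-force enumeration. The sample space (two fair dice) and the events A and B are hypothetical, chosen only because every outcome can be listed; exact fractions avoid rounding noise.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy sample space: two fair dice, 36 equally likely outcomes.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

A = {o for o in omega if o[0] == 6}           # first die shows 6
B = {o for o in omega if o[0] + o[1] >= 10}   # total is at least 10

p_union = prob(A | B)                 # P(A ∪ B), counted directly
p_inter = prob(A & B)                 # P(A ∩ B), counted directly
p_B_given_A = p_inter / prob(A)       # P(B|A)

# Addition rule: subtracting the overlap avoids double counting.
print(p_union, "==", prob(A) + prob(B) - p_inter)
# Multiplication rule: P(A ∩ B) = P(A) × P(B|A).
print(p_inter, "==", prob(A) * p_B_given_A)
# A and B are NOT independent here, so P(A) × P(B) gives the wrong joint.
print(p_inter, "!=", prob(A) * prob(B))
```

In this example, knowing that the first die is a 6 raises the probability of a high total, which is exactly the information the independence shortcut would throw away.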

Bayes' Theorem:

P(A|B) = P(B|A) × P(A) / P(B)

Posterior = Likelihood × Prior / Evidence

where: P(A) prior belief about A · P(B|A) likelihood of observing B if A is true · P(B) total probability of B (normalizing constant) · P(A|B) posterior, the updated belief after seeing B. does: the mechanical rule for updating beliefs with new evidence. One of the most useful equations in quant finance for sequential decision-making, because today's posterior becomes tomorrow's prior.
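A worked sketch of a Bayesian update. The scenario and all numbers are hypothetical: A = "tomorrow is an up day", B = "the signal fires", with assumed hit rates for the signal. The evidence P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A)."""
    # Evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = p_b_given_a * prior + p_b_given_not_a * (1 - prior)
    return p_b_given_a * prior / evidence

# Hypothetical inputs: 50% base rate of up days; the signal fires on
# 70% of up days and on 40% of down days.
prior = 0.50
after_one_signal = bayes_posterior(prior, 0.70, 0.40)

# Sequential use: the posterior becomes the prior for the next observation
# (this treats the two signals as conditionally independent, an assumption).
after_two_signals = bayes_posterior(after_one_signal, 0.70, 0.40)

print(f"prior {prior:.3f} -> {after_one_signal:.3f} -> {after_two_signals:.3f}")
```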

Conditional Probability¶

Definition¶

P(A|B) = P(A ∩ B) / P(B)

Probability of A given that B has occurred

where: P(A|B) conditional probability of A given B · P(A ∩ B) joint probability of both A and B · P(B) marginal probability of B (must be > 0). does: restricts the sample space to outcomes consistent with B, then asks how much of that restricted space is also in A. Backbone of Bayesian inference.

Random Variables¶

Discrete Random Variables¶

Probability Mass Function (PMF):

p(x) = P(X = x)

Example: number of winning trades in N attempts

where: X discrete random variable · x specific value · p(x) probability X takes that exact value · for discrete RVs, Σ p(x) = 1 over all possible values. does: assigns a probability to every possible outcome of a discrete RV. The discrete analogue of a probability density function.

Binomial Distribution:

P(X = k) = C(n,k) × pᵏ × (1-p)ⁿ⁻ᵏ

n = number of trials
k = number of successes
p = probability of success

Mean: np
Variance: np(1-p)

where: C(n,k) binomial coefficient (n choose k) = n!/(k!(n−k)!) · p^k probability of k specific successes · (1−p)^(n−k) probability of n−k specific failures. does: probability of exactly k wins in n independent trades, each with win probability p. The natural distribution for "in N trades, how often will I win X times?"
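The formula can be evaluated directly with the standard library. A short sketch, assuming a hypothetical strategy with 20 independent trades and a 55% win probability; it also checks the stated mean and variance numerically.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k): exactly k wins in n independent trades with win probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.55                      # hypothetical trade count and win rate
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
p_at_most_8_wins = sum(pmf[:9])      # P(X <= 8), a losing-looking stretch

print(f"mean {mean:.3f} (np = {n * p:.3f}), variance {var:.3f} (np(1-p) = {n * p * (1 - p):.3f})")
print(f"P(at most 8 wins in {n}) = {p_at_most_8_wins:.4f}")
```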

Continuous Random Variables¶

Probability Density Function (PDF):

P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx

f(x) ≥ 0
∫₋∞^∞ f(x)dx = 1

where: f(x) density at x (not a probability — densities can exceed 1) · ∫ₐᵇ f(x)dx area under the density between a and b · the two constraints are non-negativity and total area = 1. does: continuous analogue of the PMF — probability is the integral of the density over a range, never the value at a single point. Used for any continuous return / price / vol model.

Cumulative Distribution Function (CDF):

F(x) = P(X ≤ x) = ∫₋∞ˣ f(t)dt

where: F(x) cumulative distribution function · P(X ≤ x) probability X is at or below x · the integral runs from −∞ to x, accumulating density. does: maps any value x to the probability of being at or below it. Inverse CDF gives quantiles — the function behind VaR, percentile stops, and confidence-interval cut-offs.
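A brief sketch of the PDF, CDF, and inverse CDF using `statistics.NormalDist` from the Python standard library. The daily-return parameters are hypothetical, and the normal model itself is an assumption made only for illustration.

```python
from statistics import NormalDist

# Hypothetical daily-return model: mean 0.05%, standard deviation 1.2%.
daily = NormalDist(mu=0.0005, sigma=0.012)

# A density is not a probability: at the mean it is far above 1 here.
density_at_mean = daily.pdf(0.0005)

# CDF: probability that the daily return is at or below -2%.
p_loss_worse_than_2pct = daily.cdf(-0.02)

# Inverse CDF (quantile): the return exceeded on 95% of days.
q05 = daily.inv_cdf(0.05)
var_95 = -q05                        # 95% one-day VaR, quoted as a positive loss

print(f"density at mean: {density_at_mean:.1f}")
print(f"P(return <= -2%): {p_loss_worse_than_2pct:.4f}")
print(f"5% quantile: {q05:.4%}, 95% VaR: {var_95:.4%}")
```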

Common Distributions in Trading¶

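Normal: asset returns are often modelled as normally distributed.
Log-normal: asset prices are often modelled as log-normal, which keeps them from going negative.

Fat tails: real return distributions have fatter tails than the normal. Under a normal model, a move beyond 4 standard deviations in either direction has a probability of about 0.006% per observation, yet such "100-year" events happen far more often in real markets.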
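The normal-model tail probability quoted above can be reproduced in a few lines. The conversion to years assumes daily observations and 252 trading days per year, which is a convention and not something the text specifies.

```python
from statistics import NormalDist

std_normal = NormalDist()            # mean 0, standard deviation 1

p_one_sided = std_normal.cdf(-4.0)   # P(Z <= -4): a 4-sigma loss
p_two_sided = 2 * p_one_sided        # P(|Z| >= 4): a 4-sigma move either way

# Average waiting time implied by the normal model, in trading years
# (assumes one observation per day and 252 trading days per year).
years_between_moves = 1 / p_two_sided / 252
years_between_losses = 1 / p_one_sided / 252

print(f"two-sided: {p_two_sided:.4%} per day, about one every {years_between_moves:.0f} years")
print(f"one-sided: {p_one_sided:.4%} per day, about one every {years_between_losses:.0f} years")
```

Comparing these implied waiting times with how often such moves actually appear in a return history is a quick way to see how poorly the normal model fits the tails.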

Expected Value¶

Definition¶

E[X] = Σ xᵢ × p(xᵢ)          (discrete)
E[X] = ∫ x × f(x)dx          (continuous)

where: E[X] expected value (mean) of random variable X · xᵢ discrete outcome · p(xᵢ) probability of that outcome · f(x) probability density function (continuous case). does: probability-weighted average of all possible values. The single most useful summary of a distribution and the foundation of every decision-under-uncertainty framework.
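A small sketch of the discrete formula applied to a single trade. The payoffs and probabilities are hypothetical; losses are entered as negative outcomes so the sum is a plain probability-weighted average.

```python
# Hypothetical trade: win $200 with probability 0.45, lose $100 with probability 0.55.
outcomes = [(200.0, 0.45), (-100.0, 0.55)]   # (outcome x_i, probability p(x_i))

# Probabilities of a discrete random variable must sum to 1.
total_probability = sum(p for _, p in outcomes)

# E[X] = sum of x_i * p(x_i)
expected_value = sum(x * p for x, p in outcomes)

print(f"expected value per trade: ${expected_value:.2f}")
```

Note that this hypothetical strategy loses more often than it wins and still has a positive expectation, because the average win is twice the average loss.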

Variance and Moments¶

Variance¶

Var(X) = E[(X - μ)²] = E[X²] - (E[X])²

Measures spread of distribution

where: Var(X) variance · μ = E[X] mean · the second form (E[X²] − μ²) is the computational shortcut that avoids the centring step. does: expected squared deviation from the mean. Square root = standard deviation, in the same units as X.
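Both forms of the variance can be checked against each other on a small sample. The return series is made up for illustration, and the population form (dividing by n) is used so that it matches the E[·] definitions above.

```python
from statistics import pvariance

returns = [0.012, -0.008, 0.005, -0.015, 0.020, 0.001]   # hypothetical daily returns
n = len(returns)
mu = sum(returns) / n

# Definition: expected squared deviation from the mean.
var_definition = sum((r - mu) ** 2 for r in returns) / n

# Shortcut: E[X^2] - (E[X])^2, no centring pass needed.
var_shortcut = sum(r * r for r in returns) / n - mu**2

std_dev = var_definition ** 0.5      # back in the same units as the returns

print(f"variance {var_definition:.8f} / {var_shortcut:.8f}, std dev {std_dev:.4%}")
print(f"statistics.pvariance agrees: {pvariance(returns):.8f}")
```

The shortcut subtracts two nearly equal numbers when the mean is large relative to the spread, so the centred form is usually the safer choice in floating point.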

Higher Moments¶

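Moment	Formula	Meaning
1st (raw)	E[X]	Location (mean)
2nd (central)	E[(X-μ)²]	Spread (variance)
3rd (standardized)	E[(X-μ)³]/σ³	Asymmetry (skewness; 0 for any symmetric distribution)
4th (standardized)	E[(X-μ)⁴]/σ⁴	Tail weight (kurtosis; 3 for a normal distribution)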
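The third and fourth rows of the table share one formula, so a single helper covers both. A minimal sketch using population moments; the two samples are hypothetical and chosen so the results are easy to verify by hand.

```python
def standardized_moment(xs, k):
    """k-th standardized moment: E[(X - mu)^k] / sigma^k, using population moments."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return sum((x - mu) ** k for x in xs) / n / sigma**k

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]          # hypothetical symmetric P&L
one_big_win = [-1.0, -1.0, -1.0, -1.0, 4.0]      # hypothetical: small losses, one large win

print("skewness:", standardized_moment(symmetric, 3), standardized_moment(one_big_win, 3))
print("kurtosis:", standardized_moment(symmetric, 4), standardized_moment(one_big_win, 4))
```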

Portfolio Variance¶

Var(Rp) = Σᵢ Σⱼ wᵢwⱼCov(Rᵢ,Rⱼ)

For 2 assets:
Var(Rp) = w₁²σ₁² + w₂²σ₂² + 2w₁w₂ρ₁₂σ₁σ₂

where: Rp portfolio return · wᵢ weight on asset i · Cov(Rᵢ,Rⱼ) covariance between assets i and j · σᵢ standard deviation of asset i · ρ₁₂ correlation between assets 1 and 2. does: generalizes single-asset variance to a portfolio. The cross term, twice the weighted covariance, is where diversification comes from: for long-only weights, whenever ρ₁₂ < 1 the portfolio's standard deviation is below the weighted average of the individual standard deviations.
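A sketch that evaluates the double sum and the two-asset formula side by side. Weights, volatilities, and the correlation are hypothetical inputs.

```python
def portfolio_variance(weights, cov):
    """Var(Rp) = sum over i, j of w_i * w_j * Cov(R_i, R_j)."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j] for i in range(n) for j in range(n))

# Hypothetical two-asset portfolio.
w1, w2 = 0.6, 0.4
s1, s2 = 0.20, 0.10                  # annualized standard deviations
rho = 0.3                            # correlation between the two assets

cov = [[s1 * s1, rho * s1 * s2],
       [rho * s1 * s2, s2 * s2]]

var_general = portfolio_variance([w1, w2], cov)
var_two_asset = w1**2 * s1**2 + w2**2 * s2**2 + 2 * w1 * w2 * rho * s1 * s2

portfolio_vol = var_general ** 0.5
weighted_avg_vol = w1 * s1 + w2 * s2  # what volatility would be with rho = 1

print(f"portfolio vol {portfolio_vol:.4f} vs weighted-average vol {weighted_avg_vol:.4f}")
```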

Law of Large Numbers¶

Statement¶

As the sample size increases, the sample mean of i.i.d. observations with a finite mean converges to the population mean:

lim (n→∞) x̄ₙ = μ  (almost surely)

where: x̄ₙ mean of n samples · μ true population mean · "almost surely" means convergence happens with probability 1. does: guarantees that, for i.i.d. trades with a finite mean, a long enough track record reveals true performance, but says nothing about how long is "enough." In practice, you need many trades before the observed mean is a reliable estimate of the true mean.

Trading Implication¶

Your observed win rate will converge to your true win rate as trade count increases. Short-term results are noisy.
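A quick simulation makes the convergence visible. This is a sketch under idealized assumptions: trades are independent, the "true" win rate of 55% is hypothetical and constant, and the seed is fixed only so the run is repeatable.

```python
import random

rng = random.Random(42)              # fixed seed for a repeatable run
TRUE_WIN_RATE = 0.55                 # hypothetical true edge
CHECKPOINTS = (10, 100, 1_000, 10_000, 100_000)

wins = 0
running = {}                         # trade count -> observed win rate so far
for n in range(1, CHECKPOINTS[-1] + 1):
    wins += rng.random() < TRUE_WIN_RATE   # True counts as 1
    if n in CHECKPOINTS:
        running[n] = wins / n

for n, rate in running.items():
    print(f"{n:>7} trades: observed win rate {rate:.4f}")
```

The early checkpoints can land well away from 55%; only the later ones are pinned close to it.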

Minimum Sample Size¶

n ≥ (z × σ / E)²

z = z-score for confidence level (1.96 for 95%)
σ = standard deviation
E = margin of error

For win rate estimation:
n ≥ z² × p(1-p) / E²

where: n minimum required sample size · z standard-normal quantile for chosen confidence (1.96 for 95%, 2.576 for 99%) · σ estimated standard deviation · E half-width of the desired confidence interval · p estimated proportion (use 0.5 for the most conservative bound). does: computes the minimum number of trades you need to estimate true win rate or mean return within an error band E at the confidence level implied by z; round n up to the next whole trade. Both formulas rely on a normal approximation and on independent trades. The standard answer to "is my track record long enough yet?"
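Both formulas translate directly into code. A minimal sketch; the function names are made up for this example, and z is derived from the confidence level with `statistics.NormalDist` instead of being hard-coded.

```python
from math import ceil
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided standard-normal quantile, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def min_trades_for_mean(sigma, margin, confidence=0.95):
    """n >= (z * sigma / E)^2, rounded up."""
    z = z_for_confidence(confidence)
    return ceil((z * sigma / margin) ** 2)

def min_trades_for_win_rate(margin, confidence=0.95, p=0.5):
    """n >= z^2 * p(1-p) / E^2, rounded up; p = 0.5 is the conservative default."""
    z = z_for_confidence(confidence)
    return ceil(z * z * p * (1 - p) / margin**2)

# Win rate within +/- 5 percentage points at 95% confidence.
print(min_trades_for_win_rate(0.05))          # 385
# Tightening the band to +/- 2 points multiplies the requirement by about 6.
print(min_trades_for_win_rate(0.02))
```

Because n grows with 1/E², halving the error band quadruples the number of trades required.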

Central Limit Theorem¶

Statement¶

The centred and scaled sample mean of i.i.d. random variables with finite variance converges in distribution to a normal:

√n(x̄ₙ - μ) → N(0, σ²)  (in distribution)

where: x̄ₙ sample mean of n i.i.d. observations · μ true mean · σ² true variance · N(0, σ²) normal distribution with mean 0, variance σ² · "→ in distribution" means the CDF converges as n→∞. does: says the scaled, centred sample mean has approximately a normal distribution for large n, regardless of the original distribution's shape (as long as variance is finite). The reason confidence intervals work for non-normal returns.

Trading Application¶

Even if individual trade returns are not normal, the average return over many trades approaches normality. This enables:
- Confidence intervals for strategy performance
- Statistical significance testing
- VaR calculations
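The effect can be checked by simulation. In this sketch the per-trade payoff is deliberately skewed and far from normal (a hypothetical lose-1 / win-3 bet), trades are i.i.d. by construction, and the seed is fixed for repeatability. If the CLT approximation holds, about 95% of sample means should fall within 1.96 standard errors of the true mean.

```python
import random
from statistics import pstdev

rng = random.Random(7)

def trade():
    """Hypothetical skewed payoff: lose 1 unit 70% of the time, win 3 units 30%."""
    return 3.0 if rng.random() < 0.3 else -1.0

MU = 0.3 * 3.0 + 0.7 * -1.0                        # true mean per trade
SIGMA = (0.3 * 9.0 + 0.7 * 1.0 - MU**2) ** 0.5     # true std dev per trade
N_TRADES, N_SAMPLES = 200, 2000

# Each sample mean is the average payoff of one block of N_TRADES trades.
sample_means = [
    sum(trade() for _ in range(N_TRADES)) / N_TRADES for _ in range(N_SAMPLES)
]

standard_error = SIGMA / N_TRADES**0.5             # CLT prediction for the spread
coverage = sum(abs(m - MU) <= 1.96 * standard_error for m in sample_means) / N_SAMPLES

print(f"predicted std of the mean {standard_error:.4f}, simulated {pstdev(sample_means):.4f}")
print(f"share of sample means within 1.96 standard errors: {coverage:.3f}")
```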

Caveats in Finance¶

CLT assumptions often violated in markets:
- Returns not i.i.d. (volatility clustering)
- Infinite variance possible (power laws)
- Structural breaks (regime changes)

Markov Chains¶

Definition¶

Stochastic process where future depends only on current state:

P(X_{t+1} | X_t, X_{t-1}, ..., X_0) = P(X_{t+1} | X_t)

where: X_t state at time t · the left side conditions on the entire history; the right side conditions only on the current state. does: formalizes "memorylessness" — given today's state, tomorrow is independent of how you got here. Foundation for regime models, hidden Markov models, and discrete dynamic-programming policies.
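A two-state regime chain is the smallest useful illustration. The states and transition probabilities below are hypothetical; the sketch propagates a probability distribution one step at a time and iterates until it settles at the long-run (stationary) mix.

```python
STATES = ("calm", "volatile")

# Hypothetical transition probabilities P[current][next]; each row sums to 1.
P = {
    "calm":     {"calm": 0.95, "volatile": 0.05},
    "volatile": {"calm": 0.20, "volatile": 0.80},
}

def step(dist):
    """Advance a distribution over states by one period using only the current state."""
    return {nxt: sum(dist[cur] * P[cur][nxt] for cur in STATES) for nxt in STATES}

# One-step forecast given that today is volatile.
today = {"calm": 0.0, "volatile": 1.0}
tomorrow = step(today)

# Repeated stepping converges to the stationary distribution for this chain.
dist = today
for _ in range(200):
    dist = step(dist)

print("tomorrow:", tomorrow)
print("long run:", {s: round(p, 4) for s, p in dist.items()})
```

For a two-state chain the long-run share of the calm state is 0.20 / (0.05 + 0.20) = 0.8, which the iteration reproduces regardless of the starting state.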

Monte Carlo Methods¶

Applications¶

Application	Purpose
Strategy testing	Assess robustness to randomness
Risk analysis	Estimate tail events
Portfolio planning	Range of possible outcomes
Position sizing	Evaluate Kelly-criterion sizing under uncertainty
Options pricing	Price complex derivatives
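As one illustration of the "range of possible outcomes" use case, the sketch below resamples a small set of per-trade returns with replacement to build many alternative equity paths. The trade returns are hypothetical, and resampling assumes trades are independent and interchangeable, which real trades may not be.

```python
import random

def simulate_final_equity(trade_returns, n_trades, n_paths, seed=0):
    """Bootstrap equity paths; return the sorted final equity of each path (start = 1.0)."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        equity = 1.0
        for _ in range(n_trades):
            equity *= 1 + rng.choice(trade_returns)   # draw one past trade at random
        finals.append(equity)
    return sorted(finals)

# Hypothetical per-trade returns from a track record.
history = [-0.02, -0.01, -0.01, 0.005, 0.015, 0.03]

finals = simulate_final_equity(history, n_trades=100, n_paths=2000, seed=0)
p05 = finals[int(0.05 * len(finals))]
median = finals[len(finals) // 2]
p95 = finals[int(0.95 * len(finals))]

print(f"final equity after 100 trades: 5% {p05:.3f}, median {median:.3f}, 95% {p95:.3f}")
```

The spread between the 5th and 95th percentiles is the point of the exercise: the same edge can produce very different equity curves purely through the ordering and luck of the draws.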

Random Walk Theory¶

Definition¶

Price changes are independent and identically distributed:

P_t = P_{t-1} + ε_t
where ε_t ~ i.i.d. with E[ε] = 0

where: P_t price at time t · ε_t random innovation · i.i.d. = independent and identically distributed · zero-mean innovation means no systematic drift in this simplest form. does: the textbook efficient-market model — tomorrow's price is today's plus pure noise. The benchmark every strategy is implicitly trying to beat. Real markets exhibit some predictability (momentum, mean-reversion, volatility clustering) but the random-walk null is hard to reject decisively.
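A simulated random walk shows what the null model looks like in data. This sketch assumes Gaussian innovations, an arbitrary starting price of 100, and a fixed seed; the lag-1 autocorrelation helper is written out by hand to stay within the standard library.

```python
import random

rng = random.Random(1)
N = 5000

# i.i.d. zero-mean innovations, then P_t = P_{t-1} + eps_t.
# (The additive form is in arbitrary units and does not prevent negative prices.)
eps = [rng.gauss(0.0, 1.0) for _ in range(N)]
prices = [100.0]
for e in eps:
    prices.append(prices[-1] + e)

def lag1_autocorr(xs):
    """Sample autocorrelation between consecutive values of xs."""
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[i] - m) * (xs[i - 1] - m) for i in range(1, n))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

changes = [b - a for a, b in zip(prices, prices[1:])]

print(f"lag-1 autocorrelation of price changes: {lag1_autocorr(changes):+.3f}")
print(f"lag-1 autocorrelation of price levels:  {lag1_autocorr(prices):+.3f}")
```

Price levels look strongly persistent while price changes look like noise, which is why predictability tests are run on changes or returns, not on levels.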

Implications¶

Under a pure random walk, the following would hold:
Past prices don't predict future price changes
Technical analysis is ineffective (apparent patterns are illusory)
Active management is futile (no one can consistently beat the market)

Evidence Against Pure Random Walk¶

Markets exhibit:
- Momentum (short to medium term)
- Mean reversion (long term)
- Volatility clustering
- Fat tails
- Calendar anomalies

Reality: Markets are not perfectly efficient, but are difficult to beat consistently.

Information Theory¶

Entropy¶

H(X) = -Σ p(x) × log₂(p(x))

Measures uncertainty/randomness
Higher entropy = more uncertainty

where: H(X) Shannon entropy of X (bits) · p(x) probability of outcome x · log₂ log base 2 (bits); natural log gives nats; log10 gives hartleys. does: measures average information content in bits. Maximum for a uniform distribution (no predictability), zero for a deterministic one. Used in decision-tree splits, feature selection, and quantifying "how predictable is this market state?"
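A short sketch of the entropy formula in bits. Zero-probability outcomes are skipped, following the usual convention that 0 × log(0) counts as 0; the example distributions are hypothetical.

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy H(X) = -sum p(x) * log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]               # maximally uncertain over two outcomes
biased = [0.9, 0.1]                  # mostly predictable
certain = [1.0, 0.0]                 # deterministic
four_regimes = [0.25] * 4            # uniform over four hypothetical market states

print(entropy_bits(fair_coin))       # 1.0
print(entropy_bits(biased))
print(entropy_bits(certain))
print(entropy_bits(four_regimes))    # 2.0
```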

Mutual Information¶

I(X;Y) = Σ Σ p(x,y) × log(p(x,y) / (p(x)p(y)))

Measures information shared between X and Y
Better than correlation for non-linear relationships

where: I(X;Y) mutual information between X and Y · p(x,y) joint probability · p(x), p(y) marginal probabilities · zero when X and Y are independent (joint = product of marginals). does: measures any statistical dependence, including nonlinear ones that correlation misses. Essential for screening features where the relationship isn't linear (e.g., volatility regimes, threshold effects).
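The sketch below computes mutual information in bits from a joint probability table (rows index x, columns index y). The last example is a hypothetical nonlinear relationship, Y = X² with X uniform on {-1, 0, 1}, where the covariance is exactly zero but the dependence is complete.

```python
from math import log2

def mutual_information_bits(joint):
    """I(X;Y) from a joint probability table joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]              # marginal of X
    py = [sum(col) for col in zip(*joint)]        # marginal of Y
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0                                # 0 * log(0) is treated as 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]        # joint = product of marginals
identical = [[0.5, 0.0], [0.0, 0.5]]              # Y always equals X

# X in {-1, 0, 1} uniform, Y = X^2 in {0, 1}; columns are y = 0 and y = 1.
third = 1 / 3
y_equals_x_squared = [[0.0, third], [third, 0.0], [0.0, third]]

# Covariance for the nonlinear case: E[XY] - E[X]E[Y] with x values -1, 0, 1.
xs, ys = (-1, 0, 1), (0, 1)
e_xy = sum(x * y * y_equals_x_squared[i][j] for i, x in enumerate(xs) for j, y in enumerate(ys))
e_x = sum(x * sum(y_equals_x_squared[i]) for i, x in enumerate(xs))
e_y = sum(y * sum(col) for y, col in zip(ys, zip(*y_equals_x_squared)))
covariance = e_xy - e_x * e_y

print(mutual_information_bits(independent), mutual_information_bits(identical))
print(f"Y = X^2: covariance {covariance:.3f}, mutual information {mutual_information_bits(y_equals_x_squared):.3f} bits")
```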

Key Formulas Reference¶

Bayes: P(A|B) = P(B|A) × P(A) / P(B)
Expected Value: E[X] = Σ xᵢ × p(xᵢ)
Variance: Var(X) = E[X²] - (E[X])²
Portfolio Variance: w'Σw
Binomial: P(X=k) = C(n,k)pᵏ(1-p)ⁿ⁻ᵏ
EV per trade: WR × AvgWin - (1-WR) × AvgLoss (AvgLoss as a positive magnitude)

Next Steps¶

Time Series Analysis — Modeling sequential financial data
Regression Models — Predictive modeling
Statistics Basics — Foundational statistics