Low-Latency Architecture¶
Overview¶
Low-latency trading systems are designed to minimize the time between receiving market data and sending orders. Every microsecond matters in HFT and competitive execution. This document covers the architecture, hardware, and software techniques for building ultra-low-latency systems.
Difficulty: expert
Latency Budget¶
Total Latency = Market Data Receive + Processing + Order Send
Typical Budget for HFT:
- Market Data Receive: 1-5 μs
- Processing: 1-10 μs
- Order Send: 1-5 μs
- Total: 3-20 μs
Typical Budget for Statistical Arbitrage:
- Market Data Receive: 10-100 μs
- Processing: 100-1000 μs
- Order Send: 10-100 μs
- Total: 120-1200 μs
where:
- Market Data Receive: time from wire arrival to an in-memory parsed tick
- Processing: strategy and risk-check time on the hot path
- Order Send: time from decision to the order leaving the NIC
This tick-to-trade budget drives every engineering decision in a competitive HFT stack: kernel bypass, CPU pinning, FPGA acceleration. Because the stages add up, a microsecond saved counts the same wherever it is saved. Optimize the slowest component first, since it usually has the most room for improvement.
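The additive budget can be checked with a few lines of arithmetic. The sketch below sums the per-stage ranges from the HFT budget and reports which stage has the largest worst case. The dictionary layout and function names are illustrative assumptions, not part of any library.

```python
# Per-stage latency ranges in microseconds, copied from the HFT budget above.
HFT_BUDGET_US = {
    "market_data_receive": (1, 5),
    "processing": (1, 10),
    "order_send": (1, 5),
}

def total_budget(budget):
    """Return (best_case, worst_case) total latency in microseconds."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst

def slowest_stage(budget):
    """Return the stage name with the largest worst-case latency."""
    return max(budget, key=lambda stage: budget[stage][1])

best, worst = total_budget(HFT_BUDGET_US)
print(f"Total: {best}-{worst} us, optimize first: {slowest_stage(HFT_BUDGET_US)}")
```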
Hardware Architecture¶
Network Interface Cards (NICs)¶
Solarflare (Xilinx, now AMD):
- OpenOnload: Kernel bypass TCP/IP stack
- EF_VI: Direct NIC access
- Latency: ~0.5-1 μs
Intel:
- DPDK: Data Plane Development Kit
- Kernel bypass
- Latency: ~1-2 μs
Mellanox (NVIDIA):
- VMA (Mellanox Messaging Accelerator)
- RDMA support
- Latency: ~1-2 μs
Co-Location¶
Data Centers:
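- NY4 (Secaucus, NJ): Equinix hub for US equities; the NYSE matching engine itself is in Mahwah, NJ
- CH4 (Chicago, IL): Equinix hub for Chicago markets; the CME matching engine is in Aurora, IL
- LD4 (Slough, UK): London financial hub
- TY3 (Tokyo): Tokyo financial hub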
Benefits:
- Direct fiber to matching engine
- Shared network infrastructure
- Reduced physical distance
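To put "reduced physical distance" in numbers, the sketch below estimates one-way propagation delay through optical fiber. The refractive index of 1.47 is an assumed typical value for single-mode fiber, and the function name is illustrative. Real paths add switch, serialization, and routing delay on top.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # Speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.47          # Assumed typical value for single-mode fiber

def fiber_delay_us(distance_km: float) -> float:
    """One-way propagation delay through fiber, in microseconds."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_PER_S / FIBER_REFRACTIVE_INDEX
    return distance_km / speed_in_fiber * 1e6

for km in (0.1, 1, 50):
    print(f"{km:>5} km -> {fiber_delay_us(km):.2f} us one-way")
```

At roughly 5 μs per kilometer, 50 km of fiber costs about 245 μs each way, more than ten times the 20 μs worst-case HFT budget.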
Processors¶
CPU Selection Criteria:
- High clock speed (> 4 GHz boost)
- Low latency cores (not high core count)
- Large L1/L2 cache
- Support for CPU pinning
Examples:
- Intel Xeon (high frequency variants)
- AMD EPYC (with CPU pinning)
- Overclocked consumer CPUs (for research)
Avoid:
- Power-saving features (C-states, P-states)
- Hyper-threading for critical path
- NUMA crossings
FPGA (Field Programmable Gate Array)¶
Use Cases:
- Market data parsing (binary protocols)
- Order encoding
- Risk checks
- Strategy logic (simple strategies)
Vendors:
- Xilinx (now AMD): Alveo, Virtex UltraScale+
- Intel (Altera): Stratix, Arria
Latency: < 1 μs end-to-end
Power: 10-50W
Complexity: Very high (Verilog/VHDL)
Software Architecture¶
Kernel Bypass¶
Standard kernel path:
NIC → Kernel → TCP/IP stack → Socket → Application
Latency: 10-100 μs
Kernel bypass:
NIC → User-space library → Application
Latency: 0.5-2 μs
Techniques:
1. Solarflare OpenOnload
2. Intel DPDK
3. Mellanox VMA
4. Custom NIC drivers
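Kernel-bypass stacks are usually paired with busy polling: the application spins on the receive queue instead of sleeping in a blocking call. The sketch below shows only that polling pattern, using an ordinary non-blocking UDP socket over loopback. It still goes through the kernel and does not use OpenOnload, DPDK, or VMA. The helper name is an assumption for illustration.

```python
import socket
import time

def poll_one_datagram(sock: socket.socket, timeout_s: float = 1.0):
    """Spin on a non-blocking socket until a datagram arrives or the timeout expires."""
    deadline = time.perf_counter() + timeout_s
    while time.perf_counter() < deadline:
        try:
            return sock.recv(2048)
        except BlockingIOError:
            continue  # Nothing queued yet: keep spinning instead of sleeping
    return None

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))   # Let the OS choose a free port
rx.setblocking(False)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"tick", rx.getsockname())

payload = poll_one_datagram(rx)
print(payload)
tx.close()
rx.close()
```

OpenOnload and VMA accelerate the standard sockets API, so application code written this way can often run on them unchanged. DPDK and EF_VI expose their own receive APIs instead.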
CPU Pinning and Isolation¶
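import ctypes
import os
import resource

def isolate_cpu(cpu_id: int):
    """Pin the current process to one CPU core and select the performance governor."""
    os.sched_setaffinity(0, {cpu_id})
    # Set the CPU governor to performance for this core (requires root)
    # CLI equivalent: cpupower -c <cpu_id> frequency-set -g performance
    path = f"/sys/devices/system/cpu/cpu{cpu_id}/cpufreq/scaling_governor"
    with open(path, "w") as f:
        f.write("performance")

def optimize_process():
    """Optimize the current process for low latency (requires root or equivalent capabilities)."""
    # Raise the locked-memory limit so mlockall can succeed
    resource.setrlimit(resource.RLIMIT_MEMLOCK, (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
    # Lock current and future pages in RAM (prevent swapping)
    libc = ctypes.CDLL('libc.so.6', use_errno=True)
    MCL_CURRENT, MCL_FUTURE = 1, 2
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        raise OSError(ctypes.get_errno(), "mlockall failed")
    # Set real-time FIFO scheduling; priority 99 can starve kernel threads,
    # so a lower value may be safer in production
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(99))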
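Pinning is worth verifying, because a container or cgroup can restrict which cores a process may use. The sketch below pins the process to one allowed core, reads the mask back, and then restores the original mask. It is Linux-only, and the helper name is an assumption for illustration.

```python
import os

def pin_and_verify(cpu_id: int) -> bool:
    """Pin the calling process to cpu_id and confirm the kernel applied it."""
    os.sched_setaffinity(0, {cpu_id})
    return os.sched_getaffinity(0) == {cpu_id}

original = os.sched_getaffinity(0)   # CPUs this process may currently use
target = min(original)               # Pick any allowed core for the demo
pinned = pin_and_verify(target)
print(f"pinned to CPU {target}: {pinned}")
os.sched_setaffinity(0, original)    # Restore the original mask
```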
Lock-Free Data Structures¶
from ctypes import c_int64

class LockFreeQueue:
    """Single-producer, single-consumer lock-free queue.

    Correct only with exactly one producer thread and one consumer thread.
    In CPython the GIL makes each index update atomic; a C++ or Rust version
    would need atomic loads/stores with acquire/release ordering.
    """
    def __init__(self, capacity: int = 1024):
        # One slot stays empty to tell "full" from "empty",
        # so at most capacity - 1 items are stored
        self.buffer = [None] * capacity
        self.capacity = capacity
        self._head = c_int64(0)  # Next slot to read (written only by the consumer)
        self._tail = c_int64(0)  # Next slot to write (written only by the producer)

    def push(self, item) -> bool:
        """Push item to queue (producer)."""
        tail = self._tail.value
        next_tail = (tail + 1) % self.capacity
        if next_tail == self._head.value:
            return False  # Queue full
        self.buffer[tail] = item
        self._tail.value = next_tail  # Publish only after the slot is filled
        return True

    def pop(self):
        """Pop item from queue (consumer)."""
        head = self._head.value
        if head == self._tail.value:
            return None  # Queue empty
        item = self.buffer[head]
        self._head.value = (head + 1) % self.capacity
        return item

class RingBuffer:
    """Fixed-size ring buffer for market data (same layout, plain int indices)."""
    def __init__(self, size: int = 65536):
        self.size = size
        self.buffer = [None] * size
        self.write_idx = 0
        self.read_idx = 0

    def write(self, data) -> bool:
        """Write to ring buffer."""
        next_idx = (self.write_idx + 1) % self.size
        if next_idx == self.read_idx:
            return False  # Buffer full
        self.buffer[self.write_idx] = data
        self.write_idx = next_idx
        return True

    def read(self):
        """Read from ring buffer."""
        if self.read_idx == self.write_idx:
            return None  # Buffer empty
        data = self.buffer[self.read_idx]
        self.read_idx = (self.read_idx + 1) % self.size
        return data
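A single-producer, single-consumer ring is used with one thread on each side. The sketch below is a minimal, self-contained version of the same idea with a producer and a consumer thread, and checks that items arrive in order. The class and function names are illustrative, and correctness here relies on CPython's GIL. The `time.sleep(0)` calls only yield the interpreter; a native implementation would busy-spin on a pinned core.

```python
import threading
import time

class SpscQueue:
    """Minimal single-producer, single-consumer ring (one slot is kept empty)."""
    def __init__(self, capacity: int = 256):
        self.buffer = [None] * capacity
        self.capacity = capacity
        self.head = 0  # Written only by the consumer
        self.tail = 0  # Written only by the producer

    def push(self, item) -> bool:
        next_tail = (self.tail + 1) % self.capacity
        if next_tail == self.head:
            return False  # Full
        self.buffer[self.tail] = item
        self.tail = next_tail  # Publish after the slot is filled
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # Empty
        item = self.buffer[self.head]
        self.head = (self.head + 1) % self.capacity
        return item

def producer(q: SpscQueue, n: int) -> None:
    for i in range(n):
        while not q.push(i):
            time.sleep(0)  # Queue full: yield so the consumer can drain it

def consumer(q: SpscQueue, n: int, out: list) -> None:
    while len(out) < n:
        item = q.pop()
        if item is None:
            time.sleep(0)  # Queue empty: yield so the producer can refill it
        else:
            out.append(item)

N = 10_000
queue = SpscQueue()
received = []
threads = [
    threading.Thread(target=producer, args=(queue, N)),
    threading.Thread(target=consumer, args=(queue, N, received)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received == list(range(N)))
```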
Memory Management¶
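import ctypes

class PreallocatedMemory:
    """Pre-allocate memory to avoid runtime allocation."""
    def __init__(self, size_mb: int = 256):
        self.size = size_mb * 1024 * 1024
        # Pre-allocate and zero-fill
        self.memory = bytearray(self.size)
        # Lock in physical memory (fails if RLIMIT_MEMLOCK is too low)
        libc = ctypes.CDLL('libc.so.6', use_errno=True)
        # c_char_p cannot wrap a bytearray; expose the buffer as a ctypes array instead
        self._raw = (ctypes.c_char * self.size).from_buffer(self.memory)
        if libc.mlock(ctypes.byref(self._raw), ctypes.c_size_t(self.size)) != 0:
            raise OSError(ctypes.get_errno(), "mlock failed")

    def get_buffer(self, offset: int, size: int) -> memoryview:
        """Get a zero-copy view into the pre-allocated memory."""
        # Slicing a bytearray copies; slicing a memoryview does not
        return memoryview(self.memory)[offset:offset+size]

class ObjectPool:
    """Pool of reusable objects to avoid allocation and GC pressure on the hot path."""
    def __init__(self, factory, size: int = 1000):
        self.pool = [factory() for _ in range(size)]
        self._index = 0  # Number of objects currently handed out

    def acquire(self):
        """Get object from pool."""
        if self._index < len(self.pool):
            obj = self.pool[self._index]
            self._index += 1
            return obj
        return None  # Pool exhausted

    def release(self, obj):
        """Return object to pool."""
        if self._index == 0:
            raise ValueError("release() called with no objects outstanding")
        self._index -= 1
        self.pool[self._index] = obj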
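One way to use pre-allocated memory on the hot path is to serialize fixed-size records directly into it. The sketch below packs ticks into slots of a single `bytearray` with `struct.Struct.pack_into`, so writing a tick creates no new bytes object. The tick layout and the function names are hypothetical and chosen only for illustration.

```python
import struct

# Hypothetical fixed-size tick layout: timestamp_ns (int64), price (float64), size (int32)
TICK = struct.Struct("<qdi")
SLOTS = 1024

# One allocation up front; the hot path only reads and writes inside it.
arena = bytearray(TICK.size * SLOTS)
view = memoryview(arena)

def write_tick(slot: int, ts_ns: int, price: float, size: int) -> None:
    """Serialize a tick into its slot without creating a new bytes object."""
    TICK.pack_into(view, slot * TICK.size, ts_ns, price, size)

def read_tick(slot: int):
    """Decode the tick stored in a slot (returns a new tuple)."""
    return TICK.unpack_from(view, slot * TICK.size)

write_tick(3, 1_700_000_000_000_000_000, 101.25, 500)
print(read_tick(3))
```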
System Configuration¶
Linux Kernel Tuning¶
# /etc/sysctl.conf
# Scheduler tunables (on kernels 5.13 and later these moved to /sys/kernel/debug/sched/)
kernel.sched_latency_ns = 1000000
kernel.sched_min_granularity_ns = 500000
# Network optimization
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Disable interrupt coalescing (shell command, not a sysctl setting)
# ethtool -C eth0 rx-usecs 0 tx-usecs 0
# Disable NUMA balancing
kernel.numa_balancing = 0
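# Isolate CPU cores (kernel boot parameters, not sysctl settings;
# add them to the kernel command line, e.g. GRUB_CMDLINE_LINUX in /etc/default/grub)
# isolcpus=2,3,4,5 nohz_full=2,3,4,5 rcu_nocbs=2,3,4,5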
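Boot parameters are easy to get wrong, so it helps to confirm what the kernel actually booted with. The sketch below extracts the `isolcpus=` set from a kernel command line string. The function names and the sample string are assumptions for illustration, and the parser handles only plain CPU numbers and ranges, skipping flags such as `managed_irq`.

```python
def parse_cpu_list(spec: str) -> set:
    """Expand a kernel CPU list such as '2-5,8' into {2, 3, 4, 5, 8}."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            if first.isdigit() and last.isdigit():
                cpus.update(range(int(first), int(last) + 1))
        elif part.isdigit():
            cpus.add(int(part))
        # Anything else (empty parts, flags like "domain") is ignored
    return cpus

def isolated_cpus_from_cmdline(cmdline: str) -> set:
    """Extract the isolcpus= CPU set from a kernel command line string."""
    for token in cmdline.split():
        if token.startswith("isolcpus="):
            return parse_cpu_list(token.split("=", 1)[1])
    return set()

# Sample string for illustration; on a live host read /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz ro isolcpus=2,3,4,5 nohz_full=2,3,4,5 rcu_nocbs=2,3,4,5"
print(sorted(isolated_cpus_from_cmdline(sample)))  # prints [2, 3, 4, 5]
```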
BIOS Settings¶
- Disable C-states (CPU sleep states)
- Disable P-states (CPU frequency scaling)
- Disable Hyper-threading (for critical cores)
- Enable Turbo Boost
- Set power profile to "Performance"
- Disable unused peripherals
- Set PCIe to Gen 4/5
Latency Measurement¶
import time
import numpy as np

class LatencyMeter:
    """Record timestamps at named points and summarize the intervals between them."""
    def __init__(self):
        self.measurements = {}

    def measure(self, name: str):
        """Record a timestamp (ns) for the named measurement point."""
        self.measurements.setdefault(name, []).append(time.perf_counter_ns())

    def get_stats(self, name: str) -> dict:
        """Get statistics over the intervals between consecutive timestamps."""
        timestamps = self.measurements.get(name, [])
        if len(timestamps) < 2:
            return {}
        # Interval between each timestamp and the one before it
        diffs = np.diff(timestamps)
        return {
            'count': len(diffs),
            'mean_ns': np.mean(diffs),
            'median_ns': np.median(diffs),
            'p50_ns': np.percentile(diffs, 50),
            'p90_ns': np.percentile(diffs, 90),
            'p99_ns': np.percentile(diffs, 99),
            'p999_ns': np.percentile(diffs, 99.9),
            'max_ns': np.max(diffs),
            'min_ns': np.min(diffs),
        }

    def measure_tick_to_trade(self, market_data_time: int,
                              order_sent_time: int) -> int:
        """Tick-to-trade latency in ns; both timestamps must come from the same clock."""
        return order_sent_time - market_data_time
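# Example usage
# process_market_data and generate_order are placeholder stubs; substitute real handlers
def process_market_data(data):
    return data

def generate_order(signal):
    return {"signal": signal}

data, signal = b"tick", 1
meter = LatencyMeter()
# Measure market data processing
md_start = time.perf_counter_ns()
process_market_data(data)
md_end = time.perf_counter_ns()
# Measure order generation
order_start = time.perf_counter_ns()
generate_order(signal)
order_end = time.perf_counter_ns()
print(f"Market data processing: {(md_end - md_start) / 1000:.1f} μs")
print(f"Order generation: {(order_end - order_start) / 1000:.1f} μs")
print(f"Tick-to-trade: {meter.measure_tick_to_trade(md_start, order_end) / 1000:.1f} μs")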
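At microsecond scale the cost of taking a timestamp is itself a meaningful part of the result. The sketch below times an empty function to approximate that overhead and reports tail percentiles using only the standard library. It uses the nearest-rank method, which can differ slightly from NumPy's default linear interpolation, and the function names are assumptions for illustration.

```python
import math
import time

def percentile_nearest_rank(samples, p: float):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]

def time_call_ns(fn, repeats: int = 1000):
    """Return per-call durations in nanoseconds for fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        durations.append(time.perf_counter_ns() - start)
    return durations

# Timing an empty function approximates the overhead of the measurement itself.
overhead = time_call_ns(lambda: None)
for p in (50, 99, 99.9):
    print(f"p{p}: {percentile_nearest_rank(overhead, p)} ns")
```

Subtracting the median overhead from hot-path measurements gives a rough correction. The tail percentiles of the overhead show how much jitter the timing loop itself contributes.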
Checklist¶
- [ ] Latency budget defined for each component
- [ ] Kernel bypass enabled (OpenOnload/DPDK)
- [ ] CPU pinning configured
- [ ] Memory pre-allocated (no runtime allocation)
- [ ] Lock-free data structures used
- [ ] GC disabled or minimized (for managed runtimes such as Java or Python; not applicable to C++/Rust)
- [ ] Network tuned (interrupt coalescing disabled)
- [ ] Co-location evaluated for key venues
- [ ] Latency measured end-to-end
- [ ] P99 and P99.9 latency tracked (not just average)
- [ ] System monitored for jitter
- [ ] Backup/failover system tested