Low-Latency Architecture

Overview

Low-latency trading systems are designed to minimize the time between receiving market data and sending orders. Every microsecond matters in HFT and competitive execution. This document covers the architecture, hardware, and software techniques for building ultra-low-latency systems.

Difficulty: expert

Latency Budget

Total Latency = Market Data Receive + Processing + Order Send

Typical Budget for HFT:
- Market Data Receive: 1-5 μs
- Processing: 1-10 μs
- Order Send: 1-5 μs
- Total: 3-20 μs

Typical Budget for Statistical Arbitrage:
- Market Data Receive: 10-100 μs
- Processing: 100-1000 μs
- Order Send: 10-100 μs
- Total: 120-1200 μs

Where:
- Market Data Receive: time from wire arrival to an in-memory parsed tick
- Processing: strategy and risk-check time on the hot path
- Order Send: time from the trading decision to the order leaving the NIC

This tick-to-trade budget drives every engineering decision in a competitive HFT stack: kernel bypass, CPU pinning, FPGA acceleration. Because the stages add, a microsecond saved in any stage shortens tick-to-trade by the same amount. Profile the longest stage first, since it usually has the most room for improvement.
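
The budget arithmetic above can be checked in a few lines. This is a minimal sketch: the `BUDGETS_US` table only restates the ranges listed in this section, and the helper names `total_budget` and `largest_stage` are illustrative.

```python
# Stage budgets in microseconds as (low, high), copied from the lists above.
BUDGETS_US = {
    "hft": {
        "market_data_receive": (1, 5),
        "processing": (1, 10),
        "order_send": (1, 5),
    },
    "stat_arb": {
        "market_data_receive": (10, 100),
        "processing": (100, 1000),
        "order_send": (10, 100),
    },
}


def total_budget(stages):
    """Sum the low and high ends of every stage (the stages run back to back)."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def largest_stage(stages):
    """Return the name of the stage with the largest worst-case budget."""
    return max(stages, key=lambda name: stages[name][1])


for profile, stages in BUDGETS_US.items():
    low, high = total_budget(stages)
    print(f"{profile}: {low}-{high} us, largest stage: {largest_stage(stages)}")
```

In both profiles the processing stage has the largest worst-case allowance, which is why it is usually the first place to profile.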

Hardware Architecture

Network Interface Cards (NICs)

Solarflare (acquired by Xilinx, now part of AMD):
- OpenOnload: Kernel bypass TCP/IP stack
- ef_vi: Direct NIC access
- Latency: ~0.5-1 μs

Intel:
- DPDK: Data Plane Development Kit
- Kernel bypass
- Latency: ~1-2 μs

Mellanox (NVIDIA):
- VMA (Messaging Accelerator): kernel bypass sockets library
- RDMA support
- Latency: ~1-2 μs

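Co-Location

Data Centers:
- NY4 (Secaucus, NJ): Equinix campus, a hub for US equities and FX connectivity. NYSE's matching engines are in the ICE data center in Mahwah, NJ.
- CH4 (Chicago, IL): Equinix site used for Chicago connectivity. CME's matching engines are in the Aurora, IL data center.
- LD4 (Slough, UK): London financial hub
- TY3 (Tokyo): Equinix site serving the Tokyo markets. JPX (TSE) operates its own co-location service.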

Benefits:
- Direct fiber to matching engine
- Shared network infrastructure
- Reduced physical distance
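
The value of reduced physical distance can be estimated from the propagation delay of light in fiber. The sketch below assumes a refractive index of about 1.468 for silica fiber (a typical figure, not taken from this document); real paths add switch, serialization, and routing delays on top.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in vacuum, exact by definition
FIBER_REFRACTIVE_INDEX = 1.468         # assumed typical value for silica fiber


def fiber_delay_us(distance_km: float,
                   refractive_index: float = FIBER_REFRACTIVE_INDEX) -> float:
    """One-way propagation delay over `distance_km` of fiber, in microseconds."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_PER_S / refractive_index
    return distance_km / speed_in_fiber * 1e6


for km in (0.1, 1.0, 50.0):
    print(f"{km:>5} km -> {fiber_delay_us(km):.2f} us one way")
```

At roughly 5 μs per kilometer, a server 50 km from the matching engine spends far more than the entire HFT budget above on propagation alone.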

Processors

CPU Selection Criteria:
- High clock speed (> 4 GHz boost)
- Low latency cores (not high core count)
- Large L1/L2 cache
- Enough cores to dedicate (pin) one to each hot-path thread

Examples:
- Intel Xeon (high frequency variants)
- AMD EPYC (with CPU pinning)
- Overclocked consumer CPUs (for research)

Avoid:
- Power-saving features (C-states, P-states)
- Hyper-threading for critical path
- NUMA crossings

FPGA (Field-Programmable Gate Array)

Use Cases:
- Market data parsing (binary protocols)
- Order encoding
- Risk checks
- Strategy logic (simple strategies)

Vendors:
- Xilinx (now AMD): Alveo, Virtex UltraScale+
- Intel (Altera): Stratix, Arria

Latency: < 1 μs end-to-end
Power: 10-50W
Complexity: Very high (Verilog/VHDL)
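
Binary market data protocols suit FPGAs because every field sits at a fixed offset, so decoding is a matter of slicing bytes. A software equivalent of that parsing step might look like the sketch below. The 25-byte message layout is hypothetical and invented for illustration; it does not correspond to any exchange's protocol.

```python
import struct
from typing import NamedTuple

# Hypothetical little-endian layout (25 bytes), for illustration only:
#   type (1 byte) | instrument id (u32) | price in 1e-4 units (i64)
#   | quantity (u32) | exchange timestamp in ns (u64)
TICK_LAYOUT = struct.Struct("<cIqIQ")


class Tick(NamedTuple):
    msg_type: bytes
    instrument_id: int
    price_e4: int       # fixed-point price, avoids floats on the hot path
    quantity: int
    exchange_ts_ns: int


def parse_tick(buf, offset: int = 0) -> Tick:
    """Decode one tick from `buf` at `offset` without copying the buffer."""
    return Tick._make(TICK_LAYOUT.unpack_from(buf, offset))


wire = TICK_LAYOUT.pack(b"T", 42, 1_234_500, 100, 1_700_000_000_000_000_000)
tick = parse_tick(wire)
print(tick.instrument_id, tick.price_e4 / 10_000, tick.quantity)
```

Keeping the price as a fixed-point integer mirrors what hardware parsers typically do, since integer comparison is cheaper than floating point.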

Software Architecture

Kernel Bypass

Standard kernel path:
NIC → Kernel → TCP/IP stack → Socket → Application
Latency: 10-100 μs

Kernel bypass:
NIC → User-space library → Application
Latency: 0.5-2 μs

Techniques:
1. Solarflare OpenOnload
2. Intel DPDK
3. Mellanox VMA
4. Custom NIC drivers
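
A kernel bypass stack cannot be demonstrated in pure Python, but the application-side pattern it relies on can: a non-blocking socket polled in a busy loop rather than a thread sleeping in a blocking call. The sketch below still goes through the normal kernel path over loopback; it only illustrates the polling style. `spin_recv` is an illustrative helper name.

```python
import socket
import time


def spin_recv(sock: socket.socket, timeout_s: float = 1.0, bufsize: int = 2048):
    """Busy-poll a non-blocking socket; return a datagram, or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return sock.recv(bufsize)
        except BlockingIOError:
            continue  # nothing yet: spin instead of sleeping in the kernel
    return None


# Loopback demo: one socket sends, the other spins until the datagram arrives.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
rx.setblocking(False)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"tick", rx.getsockname())

print(spin_recv(rx))
tx.close()
rx.close()
```

Spinning trades a fully occupied core for the removal of wake-up latency, which is why it is normally combined with a dedicated, isolated core.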

CPU Pinning and Isolation

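import ctypes
import ctypes.util
import os
import resource

# mlockall() flags from <sys/mman.h> (values for Linux on x86 and ARM)
MCL_CURRENT = 1
MCL_FUTURE = 2


def isolate_cpu(cpu_id: int) -> None:
    """Pin the calling thread to one core and select the performance governor."""
    os.sched_setaffinity(0, {cpu_id})

    # Pinning does not stop frequency scaling, so request the "performance"
    # governor for this core. Writing to sysfs requires root.
    governor = f"/sys/devices/system/cpu/cpu{cpu_id}/cpufreq/scaling_governor"
    with open(governor, "w") as f:
        f.write("performance")


def optimize_process(priority: int = 90) -> None:
    """Lock memory and switch to real-time scheduling (requires root)."""
    # Raise the locked-memory limit so that mlockall() is allowed to succeed
    resource.setrlimit(resource.RLIMIT_MEMLOCK,
                       (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

    # Lock current and future pages in RAM (prevents swapping)
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

    # Real-time FIFO scheduling. Staying below the maximum priority (99)
    # leaves room for critical kernel threads to run.
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))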
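
Pinning can be verified by reading the affinity mask back. The sketch below assumes Linux (where `os.sched_getaffinity` is available) and needs no special privileges; `pinned_to` is an illustrative helper that restores the previous mask on exit, which is convenient in tests.

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned_to(cpu_id: int):
    """Temporarily pin the calling thread to `cpu_id`, then restore the old mask."""
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {cpu_id})
    try:
        yield
    finally:
        os.sched_setaffinity(0, previous)


allowed = os.sched_getaffinity(0)
target = min(allowed)  # pick any CPU the process is already allowed to use
with pinned_to(target):
    print("pinned:", os.sched_getaffinity(0))
print("restored:", os.sched_getaffinity(0))
```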

Lock-Free Data Structures

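from ctypes import c_int64

class LockFreeQueue:
    """Single-producer, single-consumer (SPSC) ring queue.

    Illustrates the index logic only. In CPython the GIL keeps the index
    updates safe; c_int64 merely holds the counters and is not atomic.
    A C++ or Rust version needs atomic head/tail with acquire/release ordering.
    One slot stays empty to tell "full" from "empty", so the queue holds
    at most capacity - 1 items.
    """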

    def __init__(self, capacity: int = 1024):
        self.buffer = [None] * capacity
        self.capacity = capacity
        self._head = c_int64(0)
        self._tail = c_int64(0)

    def push(self, item) -> bool:
        """Push item to queue (producer)."""
        tail = self._tail.value
        next_tail = (tail + 1) % self.capacity

        if next_tail == self._head.value:
            return False  # Queue full

        self.buffer[tail] = item
        self._tail.value = next_tail
        return True

    def pop(self):
        """Pop item from queue (consumer)."""
        head = self._head.value

        if head == self._tail.value:
            return None  # Queue empty

        item = self.buffer[head]
        self._head.value = (head + 1) % self.capacity
        return item


class RingBuffer:
    """Fixed-size ring buffer for market data."""

    def __init__(self, size: int = 65536):
        self.size = size
        self.buffer = [None] * size
        self.write_idx = 0
        self.read_idx = 0

    def write(self, data) -> bool:
        """Write to ring buffer."""
        next_idx = (self.write_idx + 1) % self.size
        if next_idx == self.read_idx:
            return False  # Buffer full
        self.buffer[self.write_idx] = data
        self.write_idx = next_idx
        return True

    def read(self):
        """Read from ring buffer."""
        if self.read_idx == self.write_idx:
            return None  # Buffer empty
        data = self.buffer[self.read_idx]
        self.read_idx = (self.read_idx + 1) % self.size
        return data
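
A variant of the same idea uses a power-of-two capacity with free-running counters, so wrapping becomes a bit mask and no slot is wasted. This is a minimal sketch that relies on CPython's GIL for visibility between threads. `SpscRing` and `transfer` are illustrative names, and `None` is reserved as the "empty" marker.

```python
import threading
import time


class SpscRing:
    """Bounded single-producer/single-consumer ring; capacity is a power of two."""

    def __init__(self, capacity: int = 1024):
        if capacity < 2 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two >= 2")
        self._buf = [None] * capacity
        self._mask = capacity - 1
        self._head = 0  # items consumed so far; written only by the consumer
        self._tail = 0  # items produced so far; written only by the producer

    def push(self, item) -> bool:
        if self._tail - self._head > self._mask:
            return False  # full
        self._buf[self._tail & self._mask] = item
        self._tail += 1  # publish only after the slot is filled
        return True

    def pop(self):
        if self._head == self._tail:
            return None  # empty
        item = self._buf[self._head & self._mask]
        self._head += 1
        return item


def transfer(n: int, capacity: int = 1024) -> list:
    """Move the integers 0..n-1 through the ring using two threads."""
    ring, received = SpscRing(capacity), []

    def producer():
        for i in range(n):
            while not ring.push(i):
                time.sleep(0)  # yield the GIL; native code would spin instead

    def consumer():
        while len(received) < n:
            item = ring.pop()
            if item is None:
                time.sleep(0)
            else:
                received.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(transfer(10_000) == list(range(10_000)))
```

Each counter has exactly one writer, which is the property that lets an SPSC queue avoid locks; in a compiled language the same layout needs atomic loads and stores.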

Memory Management

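import ctypes
import ctypes.util
import os


class PreallocatedMemory:
    """Pre-allocate memory to avoid runtime allocation."""

    def __init__(self, size_mb: int = 256):
        self.size = size_mb * 1024 * 1024
        # Pre-allocate and zero-fill
        self.memory = bytearray(self.size)
        self._view = memoryview(self.memory)

        # Lock in physical memory (needs a sufficient RLIMIT_MEMLOCK or root)
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
        c_buf = (ctypes.c_char * self.size).from_buffer(self.memory)
        if libc.mlock(ctypes.addressof(c_buf), self.size) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))

    def get_buffer(self, offset: int, size: int) -> memoryview:
        """Get a zero-copy view into the pre-allocated memory."""
        # Slicing the bytearray itself would copy; a memoryview slice does not.
        return self._view[offset:offset + size]


class ObjectPool:
    """Pool of reusable objects to avoid allocation and GC pressure."""

    def __init__(self, factory, size: int = 1000):
        self.pool = [factory() for _ in range(size)]
        self._index = 0  # pool[:_index] is handed out, pool[_index:] is free

    def acquire(self):
        """Get object from pool, or None if the pool is exhausted."""
        if self._index < len(self.pool):
            obj = self.pool[self._index]
            self._index += 1
            return obj
        return None  # Pool exhausted

    def release(self, obj):
        """Return object to pool."""
        if self._index == 0:
            raise ValueError("release() called more often than acquire()")
        self._index -= 1
        self.pool[self._index] = obj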
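
In Python specifically, pooling is usually paired with pausing the cyclic garbage collector around the hot path. The sketch below uses only the standard `gc` module (`gc.freeze` requires Python 3.7 or later); `gc_paused` is an illustrative helper name. Reference counting still frees objects while the collector is paused.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cyclic garbage collector for a block, then restore its state."""
    was_enabled = gc.isenabled()
    gc.collect()   # pay the collection cost up front, off the hot path
    gc.freeze()    # park surviving objects so later collections skip them
    gc.disable()
    try:
        yield
    finally:
        gc.unfreeze()
        if was_enabled:
            gc.enable()


with gc_paused():
    # Stand-in for the latency-critical loop
    total = sum(i * i for i in range(1000))
print(total, gc.isenabled())
```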

System Configuration

Linux Kernel Tuning

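# /etc/sysctl.conf
# CFS scheduler tunables. These are sysctls only on kernels before 5.13;
# later kernels expose scheduler tunables under /sys/kernel/debug/sched/.
kernel.sched_latency_ns = 1000000
kernel.sched_min_granularity_ns = 500000

# Network buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Disable automatic NUMA balancing
kernel.numa_balancing = 0

# The settings below are not sysctl entries.

# Disable interrupt coalescing (shell command, per interface)
# ethtool -C eth0 rx-usecs 0 tx-usecs 0

# Isolate CPU cores (kernel boot parameters, set on the kernel command line)
# isolcpus=2,3,4,5 nohz_full=2,3,4,5 rcu_nocbs=2,3,4,5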
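
Because the core-isolation options are boot parameters, it helps to confirm at startup that the running kernel actually received them. The sketch below reads `/proc/cmdline` on Linux; `parse_cmdline` and `check_isolation` are illustrative helper names.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {key: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def check_isolation(cmdline: str,
                    required=("isolcpus", "nohz_full", "rcu_nocbs")) -> list:
    """Return the isolation parameters that are missing or empty in `cmdline`."""
    params = parse_cmdline(cmdline)
    return [name for name in required if not params.get(name)]


try:
    running = Path("/proc/cmdline").read_text()
except OSError:
    running = ""  # not Linux, or /proc is unavailable
print("missing isolation parameters:", check_isolation(running) or "none")
```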

BIOS Settings

- Disable C-states (CPU sleep states)
- Disable P-states (CPU frequency scaling)
- Disable Hyper-threading (for critical cores)
- Enable Turbo Boost
- Set power profile to "Performance"
- Disable unused peripherals
- Set PCIe to Gen 4/5
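
Whether the C-state setting took effect can be checked from the operating system. The sketch below lists the idle states Linux exposes for one CPU through the `cpuidle` sysfs directory; it returns an empty list where that directory does not exist (for example in some containers or virtual machines). `list_cstates` is an illustrative helper name.

```python
from pathlib import Path


def list_cstates(cpu_id: int = 0) -> list:
    """Return [(name, exit_latency_us, disabled)] for one CPU, or [] if unavailable."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu_id}/cpuidle")
    if not base.is_dir():
        return []
    states = []
    # Sort numerically so that state10 comes after state2
    for state_dir in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        try:
            name = (state_dir / "name").read_text().strip()
            latency_us = int((state_dir / "latency").read_text())
            disabled = (state_dir / "disable").read_text().strip() == "1"
        except (OSError, ValueError):
            continue  # skip states whose attributes cannot be read
        states.append((name, latency_us, disabled))
    return states


for name, latency_us, disabled in list_cstates(0):
    print(f"{name}: exit latency {latency_us} us, disabled={disabled}")
```

Deep states with large exit latencies that are still enabled are a common source of jitter after an idle period.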

Latency Measurement

import time

import numpy as np


class LatencyMeter:
    """Record timestamps per named event and summarize the gaps between them."""

    def __init__(self):
        self.measurements = {}  # name -> list of perf_counter_ns() timestamps

    def measure(self, name: str):
        """Record a timestamp for the named event."""
        self.measurements.setdefault(name, []).append(time.perf_counter_ns())

    def get_stats(self, name: str) -> dict:
        """Get statistics over the intervals between consecutive timestamps."""
        timestamps = self.measurements.get(name, [])
        if len(timestamps) < 2:
            return {}

        diffs = np.diff(timestamps)
        return {
            'count': len(diffs),
            'mean_ns': np.mean(diffs),
            'median_ns': np.median(diffs),
            'p50_ns': np.percentile(diffs, 50),
            'p90_ns': np.percentile(diffs, 90),
            'p99_ns': np.percentile(diffs, 99),
            'p999_ns': np.percentile(diffs, 99.9),
            'max_ns': np.max(diffs),
            'min_ns': np.min(diffs),
        }

    def measure_tick_to_trade(self, market_data_time: int,
                              order_sent_time: int) -> int:
        """Tick-to-trade latency; both timestamps must come from the same clock."""
        return order_sent_time - market_data_time


# Example usage
# process_market_data and generate_order are placeholders for the real hot path.
def process_market_data(data):
    return sum(data)


def generate_order(signal):
    return {"side": "buy" if signal > 0 else "sell", "qty": 100}


meter = LatencyMeter()
data = [1.0, 2.0, 3.0]

# Measure market data processing
md_start = time.perf_counter_ns()
signal = process_market_data(data)
md_end = time.perf_counter_ns()

# Measure order generation
order_start = time.perf_counter_ns()
generate_order(signal)
order_end = time.perf_counter_ns()

print(f"Market data processing: {(md_end - md_start) / 1000:.1f} μs")
print(f"Order generation: {(order_end - order_start) / 1000:.1f} μs")
print(f"Tick-to-trade: {meter.measure_tick_to_trade(md_start, order_end) / 1000:.1f} μs")
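
Tail percentiles matter because the mean hides rare slow events. The sketch below uses synthetic samples (990 at 1 μs and 10 at 50 μs, invented for illustration) and a nearest-rank percentile written with integer arithmetic; `percentile` is an illustrative helper, not the NumPy function.

```python
def percentile(sorted_samples, numerator: int, denominator: int = 100):
    """Nearest-rank percentile with integer math, e.g. (999, 1000) for p99.9."""
    n = len(sorted_samples)
    rank = -(-numerator * n // denominator)  # ceil(numerator * n / denominator)
    return sorted_samples[max(rank, 1) - 1]


# Synthetic latencies in nanoseconds: 1% of events hit a 50 us stall.
samples_ns = sorted([1_000] * 990 + [50_000] * 10)

mean_ns = sum(samples_ns) / len(samples_ns)
print(f"mean  : {mean_ns:.0f} ns")
print(f"p50   : {percentile(samples_ns, 50)} ns")
print(f"p99   : {percentile(samples_ns, 99)} ns")
print(f"p99.9 : {percentile(samples_ns, 999, 1000)} ns")
print(f"max   : {samples_ns[-1]} ns")
```

With these samples the mean moves only from 1 μs to about 1.5 μs and p99 still reads 1 μs, while p99.9 exposes the 50 μs stalls.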

Checklist

[ ] Latency budget defined for each component
[ ] Kernel bypass enabled (OpenOnload/DPDK)
[ ] CPU pinning configured
[ ] Memory pre-allocated (no runtime allocation)
[ ] Lock-free data structures used
[ ] GC disabled or minimized (for managed runtimes such as Java, C#, or Python; C++ and Rust have no GC)
[ ] Network tuned (interrupt coalescing disabled)
[ ] Co-location evaluated for key venues
[ ] Latency measured end-to-end
[ ] P99 and P99.9 latency tracked (not just average)
[ ] System monitored for jitter
[ ] Backup/failover system tested
