
Monitoring and Alerting

Difficulty: expert

Overview

Real-time monitoring and alerting verify that a trading system is operating correctly and surface anomalies quickly enough to act on them, ideally before they affect orders or positions.

What to Monitor

System Health

Metric          | Threshold   | Action
----------------|-------------|------------------------
CPU Usage       | > 80%       | Scale up / investigate
Memory Usage    | > 85%       | Restart / investigate
Disk Usage      | > 90%       | Clean up / expand
Network Latency | > 100ms     | Check connectivity
Process Uptime  | Any restart | Alert immediately
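
The threshold table above can be expressed as data and checked in one pass. The following is a minimal sketch using only the standard library; the metric keys (`cpu_pct`, `memory_pct`, and so on) and the function names are assumptions made for illustration, and in practice the readings would come from whatever collector the system already uses.

```python
import shutil

# Thresholds copied from the System Health table. Values are percentages,
# except network latency, which is in milliseconds. Key names are hypothetical.
SYSTEM_THRESHOLDS = {
    "cpu_pct": (80.0, "Scale up / investigate"),
    "memory_pct": (85.0, "Restart / investigate"),
    "disk_pct": (90.0, "Clean up / expand"),
    "network_latency_ms": (100.0, "Check connectivity"),
}


def disk_usage_pct(path="/"):
    """Return used disk space for `path` as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_system_health(readings, thresholds=SYSTEM_THRESHOLDS):
    """Return (metric, value, action) for every reading above its threshold."""
    breaches = []
    for metric, value in readings.items():
        if metric not in thresholds:
            continue  # unknown metrics are skipped rather than guessed at
        limit, action = thresholds[metric]
        if value > limit:
            breaches.append((metric, value, action))
    return breaches


readings = {"cpu_pct": 91.0, "memory_pct": 62.0, "disk_pct": disk_usage_pct()}
for metric, value, action in check_system_health(readings):
    print(f"{metric}={value:.1f} -> {action}")
```

Keeping thresholds in a dictionary rather than in scattered `if` statements makes them easy to review and to tune when alert volume gets too high.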

Trading Metrics

Metric               | Threshold          | Action
---------------------|--------------------|----------------------
P&L deviation        | > 2σ from expected | Pause trading
Order rejection rate | > 5%               | Investigate broker
Fill rate            | < 90%              | Check order params
Latency              | > target           | Investigate network
Position limits      | > 90% of limit     | Alert risk manager
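
As an illustration of the first row, a P&L deviation check can be sketched as a z-score against recent history. This assumes that "expected" means the mean of a window of past P&L observations, which the table does not specify; the function names are hypothetical.

```python
from statistics import mean, stdev


def pnl_zscore(history, current):
    """Number of sample standard deviations `current` lies from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    sigma = stdev(history)
    if sigma == 0:
        raise ValueError("history has zero variance; z-score is undefined")
    return (current - mean(history)) / sigma


def should_pause_trading(history, current, max_sigma=2.0):
    """True when the absolute P&L deviation exceeds `max_sigma` standard deviations."""
    return abs(pnl_zscore(history, current)) > max_sigma


history = [100.0, 120.0, 80.0, 110.0, 90.0]  # made-up P&L observations
print(should_pause_trading(history, 140.0))  # deviation of about 2.5 sigma
print(should_pause_trading(history, 125.0))  # deviation of about 1.6 sigma
```

The check is two-sided on purpose: an unexpectedly large gain can signal a pricing or position error just as a loss can.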

Data Quality

Metric          | Threshold   | Action
----------------|-------------|--------------------
Data feed gap   | > 1 second  | Switch to backup
Price staleness | > 5 seconds | Alert
Missing fields  | > 0         | Investigate source
Volume anomaly  | > 3σ        | Alert
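
The gap, staleness, and missing-field checks are simple to state in code. The sketch below assumes tick timestamps are floats in seconds and that each tick is a dictionary; the `REQUIRED_FIELDS` schema and all function names are hypothetical.

```python
REQUIRED_FIELDS = ("symbol", "price", "volume", "timestamp")  # hypothetical schema


def find_feed_gaps(timestamps, max_gap=1.0):
    """Return (previous, current) pairs of consecutive ticks more than `max_gap` seconds apart."""
    return [
        (prev, curr)
        for prev, curr in zip(timestamps, timestamps[1:])
        if curr - prev > max_gap
    ]


def is_stale(last_tick_time, now, max_age=5.0):
    """True when the most recent price is older than `max_age` seconds."""
    return now - last_tick_time > max_age


def missing_fields(tick, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or None in `tick`."""
    return [name for name in required if tick.get(name) is None]


print(find_feed_gaps([0.0, 0.4, 0.9, 2.5, 2.6]))
print(is_stale(last_tick_time=10.0, now=15.5))
print(missing_fields({"symbol": "AAPL", "price": 189.2, "timestamp": 10.0}))
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the checks deterministic and easy to test against recorded data.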

Alerting Framework

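from datetime import datetime


class AlertManager:
    """Manage trading system alerts."""

    def __init__(self, channels):
        # Assumed shape: channel name -> callable that takes the alert dict,
        # e.g. {'slack': ..., 'email': ..., 'pagerduty': ...}
        self.channels = channels
        self.alert_history = []
        self.suppression_rules = {}  # dedup key -> time the alert was last sent

    def send_alert(self, severity, title, message, context=None):
        """Send alert to appropriate channels."""
        alert = {
            'timestamp': datetime.now(),
            'severity': severity,  # critical, warning, info
            'title': title,
            'message': message,
            'context': context
        }

        # Check suppression
        if self.is_suppressed(alert):
            return

        self.alert_history.append(alert)

        # Route to channels based on severity
        if severity == 'critical':
            self.send_to_all(alert)
        elif severity == 'warning':
            self.send_to(['slack', 'email'], alert)
        else:
            self.send_to(['slack'], alert)

    def send_to(self, names, alert):
        """Deliver the alert to the named channels, skipping unconfigured ones."""
        for name in names:
            if name in self.channels:
                self.channels[name](alert)

    def send_to_all(self, alert):
        """Deliver the alert to every configured channel."""
        self.send_to(list(self.channels), alert)

    def is_suppressed(self, alert):
        """Check if alert should be suppressed (deduplication)."""
        key = f"{alert['title']}:{alert['message']}"
        last_alert = self.suppression_rules.get(key)
        # total_seconds() includes whole days; .seconds alone would wrap after 24 hours
        if last_alert is not None and (alert['timestamp'] - last_alert).total_seconds() < 300:  # 5 min
            return True
        self.suppression_rules[key] = alert['timestamp']
        return False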
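
The five-minute suppression window is easier to reason about, and to test, when the current time is passed in rather than read from the clock. The following standalone sketch isolates just the deduplication logic; `AlertSuppressor` is a hypothetical helper and not part of any library.

```python
from datetime import datetime, timedelta


class AlertSuppressor:
    """Time-windowed deduplication keyed on title and message (hypothetical helper)."""

    def __init__(self, window=timedelta(minutes=5)):
        self.window = window
        self._last_sent = {}  # (title, message) -> time the alert was last sent

    def should_send(self, title, message, now):
        key = (title, message)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window; the window is not extended
        self._last_sent[key] = now
        return True


suppressor = AlertSuppressor()
start = datetime(2024, 1, 2, 9, 30)
print(suppressor.should_send("Feed gap", "primary feed silent", start))                         # True
print(suppressor.should_send("Feed gap", "primary feed silent", start + timedelta(minutes=2)))  # False
print(suppressor.should_send("Feed gap", "primary feed silent", start + timedelta(minutes=6)))  # True
```

Comparing `timedelta` objects directly avoids a common pitfall: `timedelta.seconds` holds only the seconds component and ignores whole days, so a duplicate arriving just over 24 hours later would look as if it arrived seconds later.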

Dashboard Components

┌──────────────────────────────────────────────────┐
│              TRADING DASHBOARD                   │
├────────────┬────────────┬──────────┬─────────────┤
│  P&L       │  Positions │  Orders  │  System     │
│            │            │          │             │
│ Today:     │ AAPL: 500  │ Pending: │ CPU: 45%    │
│ $12,450    │ GOOGL: 200 │ Filled:  │ MEM: 62%    │
│ MTD:       │ MSFT: 300  │ Rejected:│ DISK: 71%   │
│ $45,200    │            │          │ LAT: 12ms   │
│            │            │          │             │
│ [Chart]    │ [Table]    │ [List]   │ [Gauges]    │
├────────────┴────────────┴──────────┴─────────────┤
│  Alerts                                          │
│  [10:32] WARNING: High order rejection rate      │
│  [10:15] INFO: Data feed reconnected             │
│  [09:45] CRITICAL: Position limit exceeded       │
└──────────────────────────────────────────────────┘

Monitoring Stack

Prometheus + Grafana

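# prometheus.yml
scrape_configs:
  - job_name: 'trading_system'
    scrape_interval: 5s
    static_configs:
      # Port 8000 is an assumed exporter port. 9090 is the default port of
      # the Prometheus server itself, so it should not be used here.
      - targets: ['localhost:8000']

  - job_name: 'market_data'
    scrape_interval: 1s
    static_configs:
      # 9091 is also the Pushgateway default; pick another port if one runs on this host.
      - targets: ['localhost:9091']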

Custom Metrics

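from prometheus_client import Counter, Gauge, Histogram

# Define metrics
trades_total = Counter('trades_total', 'Total trades executed', ['symbol', 'side'])
pnl_gauge = Gauge('pnl_current', 'Current P&L')
# The default Histogram buckets are sized for seconds (0.005 to 10), so
# millisecond observations need explicit buckets. These bounds are illustrative.
latency_histogram = Histogram(
    'order_latency_ms', 'Order execution latency in milliseconds',
    buckets=(1, 5, 10, 25, 50, 100, 250, 500, 1000),
)
position_gauge = Gauge('position_size', 'Current position size', ['symbol'])

# Record metrics
def record_trade(symbol, side, pnl, latency):
    """Record one executed trade; `pnl` is expected to be the running P&L, `latency` is in ms."""
    trades_total.labels(symbol=symbol, side=side).inc()
    pnl_gauge.set(pnl)
    latency_histogram.observe(latency)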

Incident Response

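class IncidentResponse:
    """Handle trading system incidents."""

    def __init__(self, trading_engine, risk_engine, alert_manager):
        # Collaborators are injected rather than read from globals. Their
        # interfaces (pause(), get_positions(), send_alert()) are assumed
        # from the calls made below.
        self.trading_engine = trading_engine
        self.risk_engine = risk_engine
        self.alert_manager = alert_manager
        self.runbooks = {
            'data_feed_failure': self.handle_data_feed_failure,
            'order_rejection_spike': self.handle_order_rejections,
            'pnl_anomaly': self.handle_pnl_anomaly,
            'position_limit_breach': self.handle_position_limit,
        }

    def handle_incident(self, incident_type, context):
        """Execute incident response runbook."""
        handler = self.runbooks.get(incident_type)
        if handler:
            handler(context)
        else:
            self.escalate(incident_type, context)

    def handle_pnl_anomaly(self, context):
        """Respond to unusual P&L movement."""
        # 1. Pause trading
        self.trading_engine.pause()
        # 2. Check positions
        positions = self.risk_engine.get_positions()
        # 3. Verify data (check_data_quality is a site-specific helper, not shown)
        data_quality = check_data_quality()
        # 4. Alert team, attaching what steps 2 and 3 found
        self.alert_manager.send_alert(
            'critical', 'P&L Anomaly', 'Trading paused pending investigation',
            context={'incident': context, 'positions': positions,
                     'data_quality': data_quality},
        )
        # 5. Investigate (investigate is a site-specific helper, not shown)
        investigate(context)

    def handle_data_feed_failure(self, context):
        raise NotImplementedError  # runbook not shown in this section

    def handle_order_rejections(self, context):
        raise NotImplementedError  # runbook not shown in this section

    def handle_position_limit(self, context):
        raise NotImplementedError  # runbook not shown in this section

    def escalate(self, incident_type, context):
        """No runbook matched: notify a human."""
        self.alert_manager.send_alert(
            'critical', 'Unhandled incident', str(incident_type), context=context)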
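
The runbook lookup with an escalation fallback can be tried out in isolation. The sketch below is a stripped-down dispatcher with stand-in handlers; `RunbookDispatcher`, the incident names, and the context keys are all hypothetical. It also escalates when a runbook itself raises, which is an added assumption about the desired behavior.

```python
class RunbookDispatcher:
    """Map incident types to handlers and escalate anything unhandled (hypothetical helper)."""

    def __init__(self, escalate):
        self._runbooks = {}
        self._escalate = escalate

    def register(self, incident_type, handler):
        self._runbooks[incident_type] = handler

    def dispatch(self, incident_type, context):
        handler = self._runbooks.get(incident_type)
        if handler is None:
            return self._escalate(incident_type, context)
        try:
            return handler(context)
        except Exception as exc:
            # A failing runbook must not hide the incident: escalate with the error attached.
            return self._escalate(incident_type, {**context, "runbook_error": repr(exc)})


log = []
dispatcher = RunbookDispatcher(escalate=lambda kind, ctx: log.append(("escalated", kind)))
dispatcher.register("pnl_anomaly", lambda ctx: log.append(("paused", ctx["strategy"])))

dispatcher.dispatch("pnl_anomaly", {"strategy": "mean_reversion"})
dispatcher.dispatch("exchange_outage", {})
print(log)
```

Registering handlers explicitly, rather than binding them in the constructor, means an incident type without a written runbook falls through to escalation instead of failing at startup.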

Practical Guidelines

  1. Monitor Everything: You can't fix what you don't know about
  2. Set Sensible Thresholds: Too many alerts cause alert fatigue
  3. Deduplicate: Suppress repeated alerts
  4. Have Runbooks: Document response procedures
  5. Test Alerts: Regularly verify that alerting works end to end
  6. Dashboard for Humans: Make it readable at a glance
  7. Escalation Path: Know who to call when things break
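
Guideline 5 can be automated with a periodic self-test: send a uniquely tagged alert through a channel and confirm that it arrives. The sketch below assumes the channel exposes some way to send a message and some way to read back recent messages; `verify_alert_channel`, `send`, and `fetch_recent` are hypothetical names, and an in-memory list stands in for a real channel.

```python
import uuid


def verify_alert_channel(send, fetch_recent):
    """Send a uniquely tagged test alert and report whether the channel delivered it."""
    token = f"alert-selftest-{uuid.uuid4().hex}"
    send(token)
    return any(token in message for message in fetch_recent())


# In-memory stand-in for a real channel such as a chat webhook or mailbox.
inbox = []
print(verify_alert_channel(send=inbox.append, fetch_recent=lambda: inbox))    # working channel
print(verify_alert_channel(send=lambda message: None, fetch_recent=lambda: []))  # broken channel
```

The unique token matters: without it, an old test message still sitting in the channel could make a broken pipeline look healthy.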
