Monitoring and Alerting¶

Difficulty: expert

Overview¶

Real-time monitoring and alerting verify that a trading system is behaving as intended and surface anomalies quickly enough to act on them before losses accumulate.

What to Monitor¶

System Health¶

Metric	Threshold	Action
CPU Usage	> 80%	Scale up / investigate
Memory Usage	> 85%	Restart / investigate
Disk Usage	> 90%	Clean up / expand
Network Latency	> 100ms	Check connectivity
Process Uptime	Any restart	Alert immediately
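
The thresholds in this table can be kept as data so that they live in one place. The following is a minimal sketch, assuming readings arrive as a plain dict; the names `Rule`, `SYSTEM_RULES`, `evaluate`, and the metric keys are hypothetical, and collecting the readings themselves is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str                        # key in the readings dict (hypothetical names)
    breached: Callable[[float], bool]  # True when the threshold is crossed
    action: str                        # suggested response from the table


# Thresholds copied from the System Health table.
SYSTEM_RULES = [
    Rule("cpu_pct", lambda v: v > 80, "Scale up / investigate"),
    Rule("memory_pct", lambda v: v > 85, "Restart / investigate"),
    Rule("disk_pct", lambda v: v > 90, "Clean up / expand"),
    Rule("network_latency_ms", lambda v: v > 100, "Check connectivity"),
    Rule("restarts", lambda v: v > 0, "Alert immediately"),
]


def evaluate(readings: Dict[str, float], rules: List[Rule] = SYSTEM_RULES) -> List[str]:
    """Return one message per breached rule; metrics absent from readings are skipped."""
    messages = []
    for rule in rules:
        value = readings.get(rule.metric)
        if value is not None and rule.breached(value):
            messages.append(f"{rule.metric}={value}: {rule.action}")
    return messages


breaches = evaluate({"cpu_pct": 91.0, "memory_pct": 62.0, "disk_pct": 71.0, "restarts": 0})
print(breaches)  # ['cpu_pct=91.0: Scale up / investigate']
```

Keeping the rules in a list makes it straightforward to review thresholds alongside the table and to unit-test them without a running system.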

Trading Metrics¶

Metric	Threshold	Action
P&L deviation	> 2σ from expected	Pause trading
Order rejection rate	> 5%	Investigate broker
Fill rate	< 90%	Check order params
Order latency	> configured target	Investigate network
Position limits	> 90% of limit	Alert risk manager
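
As a worked example of the P&L and rejection-rate rows, the sketch below measures how many standard deviations the current P&L sits from its historical mean. It assumes that "expected" means the mean of recent daily P&L and that σ is the sample standard deviation of that history; both are assumptions, and the function names are hypothetical.

```python
from statistics import mean, stdev


def pnl_zscore(history, current):
    """Sample standard deviations between current P&L and the mean of history."""
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    sigma = stdev(history)
    if sigma == 0:
        raise ValueError("history has zero variance")
    return (current - mean(history)) / sigma


def should_pause(history, current, max_sigma=2.0):
    """True when P&L deviates from the historical mean by more than max_sigma."""
    return abs(pnl_zscore(history, current)) > max_sigma


def rejection_rate(rejected, submitted):
    """Fraction of submitted orders that were rejected (0.0 if none were submitted)."""
    return rejected / submitted if submitted else 0.0


daily_pnl = [100.0, 120.0, 80.0, 110.0, 90.0]   # mean 100, sample stdev about 15.8
print(should_pause(daily_pnl, 140.0))            # True  (about 2.53 sigma)
print(should_pause(daily_pnl, 125.0))            # False (about 1.58 sigma)
print(rejection_rate(6, 100) > 0.05)             # True: above the 5% threshold
```

The check uses the absolute deviation, so an unexpectedly large gain pauses trading just as a loss does; a gain far outside the expected range can indicate a pricing or position error.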

Data Quality¶

Metric	Threshold	Action
Data feed gap	> 1 second	Switch to backup
Price staleness	> 5 seconds	Alert
Missing fields	> 0	Investigate source
Volume anomaly	> 3σ	Alert
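
The gap, staleness, and missing-field checks can be sketched with the standard library alone. This assumes each tick is a dict carrying a `timestamp`; the set of required field names is a hypothetical example, not taken from any particular feed.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(seconds=1)        # data feed gap threshold from the table
MAX_STALENESS = timedelta(seconds=5)  # price staleness threshold from the table


def find_gaps(tick_times, max_gap=MAX_GAP):
    """Return (previous, current) pairs of consecutive ticks more than max_gap apart."""
    return [(prev, cur) for prev, cur in zip(tick_times, tick_times[1:])
            if cur - prev > max_gap]


def is_stale(last_tick, now, max_staleness=MAX_STALENESS):
    """True when the most recent tick is older than max_staleness."""
    return now - last_tick > max_staleness


def missing_fields(tick, required=("symbol", "price", "volume", "timestamp")):
    """Return the required fields that are absent or None in a tick."""
    return [name for name in required if tick.get(name) is None]


t0 = datetime(2024, 1, 2, 9, 30, 0)
times = [t0 + timedelta(seconds=s) for s in (0.0, 0.4, 2.0, 2.3)]
gaps = find_gaps(times)
print(len(gaps))                                     # 1 (the 1.6 s gap)
print(is_stale(times[-1], t0 + timedelta(seconds=8)))  # True (5.7 s old)
print(missing_fields({"symbol": "AAPL", "price": 187.2, "timestamp": t0}))  # ['volume']
```

Passing `now` explicitly, rather than reading the clock inside `is_stale`, keeps the check deterministic in tests and in replay.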

Alerting Framework¶

from datetime import datetime


class AlertManager:
    """Manage trading system alerts."""

    def __init__(self, channels):
        # Assumed shape: channel name -> callable that accepts the alert dict,
        # e.g. {'slack': ..., 'email': ..., 'pagerduty': ...}
        self.channels = channels
        self.alert_history = []
        self.suppression_rules = {}  # dedup key -> time the alert was last sent

    def send_alert(self, severity, title, message, context=None):
        """Send alert to appropriate channels."""
        alert = {
            'timestamp': datetime.now(),
            'severity': severity,  # critical, warning, info
            'title': title,
            'message': message,
            'context': context
        }

        # Check suppression
        if self.is_suppressed(alert):
            return

        self.alert_history.append(alert)

        # Route to channels based on severity
        if severity == 'critical':
            self.send_to_all(alert)
        elif severity == 'warning':
            self.send_to(['slack', 'email'], alert)
        else:
            self.send_to(['slack'], alert)

    def send_to(self, channel_names, alert):
        """Deliver the alert to each named channel."""
        for name in channel_names:
            self.channels[name](alert)

    def send_to_all(self, alert):
        """Deliver the alert to every configured channel."""
        self.send_to(list(self.channels), alert)

    def is_suppressed(self, alert):
        """Check if alert should be suppressed (deduplication)."""
        key = f"{alert['title']}:{alert['message']}"
        if key in self.suppression_rules:
            last_alert = self.suppression_rules[key]
            # total_seconds() covers the whole interval; .seconds wraps every 24 hours
            if (alert['timestamp'] - last_alert).total_seconds() < 300:  # 5 min
                return True
        self.suppression_rules[key] = alert['timestamp']
        return False
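
The deduplication logic above reads the wall clock directly, which makes the five-minute window awkward to test. One possible variation, sketched here with hypothetical names (`Deduplicator`, `FakeClock`), injects the clock and keys on a stable identifier rather than the full message text, so that messages containing changing numbers still deduplicate.

```python
from datetime import datetime, timedelta


class Deduplicator:
    """Suppress repeats of the same key inside a time window."""

    def __init__(self, window=timedelta(minutes=5), clock=datetime.now):
        self.window = window
        self.clock = clock          # injectable so tests can control time
        self._last_sent = {}        # key -> time the alert was last sent

    def should_send(self, key):
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False            # still inside the suppression window
        self._last_sent[key] = now
        return True


class FakeClock:
    """Test double that returns a manually advanced time."""

    def __init__(self, start):
        self.now = start

    def __call__(self):
        return self.now

    def advance(self, **kwargs):
        self.now += timedelta(**kwargs)


clock = FakeClock(datetime(2024, 1, 2, 10, 0, 0))
dedup = Deduplicator(clock=clock)

results = [dedup.should_send("order_rejection_rate")]    # first occurrence
clock.advance(minutes=2)
results.append(dedup.should_send("order_rejection_rate"))  # 2 min later
clock.advance(minutes=4)
results.append(dedup.should_send("order_rejection_rate"))  # 6 min after first send
print(results)  # [True, False, True]
```

The window is measured from the last alert that was actually sent, so a condition that keeps firing produces one alert per window instead of being silenced indefinitely.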

Dashboard Components¶

┌──────────────────────────────────────────────────┐
│              TRADING DASHBOARD                    │
├────────────┬────────────┬──────────┬─────────────┤
│  P&L       │  Positions │  Orders  │  System     │
│            │            │          │             │
│ Today:     │ AAPL: 500  │ Pending: │ CPU: 45%    │
│ $12,450    │ GOOGL: 200 │ Filled:  │ MEM: 62%    │
│ MTD:       │ MSFT: 300  │ Rejected:│ DISK: 71%   │
│ $45,200    │            │          │ LAT: 12ms   │
│            │            │          │             │
│ [Chart]    │ [Table]    │ [List]   │ [Gauges]    │
├────────────┴────────────┴──────────┴─────────────┤
│  Alerts                                          │
│  [10:32] WARNING: High order rejection rate      │
│  [10:15] INFO: Data feed reconnected             │
│  [09:45] CRITICAL: Position limit exceeded       │
└──────────────────────────────────────────────────┘

Monitoring Stack¶

Prometheus + Grafana¶

# prometheus.yml
scrape_configs:
  - job_name: 'trading_system'
    scrape_interval: 5s
    static_configs:
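      # Port 9090 is also the Prometheus server's own default; use a different
      # port for the trading system if both run on the same host.
      - targets: ['localhost:9090']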

  - job_name: 'market_data'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9091']

Custom Metrics¶

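from prometheus_client import Counter, Gauge, Histogram

# Define metrics
trades_total = Counter('trades_total', 'Total trades executed', ['symbol', 'side'])
pnl_gauge = Gauge('pnl_current', 'Current P&L')
# The default histogram buckets are tuned for seconds (0.005 to 10), so a
# millisecond histogram needs explicit boundaries. These values are illustrative.
latency_histogram = Histogram(
    'order_latency_ms', 'Order execution latency in milliseconds',
    buckets=(1, 5, 10, 25, 50, 100, 250, 500, 1000),
)
# Update wherever positions change: position_gauge.labels(symbol=...).set(...)
position_gauge = Gauge('position_size', 'Current position size', ['symbol'])

# Record metrics
def record_trade(symbol, side, pnl, latency):
    """Record one executed trade; pnl is the current total P&L, latency is in ms."""
    trades_total.labels(symbol=symbol, side=side).inc()
    pnl_gauge.set(pnl)
    latency_histogram.observe(latency)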
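
The `position_size` gauge defined above is never updated by `record_trade`. The sketch below shows one way it might be set on each fill, and what Prometheus would see when it scrapes the process. It uses a private `CollectorRegistry` so the example is isolated from the default registry, and records latency in seconds, which is the base unit Prometheus naming conventions recommend. The function `record_fill` and the bucket boundaries are hypothetical choices.

```python
from prometheus_client import CollectorRegistry, Gauge, Histogram, generate_latest

# A private registry keeps this example separate from the process-wide default.
registry = CollectorRegistry()

position_size = Gauge(
    'position_size', 'Current position size', ['symbol'], registry=registry
)
order_latency = Histogram(
    'order_latency_seconds', 'Order execution latency in seconds',
    buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0),  # illustrative boundaries
    registry=registry,
)


def record_fill(symbol, position_after, latency_seconds):
    """Update the per-symbol position and record the order latency."""
    position_size.labels(symbol=symbol).set(position_after)
    order_latency.observe(latency_seconds)


record_fill('AAPL', 500, 0.012)

# Render the text exposition format that a scrape would return.
exposition = generate_latest(registry).decode()
print(exposition)
```

In a running service, `prometheus_client.start_http_server(port)` serves the default registry over HTTP; the port passed there needs to match the `targets` entry in `prometheus.yml`.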

Incident Response¶

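class IncidentResponse:
    """Handle trading system incidents."""

    # trading_engine, risk_engine, alert_manager, check_data_quality and
    # investigate are assumed to be provided by the surrounding application.

    def __init__(self):
        self.runbooks = {
            'data_feed_failure': self.handle_data_feed_failure,
            'order_rejection_spike': self.handle_order_rejections,
            'pnl_anomaly': self.handle_pnl_anomaly,
            'position_limit_breach': self.handle_position_limit,
        }

    def handle_incident(self, incident_type, context):
        """Execute the runbook for incident_type, or escalate if none exists."""
        handler = self.runbooks.get(incident_type)
        if handler:
            handler(context)
        else:
            self.escalate(incident_type, context)

    def handle_pnl_anomaly(self, context):
        """Respond to unusual P&L movement."""
        # 1. Pause trading
        trading_engine.pause()
        # 2. Check positions
        positions = risk_engine.get_positions()
        # 3. Verify data
        data_quality = check_data_quality()
        # 4. Alert team, attaching what steps 2 and 3 gathered
        alert_manager.send_alert(
            'critical', 'P&L Anomaly',
            'Trading paused after unusual P&L movement',
            context={'incident': context, 'positions': positions,
                     'data_quality': data_quality},
        )
        # 5. Investigate
        investigate(context)

    # Placeholder runbooks: escalate to a human until a real procedure is written.
    def handle_data_feed_failure(self, context):
        self.escalate('data_feed_failure', context)

    def handle_order_rejections(self, context):
        self.escalate('order_rejection_spike', context)

    def handle_position_limit(self, context):
        self.escalate('position_limit_breach', context)

    def escalate(self, incident_type, context):
        """No automated runbook applies: page a human."""
        alert_manager.send_alert(
            'critical', f'Unhandled incident: {incident_type}',
            'No automated runbook is available; manual response required',
            context=context,
        )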
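
A runbook is easier to trust when it can be exercised without a live trading engine. The sketch below rewrites the P&L anomaly steps as a function that receives its collaborators as arguments, then runs it against stubs that record the order of calls. All stub classes and the function signature are hypothetical; only the method names `pause`, `get_positions`, and `send_alert` come from the code above.

```python
class StubEngine:
    """Stands in for the trading engine; records that pause() was called."""

    def __init__(self, log):
        self.log = log
        self.paused = False

    def pause(self):
        self.paused = True
        self.log.append("pause")


class StubRisk:
    """Stands in for the risk engine; returns a fixed set of positions."""

    def __init__(self, log):
        self.log = log

    def get_positions(self):
        self.log.append("get_positions")
        return {"AAPL": 500}


class StubAlerts:
    """Stands in for the alert manager; captures alerts instead of sending them."""

    def __init__(self, log):
        self.log = log
        self.sent = []

    def send_alert(self, severity, title, message, context=None):
        self.log.append("send_alert")
        self.sent.append((severity, title, message, context))


def handle_pnl_anomaly(engine, risk, alerts, context):
    """Pause first, then gather positions, then alert with the gathered context."""
    engine.pause()
    positions = risk.get_positions()
    alerts.send_alert(
        "critical", "P&L Anomaly",
        "Trading paused after unusual P&L movement",
        context={**context, "positions": positions},
    )


log = []
engine, risk, alerts = StubEngine(log), StubRisk(log), StubAlerts(log)
handle_pnl_anomaly(engine, risk, alerts, {"zscore": 2.7})
print(log)  # ['pause', 'get_positions', 'send_alert']
```

Asserting on the call order checks the property that matters most here: trading is paused before anything slower, such as gathering positions or notifying people, is attempted.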

Practical Guidelines¶

Monitor Everything — You can't fix what you don't know about
Set Sensible Thresholds — Too many alerts = alert fatigue
Deduplicate — Suppress repeated alerts
Have Runbooks — Document response procedures
Test Alerts — Regularly verify alerting works
Dashboard for Humans — Make it readable at a glance
Escalation Path — Know who to call when things break
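
An escalation path can also be written down as data rather than kept in people's heads. A minimal sketch, assuming alerts carry the time they have gone unacknowledged; the roles and delays in `ESCALATION_POLICY` are hypothetical placeholders.

```python
from datetime import timedelta

# Hypothetical policy: (time the alert has gone unacknowledged, who to notify).
ESCALATION_POLICY = [
    (timedelta(minutes=0), "on-call engineer"),
    (timedelta(minutes=5), "trading desk lead"),
    (timedelta(minutes=15), "head of risk"),
]


def contacts_to_notify(unacknowledged_for, policy=ESCALATION_POLICY):
    """Return every contact whose tier has been reached for an unacknowledged alert."""
    return [who for after, who in policy if unacknowledged_for >= after]


print(contacts_to_notify(timedelta(minutes=7)))
# ['on-call engineer', 'trading desk lead']
```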

Next Steps¶

System Design — Full system architecture
Live Deployment — Going live with monitoring
Technology Stack — Infrastructure choices