Monitoring and Alerting
Difficulty: expert
Overview
Real-time monitoring and alerting verify that a trading system is operating correctly and surface anomalies quickly enough to act on them before losses compound.
What to Monitor
System Health
| Metric | Threshold | Action |
|---|---|---|
| CPU Usage | > 80% | Scale up / investigate |
| Memory Usage | > 85% | Restart / investigate |
| Disk Usage | > 90% | Clean up / expand |
| Network Latency | > 100ms | Check connectivity |
| Process Uptime | Any restart | Alert immediately |
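The thresholds in the table can be kept as data and evaluated against a metrics snapshot. The following is a minimal sketch: the `SYSTEM_THRESHOLDS` mapping, the metric key names, and the `check_system_health` function are illustrative names rather than part of any particular library, and collecting the snapshot itself (for example from an agent or exporter) is left out.

```python
# Hypothetical threshold table mirroring the System Health metrics.
# Key names and the snapshot format are assumptions made for this sketch.
SYSTEM_THRESHOLDS = {
    "cpu_percent": (80.0, "Scale up / investigate"),
    "memory_percent": (85.0, "Restart / investigate"),
    "disk_percent": (90.0, "Clean up / expand"),
    "network_latency_ms": (100.0, "Check connectivity"),
}


def check_system_health(snapshot, restarts_since_last_check=0):
    """Return (metric, value, action) tuples for every breached threshold."""
    breaches = []
    for metric, (limit, action) in SYSTEM_THRESHOLDS.items():
        value = snapshot.get(metric)
        # A missing reading is skipped here; a real system may want to alert on it.
        if value is not None and value > limit:
            breaches.append((metric, value, action))
    # Any process restart is reported regardless of the other readings.
    if restarts_since_last_check > 0:
        breaches.append(
            ("process_restarts", restarts_since_last_check, "Alert immediately")
        )
    return breaches


snapshot = {
    "cpu_percent": 91.5,
    "memory_percent": 62.0,
    "disk_percent": 71.0,
    "network_latency_ms": 12.0,
}
for metric, value, action in check_system_health(snapshot):
    print(f"{metric}={value}: {action}")  # cpu_percent=91.5: Scale up / investigate
```

Keeping the limits in one mapping makes it easy to review them alongside the table and to change a threshold without touching the evaluation logic.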
Trading Metrics
| Metric | Threshold | Action |
|---|---|---|
| P&L deviation | > 2σ from expected | Pause trading |
| Order rejection rate | > 5% | Investigate broker |
| Fill rate | < 90% | Check order params |
| Latency | > target | Investigate network |
| Position limits | > 90% of limit | Alert risk manager |
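The "2σ from expected" rule for P&L can be read as a z-score test. The sketch below assumes that "expected" means the mean of a recent P&L history and that σ is the sample standard deviation of that history; both are interpretations, and the function names `pnl_zscore` and `should_pause_trading` are hypothetical.

```python
import statistics


def pnl_zscore(history, current):
    """Z-score of `current` against the mean and sample stdev of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        raise ValueError("history has zero variance; z-score is undefined")
    return (current - mean) / stdev


def should_pause_trading(history, current, max_sigma=2.0):
    """True when P&L deviates by more than `max_sigma` standard deviations."""
    return abs(pnl_zscore(history, current)) > max_sigma


# Assumed example data: five prior P&L observations with mean 100.
history = [100, 120, 80, 110, 90]
print(should_pause_trading(history, 140))  # True  (z is about 2.53)
print(should_pause_trading(history, 125))  # False (z is about 1.58)
```

The check is symmetric on purpose: an unexpectedly large gain is as much a sign of a mispriced position or bad data as an unexpectedly large loss.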
Data Quality
| Metric | Threshold | Action |
|---|---|---|
| Data feed gap | > 1 second | Switch to backup |
| Price staleness | > 5 seconds | Alert |
| Missing fields | > 0 | Investigate source |
| Volume anomaly | > 3σ | Alert |
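The gap, staleness, and missing-field checks can all be applied per tick. This is a sketch under stated assumptions: ticks are dictionaries, the `REQUIRED_FIELDS` schema is invented for illustration, and `check_tick` is a hypothetical helper. The volume anomaly check is omitted because it needs a rolling history rather than a single tick.

```python
from datetime import datetime, timedelta

REQUIRED_FIELDS = ("symbol", "price", "volume", "timestamp")  # assumed schema
MAX_FEED_GAP = timedelta(seconds=1)
MAX_STALENESS = timedelta(seconds=5)


def check_tick(tick, previous_timestamp, now):
    """Return a list of data-quality issues for one tick (empty if clean)."""
    issues = []
    missing = [f for f in REQUIRED_FIELDS if tick.get(f) is None]
    if missing:
        issues.append(f"missing fields: {', '.join(missing)}")
    ts = tick.get("timestamp")
    if ts is not None:
        # Gap: time between consecutive ticks from the feed.
        if previous_timestamp is not None and ts - previous_timestamp > MAX_FEED_GAP:
            issues.append("feed gap > 1s: switch to backup")
        # Staleness: age of the tick relative to the current wall-clock time.
        if now - ts > MAX_STALENESS:
            issues.append("price stale > 5s: alert")
    return issues


t0 = datetime(2024, 1, 2, 9, 30, 0)
tick = {"symbol": "AAPL", "price": 190.0, "volume": 100,
        "timestamp": t0 + timedelta(seconds=2)}
print(check_tick(tick, previous_timestamp=t0, now=t0 + timedelta(seconds=2)))
# ['feed gap > 1s: switch to backup']
```

Passing `now` explicitly, instead of calling `datetime.now()` inside the function, keeps the check deterministic and easy to test against recorded data.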
Alerting Framework
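from datetime import datetime, timedelta


class AlertManager:
    """Manage trading system alerts."""

    SUPPRESSION_WINDOW = timedelta(minutes=5)

    def __init__(self, channels):
        # Assumed shape: channel name -> callable taking the alert dict,
        # e.g. {'slack': ..., 'email': ..., 'pagerduty': ...}
        self.channels = channels
        self.alert_history = []
        self.suppression_rules = {}  # dedup key -> time the alert was last sent

    def send_alert(self, severity, title, message, context=None):
        """Send alert to appropriate channels."""
        alert = {
            'timestamp': datetime.now(),
            'severity': severity,  # critical, warning, info
            'title': title,
            'message': message,
            'context': context,
        }

        # Drop repeats of a recently sent alert
        if self.is_suppressed(alert):
            return

        self.alert_history.append(alert)

        # Route to channels based on severity
        if severity == 'critical':
            self.send_to_all(alert)
        elif severity == 'warning':
            self.send_to(['slack', 'email'], alert)
        else:
            self.send_to(['slack'], alert)

    def send_to(self, names, alert):
        """Deliver to the named channels, skipping any that are not configured."""
        for name in names:
            channel = self.channels.get(name)
            if channel is not None:
                channel(alert)

    def send_to_all(self, alert):
        """Deliver to every configured channel."""
        self.send_to(list(self.channels), alert)

    def is_suppressed(self, alert):
        """Check if alert should be suppressed (deduplication)."""
        key = f"{alert['title']}:{alert['message']}"
        last_alert = self.suppression_rules.get(key)
        # Compare timedeltas directly: timedelta.seconds ignores whole days
        if last_alert is not None and alert['timestamp'] - last_alert < self.SUPPRESSION_WINDOW:
            return True
        self.suppression_rules[key] = alert['timestamp']
        return False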
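The five-minute deduplication window is easier to reason about, and to test, when the current time is passed in explicitly. The standalone `SuppressionWindow` class below is a hypothetical helper written for this illustration, not part of the alert manager itself. It compares `timedelta` objects directly, which avoids a common pitfall: `timedelta.seconds` holds only the seconds component and drops whole days.

```python
from datetime import datetime, timedelta


class SuppressionWindow:
    """Suppress repeats of the same key inside a fixed time window."""

    def __init__(self, window=timedelta(minutes=5)):
        self.window = window
        self._last_sent = {}  # key -> timestamp of the last alert let through

    def should_suppress(self, key, now):
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return True
        # Only alerts that are actually sent restart the window.
        self._last_sent[key] = now
        return False


window = SuppressionWindow()
t0 = datetime(2024, 1, 2, 10, 0, 0)
key = "High order rejection rate:rate=7%"
print(window.should_suppress(key, t0))                         # False: first occurrence
print(window.should_suppress(key, t0 + timedelta(minutes=4)))  # True: inside the window
print(window.should_suppress(key, t0 + timedelta(minutes=6)))  # False: window has expired
```

Because suppressed repeats do not restart the window, a condition that persists still produces one alert roughly every five minutes instead of going silent.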
Dashboard Components
┌──────────────────────────────────────────────────┐
│ TRADING DASHBOARD │
├────────────┬────────────┬──────────┬─────────────┤
│ P&L │ Positions │ Orders │ System │
│ │ │ │ │
│ Today: │ AAPL: 500 │ Pending: │ CPU: 45% │
│ $12,450 │ GOOGL: 200 │ Filled: │ MEM: 62% │
│ MTD: │ MSFT: 300 │ Rejected:│ DISK: 71% │
│ $45,200 │ │ │ LAT: 12ms │
│ │ │ │ │
│ [Chart] │ [Table] │ [List] │ [Gauges] │
├────────────┴────────────┴──────────┴─────────────┤
│ Alerts │
│ [10:32] WARNING: High order rejection rate │
│ [10:15] INFO: Data feed reconnected │
│ [09:45] CRITICAL: Position limit exceeded │
└──────────────────────────────────────────────────┘
Monitoring Stack
Prometheus + Grafana
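# prometheus.yml
scrape_configs:
  - job_name: 'trading_system'
    scrape_interval: 5s
    static_configs:
      # Port where the trading system exposes /metrics. Note that 9090 is also
      # Prometheus's own default listen port, so pick another if they share a host.
      - targets: ['localhost:9090']
  - job_name: 'market_data'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9091']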
Custom Metrics
from prometheus_client import Counter, Gauge, Histogram

# Define metrics
trades_total = Counter('trades_total', 'Total trades executed', ['symbol', 'side'])
pnl_gauge = Gauge('pnl_current', 'Current P&L')
# The default buckets are sized for seconds; these millisecond buckets are illustrative
latency_histogram = Histogram(
    'order_latency_ms', 'Order execution latency in milliseconds',
    buckets=[1, 5, 10, 25, 50, 100, 250, 500, 1000],
)
position_gauge = Gauge('position_size', 'Current position size', ['symbol'])


# Record metrics
def record_trade(symbol, side, pnl, latency):
    """Record one executed trade; latency is in milliseconds."""
    trades_total.labels(symbol=symbol, side=side).inc()
    pnl_gauge.set(pnl)
    latency_histogram.observe(latency)
Incident Response
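class IncidentResponse:
    """Handle trading system incidents."""

    def __init__(self, trading_engine, risk_engine, alert_manager):
        # Collaborators are injected rather than read from globals. This sketch
        # assumes they provide pause(), get_positions() and send_alert().
        self.trading_engine = trading_engine
        self.risk_engine = risk_engine
        self.alert_manager = alert_manager
        self.runbooks = {
            'data_feed_failure': self.handle_data_feed_failure,
            'order_rejection_spike': self.handle_order_rejections,
            'pnl_anomaly': self.handle_pnl_anomaly,
            'position_limit_breach': self.handle_position_limit,
        }

    def handle_incident(self, incident_type, context):
        """Execute incident response runbook."""
        handler = self.runbooks.get(incident_type)
        if handler:
            handler(context)
        else:
            self.escalate(incident_type, context)

    def escalate(self, incident_type, context):
        """No runbook matched: page a human."""
        self.alert_manager.send_alert(
            'critical', f'Unhandled incident: {incident_type}',
            'No runbook is registered for this incident type', context)

    def handle_pnl_anomaly(self, context):
        """Respond to unusual P&L movement."""
        # 1. Pause trading
        self.trading_engine.pause()
        # 2. Check positions
        positions = self.risk_engine.get_positions()
        # 3. Verify data (check_data_quality is a site-specific helper, not shown)
        # 4. Alert team, attaching the position snapshot for the investigation
        self.alert_manager.send_alert(
            'critical', 'P&L Anomaly', 'Trading paused pending investigation',
            {'incident': context, 'positions': positions})
        # 5. Investigate (manual follow-up)

    # Remaining runbooks are placeholders so the dispatch table resolves.
    def handle_data_feed_failure(self, context):
        self.escalate('data_feed_failure', context)

    def handle_order_rejections(self, context):
        self.escalate('order_rejection_spike', context)

    def handle_position_limit(self, context):
        self.escalate('position_limit_breach', context)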
Practical Guidelines
- Monitor Everything — You can't fix what you don't know about
- Set Sensible Thresholds — Too many alerts = alert fatigue
- Deduplicate — Suppress repeated alerts
- Have Runbooks — Document response procedures
- Test Alerts — Regularly verify alerting works
- Dashboard for Humans — Make it readable at a glance
- Escalation Path — Know who to call when things break
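The "Test Alerts" guideline can be automated with a periodic synthetic alert. The sketch below assumes each channel object exposes a `send(alert)` method, which is an assumption of this example rather than a documented interface; `RecordingChannel` and `run_alert_self_test` are hypothetical names.

```python
class RecordingChannel:
    """Test double that stores alerts instead of delivering them."""

    def __init__(self):
        self.received = []

    def send(self, alert):
        self.received.append(alert)


def run_alert_self_test(channels):
    """Send a synthetic alert to every channel; return the names that failed."""
    probe = {
        "severity": "info",
        "title": "ALERT SELF-TEST",
        "message": "Synthetic probe, no action required",
    }
    failed = []
    for name, channel in channels.items():
        try:
            channel.send(probe)
        except Exception:
            # Any delivery error counts as a failed channel for this check.
            failed.append(name)
    return failed


slack = RecordingChannel()
print(run_alert_self_test({"slack": slack}))  # [] means every channel accepted the probe
print(slack.received[0]["title"])             # ALERT SELF-TEST
```

Running a probe like this on a schedule, and paging through a separate path when it fails, catches the case where the alerting pipeline itself has silently broken.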
Next Steps