Observability and Alerting
Observability determines whether you catch latency and reliability regressions early.
Signals to Track
Minimum baseline:
- session setup success/failure
- handshake error rate
- direct vs relay ratio
- p95/p99 latency indicators
- session drop/disconnect rates
Linux host focus:
- portal preflight failures
- PipeWire/session backend failures
- host encoder/capture startup failures
Log Strategy
Use structured logs and include:
- timestamp with timezone
- component (
gateway,relay,host,client) - session/correlation identifiers when available
- clear error category and action context
Do not log:
- secrets/tokens/private key material
- sensitive payload data
Alert Baseline
Create alerts for:
- gateway health endpoint failure
- auth failure surge
- handshake failure surge
- relay registration/heartbeat anomalies
- sudden direct-to-relay ratio shift
Tune thresholds per environment tier.
Dashboard Baseline
At minimum, maintain dashboards for:
- control-plane health and request volume
- session quality and adaptation trends
- relay utilization and error states
- Linux host runtime health markers
Release Observability Checklist
Before rollout:
- baseline current metrics
- define rollback thresholds
- enable focused alerting during rollout window
After rollout:
- compare latency and failure deltas
- verify no hidden increase in relay usage
- confirm error distribution stability
Troubleshooting Correlation Tips
When an issue appears:
- align client, host, and control-plane timestamps
- compare before/after deployment windows
- isolate whether change is network, runtime, or control-plane induced