Operations
This page is the practical operator baseline for running Wavry in production-like environments.
Operational Objectives
- Keep interactive latency stable.
- Detect regressions quickly.
- Deploy safely with fast rollback.
- Maintain secure control-plane posture.
Deployment Model
Typical topology:
- Docker-only control plane (gateway, relay)
- host runtime pools (wavry-server)
- user-facing clients (desktop/mobile/web integrations)
Use separate scaling strategies for:
- control plane (API and registration behavior)
- data plane (session/media load)
Reliability Baseline (SLIs)
Track these minimum service indicators:
- session setup success rate
- handshake failure rate
- direct vs relay ratio
- p95/p99 input-to-present latency
- session drop rate
Set SLO targets per environment tier (dev/stage/prod).
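As an illustration of how these indicators might be derived from raw session data, here is a minimal sketch. The `SessionRecord` shape and its field names are assumptions made for this example, not a Wavry schema. Handshake failure rate is left out because it would need its own counter from the handshake layer.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionRecord:
    # Hypothetical record shape; field names are assumptions for this sketch.
    setup_ok: bool      # session setup completed successfully
    used_relay: bool    # True if traffic went through a relay, False if direct
    dropped: bool       # session ended abnormally after setup
    latency_ms: float   # input-to-present latency sample for the session


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(records):
    """Summarize one window of session records into baseline indicators."""
    if not records:
        raise ValueError("no session records in window")
    established = [r for r in records if r.setup_ok]
    if not established:
        # Nothing was established, so ratio and latency indicators are undefined.
        return {"setup_success_rate": 0.0}
    latencies = [r.latency_ms for r in established]
    return {
        "setup_success_rate": len(established) / len(records),
        "direct_ratio": sum(not r.used_relay for r in established) / len(established),
        "drop_rate": sum(r.dropped for r in established) / len(established),
        "p95_latency_ms": percentile(latencies, 95),
        "p99_latency_ms": percentile(latencies, 99),
    }
```

Computing the same summary per environment tier makes it straightforward to compare each tier against its own SLO target.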
Monitoring Baseline
Collect at minimum:
- control-plane health and request rates
- auth failures and rate-limit triggers
- relay registration/heartbeat health
- host runtime CPU/GPU pressure
- RTT/loss/jitter trend series
Alert examples:
- gateway health endpoint failing for > 2 minutes
- relay registrations drop below expected region baseline
- handshake failures spike above normal envelope
- direct-path ratio sharply drops after rollout
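The first alert example ("failing for > 2 minutes") is a sustained-failure condition rather than a single failed probe. A minimal sketch of that logic follows. The class name and the probe interface are hypothetical; in practice this rule would usually live in whatever alerting system is already deployed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedFailureAlert:
    """Fires only when probes have failed continuously for longer than threshold_s."""
    threshold_s: float = 120.0            # "> 2 minutes" from the alert example
    failing_since: Optional[float] = None  # timestamp of the first failure in the current streak

    def observe(self, timestamp_s: float, healthy: bool) -> bool:
        """Record one probe result; return True if the alert should be firing."""
        if healthy:
            # Any healthy probe ends the failure streak.
            self.failing_since = None
            return False
        if self.failing_since is None:
            self.failing_since = timestamp_s
        return timestamp_s - self.failing_since > self.threshold_s
```

Tracking the start of the failure streak, instead of counting failed probes, keeps the rule independent of the probe interval.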
Control-Plane Operating Discipline
For gateway/master/relay operations:
- Keep lease/signing key status visible in dashboards.
- Track relay state distribution (Active, Draining, Probation) over time.
- Alert on relay rejection reason spikes (signature mismatch, wrong relay id, expired lease).
- Keep an explicit drain-and-recover runbook for unstable relays.
- Treat gateway auth/signal errors as user-impacting indicators.
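To make the relay state and rejection-reason items concrete, the sketch below shows one possible way to compute a state distribution and flag rejection-reason spikes against a baseline window. The function names, the spike factor, and the minimum count are illustrative assumptions, not Wavry defaults.

```python
from collections import Counter

# State names as listed on this page.
RELAY_STATES = ("Active", "Draining", "Probation")


def state_distribution(relay_states):
    """Return the share of relays in each known state (0.0 when no relays are reported)."""
    counts = Counter(relay_states)
    total = sum(counts.values())  # unknown states still count toward the total
    return {s: (counts[s] / total if total else 0.0) for s in RELAY_STATES}


def rejection_spikes(baseline, current, factor=3.0, min_count=10):
    """Return rejection reasons whose current count is at least min_count
    and more than `factor` times the baseline count for the same reason."""
    spikes = []
    for reason, count in current.items():
        if count >= min_count and count > factor * baseline.get(reason, 0):
            spikes.append(reason)
    return sorted(spikes)
```

The `min_count` floor is there so that a jump from one rejection to four does not page anyone. Both thresholds would need tuning against real traffic.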
Runbook: Daily
- Confirm control-plane container health.
- Review top error classes in gateway and relay logs.
- Check direct/relay ratio trend for anomalies.
- Verify there is no unexplained burst of session failures.
Runbook: Release Day
cargo fmt --all
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace --locked
- Desktop checks (bun run check)
- Website docs build (bun run build)
- Validate release artifact naming and checksums
- Confirm docker image tags and manifests
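The checksum validation step could be scripted along the following lines. This is a sketch under two assumptions that the page does not confirm: that release checksums are SHA-256, and that they are published in the common `sha256sum` manifest format (`<hex digest>  <file name>` per line). The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse sha256sum-style lines into {file name: lowercase hex digest}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum prefixes the name with '*' in binary mode.
        entries[name.lstrip("*")] = digest.lower()
    return entries


def verify_artifacts(directory, manifest_text):
    """Return (name, problem) pairs for artifacts that are missing or mismatched."""
    problems = []
    for name, expected in parse_manifest(manifest_text).items():
        path = Path(directory) / name
        if not path.is_file():
            problems.append((name, "missing"))
        elif sha256_of(path) != expected:
            problems.append((name, "checksum mismatch"))
    return problems
```

An empty result means every artifact named in the manifest is present and matches. Artifact naming rules are not checked here and would need a separate test.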
Linux Fleet Operations
For Linux host fleets (especially Wayland):
- Pin compositor and portal backend package versions.
- Pin required GStreamer plugin set.
- Run ./scripts/linux-display-smoke.sh on candidate images.
- Keep at least one KDE Wayland lane and one GNOME Wayland lane in verification.
- Track portal and PipeWire failure rates as first-class signals.
Change Management
For non-trivial changes:
- stage in representative network conditions
- deploy to canary cohort
- compare pre/post latency and failure metrics
- expand rollout only if indicators remain within target bounds
Always define the rollback threshold before the rollout begins.
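One way to make the rollback threshold explicit is to encode it as data and evaluate the canary against it mechanically. The sketch below assumes two metrics per cohort, `p95_latency_ms` and `failure_rate`. These metric names, the threshold fields, and the numbers in the usage note are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    # Defined before rollout; values here are set by the operator, not by this sketch.
    max_p95_latency_increase_pct: float  # relative increase allowed, in percent
    max_failure_rate_increase: float     # absolute increase allowed (0.005 = 0.5 points)


def canary_decision(baseline, canary, thresholds):
    """Compare canary metrics with the pre-rollout baseline.

    Returns ("expand", []) when all indicators are within bounds,
    otherwise ("rollback", [reasons...]).
    """
    reasons = []
    latency_increase_pct = (
        (canary["p95_latency_ms"] - baseline["p95_latency_ms"])
        / baseline["p95_latency_ms"] * 100
    )
    if latency_increase_pct > thresholds.max_p95_latency_increase_pct:
        reasons.append(f"p95 latency up {latency_increase_pct:.1f}%")
    failure_increase = canary["failure_rate"] - baseline["failure_rate"]
    if failure_increase > thresholds.max_failure_rate_increase:
        reasons.append(f"failure rate up {failure_increase:.4f}")
    return ("rollback", reasons) if reasons else ("expand", [])
```

For example, with a baseline p95 of 40 ms and a 10% latency threshold, a canary p95 of 42 ms stays within bounds, while 50 ms yields a rollback decision.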
Capacity Planning
Plan capacity for:
- peak concurrent sessions
- relay burst conditions
- host CPU/GPU saturation behavior
- region failover traffic shifts
Keep headroom so that steady-state load stays well below capacity ceilings.
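As a rough worked example of headroom planning, the sketch below sizes a host pool from a peak session count, a per-host session ceiling, a target utilization, and an optional multiplier for traffic absorbed during a region failover. All parameter names and the 0.7 default are assumptions for illustration. Real per-host ceilings depend on measured CPU/GPU saturation behavior.

```python
import math


def hosts_required(peak_sessions, sessions_per_host,
                   target_utilization=0.7, failover_factor=1.0):
    """Hosts needed so that peak load (scaled for failover) keeps each host
    at or below target_utilization of its session ceiling."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if sessions_per_host <= 0 or peak_sessions < 0 or failover_factor < 1:
        raise ValueError("invalid capacity inputs")
    effective_peak = peak_sessions * failover_factor
    usable_per_host = sessions_per_host * target_utilization
    return math.ceil(effective_peak / usable_per_host)
```

With 1000 peak sessions, 10 sessions per host, and 70% target utilization this gives 143 hosts. Planning to absorb 50% extra traffic from a failed region (`failover_factor=1.5`) raises that to 215.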
Incident Response Flow
- Classify impact (control-plane, data-plane, client runtime).
- Stabilize by limiting blast radius.
- Restore availability first, then optimize quality.
- Capture root cause with precise timeline.
- Assign follow-up fixes with owners and due dates.
Monthly Reliability Drills
Run at least one rehearsal per month:
- relay drain/recover drill in non-production
- master restart and readiness validation
- gateway restart and auth/signal continuity validation
- rollback drill using previous known-good release tags/images
Backup and Recovery
Minimum recommendations:
- backup gateway persistent state on a schedule
- keep versioned environment configuration
- validate restore path in staging periodically