
Operations

This page is the practical operator baseline for running Wavry in production-like environments.

Operational Objectives

  1. Keep interactive latency stable.
  2. Detect regressions quickly.
  3. Deploy safely with fast rollback.
  4. Maintain secure control-plane posture.

Deployment Model

Typical topology:

  • Docker-only control plane (gateway, relay)
  • host runtime pools (wavry-server)
  • user-facing clients (desktop/mobile/web integrations)

Use separate scaling strategies for:

  • control plane (API and registration behavior)
  • data plane (session/media load)
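
A minimal sketch of that split, run as two independent containers; the image names, ports, and flags below are illustrative assumptions, not published artifacts:

  # Control plane (gateway, relay) as separate containers so it scales
  # independently of the data plane. Images, ports, flags illustrative.
  docker run -d --name wavry-gateway \
    --restart unless-stopped \
    -p 8080:8080 \
    wavry/gateway:latest

  docker run -d --name wavry-relay \
    --restart unless-stopped \
    -p 3478:3478/udp \
    wavry/relay:latest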

Reliability Baseline (SLIs)

Track at least these service level indicators:

  • session setup success rate
  • handshake failure rate
  • direct vs relay ratio
  • p95/p99 input-to-present latency
  • session drop rate

Set SLO targets per environment tier (dev/stage/prod).
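
Percentiles can be sanity-checked offline before a metrics pipeline exists. A minimal sketch, assuming a file latencies.txt with one input-to-present sample (in milliseconds) per line; the file name and format are illustrative:

  # Print p95/p99 from a file of per-frame latency samples (ms).
  sort -n latencies.txt | awk '
    { v[NR] = $1 }
    END {
      p95 = v[int(NR * 0.95)]; if (p95 == "") p95 = v[NR]
      p99 = v[int(NR * 0.99)]; if (p99 == "") p99 = v[NR]
      printf "p95=%sms p99=%sms (n=%d)\n", p95, p99, NR
    }'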

Monitoring Baseline

Collect at minimum:

  • control-plane health and request rates
  • auth failures and rate-limit triggers
  • relay registration/heartbeat health
  • host runtime CPU/GPU pressure
  • RTT/loss/jitter trend series

Alert examples:

  • gateway health endpoint failing for > 2 minutes (see the probe sketch after this list)
  • relay registrations drop below the expected regional baseline
  • handshake failures spike above normal envelope
  • direct-path ratio drops sharply after a rollout
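
A simple probe loop is enough to start with for the gateway health alert; the /healthz path, gateway hostname, and notify.sh hook are illustrative assumptions:

  # Page when the gateway health endpoint fails continuously for > 2 minutes.
  failures=0
  while true; do
    if curl -fsS --max-time 5 http://gateway.internal:8080/healthz > /dev/null; then
      failures=0
    else
      failures=$((failures + 1))
    fi
    # 8 consecutive failures at a 15 s interval ~= 2 minutes down.
    if [ "$failures" -ge 8 ]; then
      ./notify.sh "gateway health endpoint failing for > 2 minutes"
      failures=0
    fi
    sleep 15
  done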

Control-Plane Operating Discipline

For gateway/master/relay operations:

  1. Keep lease/signing key status visible in dashboards.
  2. Track relay state distribution (Active, Draining, Probation) over time.
  3. Alert on relay rejection reason spikes (signature mismatch, wrong relay id, expired lease).
  4. Keep an explicit drain-and-recover runbook for unstable relays.
  5. Treat gateway auth/signal errors as user-impacting indicators.
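
Rejection-reason spikes (item 3) are easiest to spot as per-reason counts over a recent window. A sketch assuming the relay runs under systemd and logs a reason=<value> field; the unit name and log format are illustrative assumptions:

  # Count relay rejection reasons seen in the last hour.
  journalctl -u wavry-relay --since "1 hour ago" \
    | grep -o 'reason=[a-z_]*' \
    | sort | uniq -c | sort -rn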

Runbook: Daily

  1. Confirm control-plane container health.
  2. Review top error classes in gateway and relay logs.
  3. Check direct/relay ratio trend for anomalies.
  4. Verify no unexplained session failure burst.
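
Steps 1 and 2 are scriptable; the container names and log source below are illustrative assumptions:

  # Daily: container health, then top error classes from the last 24 h.
  docker ps --filter "name=wavry" --format "table {{.Names}}\t{{.Status}}"
  for c in wavry-gateway wavry-relay; do
    echo "== $c: top error classes =="
    docker logs --since 24h "$c" 2>&1 \
      | grep -i error | sort | uniq -c | sort -rn | head -10
  done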

Runbook: Release Day

  1. Run cargo fmt --all
  2. Run cargo clippy --workspace --all-targets -- -D warnings
  3. Run cargo test --workspace --locked
  4. Run desktop checks (bun run check)
  5. Build the website docs (bun run build)
  6. Validate release artifact naming and checksums
  7. Confirm Docker image tags and manifests
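
The command-line checks above chain naturally into a single gate script that stops at the first failure; the desktop/ and website/ paths are illustrative assumptions about the repository layout:

  #!/usr/bin/env sh
  # Release-day gate: abort on the first failing check.
  set -eu
  cargo fmt --all
  cargo clippy --workspace --all-targets -- -D warnings
  cargo test --workspace --locked
  (cd desktop && bun run check)   # path is illustrative
  (cd website && bun run build)   # path is illustrative
  echo "all release-day checks passed"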

Linux Fleet Operations

For Linux host fleets (especially Wayland):

  1. Pin compositor and portal backend package versions.
  2. Pin required GStreamer plugin set.
  3. Run ./scripts/linux-display-smoke.sh on candidate images.
  4. Keep at least one KDE Wayland lane and one GNOME Wayland lane in verification.
  5. Track portal and PipeWire failure rates as first-class signals.
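
On Debian/Ubuntu-based images, items 1 and 2 can be enforced with apt-mark; the package names below are typical portal/PipeWire/GStreamer packages and are assumptions about your image:

  # Hold portal, PipeWire, and GStreamer packages at installed versions.
  sudo apt-mark hold \
    xdg-desktop-portal \
    pipewire \
    gstreamer1.0-plugins-base \
    gstreamer1.0-plugins-good
  apt-mark showhold   # verify what is pinned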

Change Management

For non-trivial changes:

  1. Stage in representative network conditions.
  2. Deploy to a canary cohort.
  3. Compare pre/post latency and failure metrics.
  4. Expand the rollout only if indicators remain within target bounds.

Always define the rollback threshold before rollout begins.
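
A threshold is only useful if it is checked mechanically. A minimal sketch comparing canary p95 latency against the pre-rollout baseline; the metric files and the 10% bound are illustrative assumptions:

  # Roll back when canary p95 regresses more than 10% versus baseline.
  # baseline_p95.txt and canary_p95.txt each hold one value in ms.
  baseline=$(cat baseline_p95.txt)
  canary=$(cat canary_p95.txt)
  if awk -v b="$baseline" -v c="$canary" 'BEGIN { exit !(c > b * 1.10) }'; then
    echo "canary p95 ${canary}ms vs baseline ${baseline}ms: over threshold, roll back"
    exit 1
  fi
  echo "canary within threshold: expand rollout"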

Capacity Planning

Plan capacity for:

  • peak concurrent sessions
  • relay burst conditions
  • host CPU/GPU saturation behavior
  • region failover traffic shifts

Keep headroom and avoid running near steady-state capacity ceilings.
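
As a worked example, size steady-state load against a measured ceiling; the 70% target and the session count are illustrative numbers:

  # Keep steady-state load at or below ~70% of the measured ceiling.
  ceiling=5000       # measured max concurrent sessions per region (illustrative)
  target_util=70     # percent (illustrative)
  echo "plan for at most $(( ceiling * target_util / 100 )) steady-state sessions"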

Incident Response Flow

  1. Classify impact (control-plane, data-plane, client runtime).
  2. Stabilize by limiting blast radius.
  3. Restore availability first, then optimize quality.
  4. Capture root cause with precise timeline.
  5. Assign follow-up fixes with owners and due dates.

Monthly Reliability Drills

Run at least one rehearsal per month:

  1. Relay drain/recover drill in non-production.
  2. Master restart and readiness validation.
  3. Gateway restart and auth/signal continuity validation.
  4. Rollback drill using previous known-good release tags/images (see the sketch below).
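
For the rollback drill, pulling and restarting the previous known-good image exercises the real recovery path; the image name, tag scheme, and health endpoint are illustrative assumptions:

  # Rollback drill: restart the gateway on the previous known-good tag.
  docker pull wavry/gateway:v1.2.3          # tag is illustrative
  docker stop wavry-gateway && docker rm wavry-gateway
  docker run -d --name wavry-gateway \
    --restart unless-stopped \
    -p 8080:8080 \
    wavry/gateway:v1.2.3
  # Validate readiness before calling the drill complete.
  curl -fsS http://gateway.internal:8080/healthz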

Backup and Recovery

Minimum recommendations:

  • backup gateway persistent state on a schedule
  • keep versioned environment configuration
  • validate restore path in staging periodically
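
A minimal scheduled backup, assuming the gateway keeps persistent state under a single directory; the paths and 14-day retention are illustrative assumptions:

  # Nightly backup of gateway persistent state.
  STATE_DIR=/var/lib/wavry-gateway      # path is illustrative
  BACKUP_DIR=/backups/wavry             # path is illustrative
  mkdir -p "$BACKUP_DIR"
  tar -czf "$BACKUP_DIR/gateway-state-$(date +%Y%m%d).tar.gz" -C "$STATE_DIR" .
  # Prune backups older than 14 days.
  find "$BACKUP_DIR" -name 'gateway-state-*.tar.gz' -mtime +14 -delete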