Runbooks and Checklists
Use this page as a practical checklist hub.
Daily Operations Checklist
- control-plane containers healthy (docker compose ... ps)
- gateway health endpoint responding
- no unusual auth or handshake failure spikes
- direct/relay ratio within expected baseline
- no unexplained host runtime crash patterns
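One way to make the direct/relay baseline check concrete is sketched below. It is an illustration only: the function name, the idea of expressing the ratio as the direct share of all sessions, and the baseline and tolerance parameters are assumptions, not part of the project tooling. Where the session counts come from is left to the operator.

```sh
# direct_share_ok DIRECT RELAY BASELINE_PCT TOLERANCE_PCT
# Hypothetical helper: prints the direct-connection share as an integer
# percentage and succeeds when it lies within BASELINE_PCT +/- TOLERANCE_PCT.
# Exit codes: 0 within baseline, 1 outside baseline, 2 no sessions to compare.
direct_share_ok() (
  direct="$1"; relay="$2"; baseline="$3"; tolerance="$4"
  total=$((direct + relay))
  [ "$total" -gt 0 ] || exit 2          # no sessions: nothing to compare
  share=$((direct * 100 / total))       # integer division, rounds down
  echo "$share"
  [ "$share" -ge $((baseline - tolerance)) ] &&
    [ "$share" -le $((baseline + tolerance)) ]
)
```

For example, with 80 direct and 20 relayed sessions, a baseline of 85 and a tolerance of 10, the helper prints 80 and succeeds.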
Weekly Reliability Checklist
- review top recurring error signatures
- validate alert routing and on-call ownership
- verify backup integrity for gateway persistent data
- compare latency distribution against prior week
- confirm Linux preflight remains green on current host image
Release Checklist
- cargo fmt --all
- cargo clippy --workspace --all-targets -- -D warnings
- cargo test --workspace --locked
- desktop checks and website docs build
- verify artifact names and checksums
- verify Docker image tags/manifests
- canary deploy and metric comparison before full rollout
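The three cargo gates in this checklist can be wrapped so that a failure stops the sequence. The sketch below is a minimal example, and the function name is an assumption. It covers only the cargo steps; the desktop checks, docs build, artifact verification, and canary steps are not included.

```sh
# Hypothetical wrapper: runs the cargo gates from the release checklist in
# order and returns non-zero at the first gate that fails.
# Explicit "|| return 1" is used instead of "set -e" because "set -e" is
# ignored when the function is called from an "if" or "||" context.
run_release_gates() {
  cargo fmt --all || return 1
  cargo clippy --workspace --all-targets -- -D warnings || return 1
  cargo test --workspace --locked || return 1
}
```

Run it from the workspace root, for example as: run_release_gates && echo "cargo gates passed".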
Incident Response Checklist
- classify affected layer (control plane, runtime, network)
- stabilize user impact first
- capture logs and timestamps before restarts when possible
- apply smallest safe corrective action
- verify recovery and monitor for regression
- record post-incident actions with owner and due date
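For the "capture logs and timestamps before restarts" step, a small helper along these lines may be useful when the affected layer runs under Docker Compose. The function name and the output file naming are assumptions. Any extra arguments are passed through as docker compose global options, so use the same ones you normally pass for ps.

```sh
# capture_compose_logs OUT_DIR [COMPOSE_GLOBAL_OPTIONS...]
# Hypothetical helper: writes "docker compose logs" output, with per-line
# timestamps, to a UTC-stamped file under OUT_DIR and prints the file path.
capture_compose_logs() (
  [ "$#" -ge 1 ] || exit 2
  out_dir="$1"; shift
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"           # capture time, in UTC
  out_file="$out_dir/compose-logs-$stamp.log"
  mkdir -p "$out_dir" || exit 1
  # --timestamps adds a timestamp to every log line; --no-color keeps the
  # file free of terminal escape codes.
  docker compose "$@" logs --timestamps --no-color > "$out_file" 2>&1 || exit 1
  printf '%s\n' "$out_file"
)
```

Record the printed path in the incident notes so the capture can be found after the restart.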
Linux/Wayland Validation Checklist
- run ./scripts/linux-display-smoke.sh
- verify portal backend availability
- verify PipeWire health
- verify runtime backend behavior on KDE and GNOME lanes
- capture and track failure matrix outcomes
If Linux failures recur, use the Linux Production Playbook as the primary escalation document.
Change Approval Checklist
Before rolling out high-impact changes:
- rollback plan documented
- compatibility assumptions verified
- canary scope and success criteria defined
- observability and alert thresholds prepared
- owner assigned for live rollout window
Relay Drain and Recover Runbook
Drain a relay safely:
- set relay state to Draining using master admin API
- confirm no new relay assignments are issued for that relay
- wait until active sessions drop to zero (or maintenance threshold)
- stop or restart the relay instance
- return relay to Active or Probation once health checks pass
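The wait step in the drain procedure can be sketched as a polling loop. The master admin API is not described on this page, so the sketch does not call it: the caller supplies a command (hypothetical here) that prints the relay's current active session count. The function name, argument order, and exit codes are assumptions.

```sh
# wait_for_drain THRESHOLD TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: polls COMMAND until the session count it prints is at
# or below THRESHOLD.
# Exit codes: 0 drained, 1 timed out, 2 usage error or unreadable count.
wait_for_drain() (
  [ "$#" -ge 4 ] || exit 2
  threshold="$1"; timeout="$2"; interval="$3"; shift 3
  waited=0
  while :; do
    count="$("$@")" || exit 2                      # count command failed
    case "$count" in ''|*[!0-9]*) exit 2 ;; esac   # not a non-negative integer
    [ "$count" -le "$threshold" ] && exit 0        # safe to stop or restart
    [ "$waited" -ge "$timeout" ] && exit 1         # sessions still active
    sleep "$interval"
    waited=$((waited + interval))
  done
)
```

A threshold of 0 waits for a full drain; a higher value corresponds to the maintenance threshold mentioned above. Only stop or restart the relay when the helper exits with 0.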
Recover a quarantined relay:
- verify heartbeat freshness and load behavior
- validate relay key configuration (WAVRY_RELAY_MASTER_PUBLIC_KEY)
- run packet-path smoke checks (lease + forward)
- set relay state back to Probation first, then Active after stability window
Release Channel Checklist
- confirm target version follows policy (stable, -canary, or -unstable for public tags)
- confirm prerelease tags use only allowed suffixes (-canary or -unstable)
- validate release artifact names are platform/arch labeled
- verify SHA256SUMS.txt and release manifest are present
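The suffix rule for public tags can be checked mechanically. The sketch below assumes that a stable tag contains no hyphen at all and that a prerelease tag ends in exactly one of the two allowed suffixes; the function name is hypothetical and the exact version format is not specified on this page.

```sh
# release_channel TAG
# Hypothetical helper: prints stable, canary, or unstable for an allowed tag
# and fails for a tag with any other hyphenated suffix.
# Assumption: the version part before the suffix contains no hyphen.
release_channel() (
  case "$1" in
    ''|-*)      exit 1 ;;                               # empty or suffix only
    *-canary)   base="${1%-canary}";   channel=canary ;;
    *-unstable) base="${1%-unstable}"; channel=unstable ;;
    *)          base="$1";             channel=stable ;;
  esac
  case "$base" in ''|*-*) exit 1 ;; esac                # disallowed suffix
  echo "$channel"
)
```

With illustrative tags, v1.4.0 is classified as stable and v1.4.0-canary as canary, while v1.4.0-beta is rejected.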
Master Signing Key Rotation Runbook
- provision new signing key material and choose a new WAVRY_MASTER_KEY_ID
- restart master with new key and key id
- verify /.well-known/wavry-id publishes the expected key id
- verify relay registration response includes master_key_id
- validate new lease issuance and relay acceptance (/ready on both master and relay)
- monitor reject rates (key id mismatch, wrong relay, invalid signature) during rollout window
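The two endpoint checks in this runbook can be scripted with curl. The sketch below is an assumption-heavy illustration: the function names are hypothetical, the base URLs are supplied by the caller, and the response format of /.well-known/wavry-id is not described on this page, so the check only looks for the expected key id as a literal string in the response body.

```sh
# published_key_id_matches BASE_URL EXPECTED_KEY_ID
# Hypothetical helper: succeeds if BASE_URL/.well-known/wavry-id responds
# successfully and its body contains EXPECTED_KEY_ID verbatim.
# Assumption: the key id appears as plain text somewhere in the body.
published_key_id_matches() (
  [ -n "$2" ] || exit 1                       # an empty id would match anything
  body="$(curl -fsS --max-time 5 "$1/.well-known/wavry-id")" || exit 1
  printf '%s' "$body" | grep -qF -- "$2"
)

# is_ready BASE_URL
# Succeeds if BASE_URL/ready returns a successful HTTP status.
is_ready() {
  curl -fsS --max-time 5 -o /dev/null "$1/ready"
}
```

Because the match is a plain substring search, a short key id can also match a longer one that contains it, so confirm the printed response by eye during the first rotation. Run is_ready against both the master and the relay.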