Runbooks and Checklists

Use this page as a practical checklist hub.

Daily Operations Checklist

control-plane containers healthy (docker compose ... ps)
gateway health endpoint responding
no unusual auth or handshake failure spikes
direct/relay ratio within expected baseline
no unexplained host runtime crash patterns
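The direct/relay ratio item can be turned into a simple numeric check. The sketch below is only an illustration: the function name, the session counts as inputs, and the 10-point tolerance are assumptions, not values taken from the Wavry tooling.

```python
def relay_ratio_within_baseline(direct_sessions, relay_sessions,
                                baseline_relay_share, tolerance=0.10):
    """Return True when the observed relay share is within `tolerance` of the baseline.

    All parameters are hypothetical inputs; feed them from whatever metrics
    source the deployment actually uses.
    """
    total = direct_sessions + relay_sessions
    if total == 0:
        # No traffic in the window: nothing to compare, so do not alarm.
        return True
    relay_share = relay_sessions / total
    return abs(relay_share - baseline_relay_share) <= tolerance
```

For example, 80 direct and 20 relayed sessions against a baseline relay share of 0.25 stays within the default tolerance, while a 50/50 split does not.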

Weekly Reliability Checklist

review top recurring error signatures
validate alert routing and on-call ownership
verify backup integrity for gateway persistent data
compare latency distribution against prior week
confirm Linux preflight remains green on current host image
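One way to make the week-over-week latency comparison repeatable is to compare a few percentiles rather than averages. This is a minimal sketch under stated assumptions: latency samples are available as plain lists of milliseconds, and the percentile set and the 20% growth threshold are placeholders to tune.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_regressions(current_ms, previous_ms,
                        percentiles=(50, 95, 99), max_increase=0.20):
    """Return {percentile: (previous, current)} for percentiles that grew
    by more than `max_increase` (a fraction) compared with the prior week."""
    regressions = {}
    for pct in percentiles:
        prev = percentile(previous_ms, pct)
        cur = percentile(current_ms, pct)
        if prev > 0 and (cur - prev) / prev > max_increase:
            regressions[pct] = (prev, cur)
    return regressions
```

An empty result means none of the tracked percentiles grew beyond the threshold. Tail percentiles tend to surface regressions that a median comparison hides.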

Release Checklist

cargo fmt --all
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace --locked
run desktop checks and build the website docs
verify artifact names and checksums
verify Docker image tags/manifests
canary deploy and metric comparison before full rollout
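The checksum step can be scripted. The sketch below assumes the checksum file uses the line format written by the `sha256sum` tool (`<hex digest>  <file name>`); that format and both function names are assumptions for illustration, not a description of the project's release tooling.

```python
import hashlib
from pathlib import Path


def parse_sha256sums(text):
    """Parse `sha256sum`-style lines into {file_name: hex_digest}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, _, name = line.partition(" ")
        # sha256sum separates fields with two spaces, or " *" in binary mode.
        entries[name.strip().lstrip("*")] = digest.lower()
    return entries


def verify_artifacts(directory, sums_text):
    """Return a sorted list of artifact names that are missing or whose
    SHA-256 digest differs from the expected value."""
    failures = []
    for name, expected in parse_sha256sums(sums_text).items():
        path = Path(directory) / name
        if not path.is_file():
            failures.append(name)
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            failures.append(name)
    return sorted(failures)
```

An empty list means every listed artifact exists and matches. The function reads each file fully into memory, which is fine for a sketch but may need chunked hashing for large artifacts.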

Incident Response Checklist

classify affected layer (control plane, runtime, network)
stabilize user impact first
capture logs and timestamps before restarts when possible
apply smallest safe corrective action
verify recovery and monitor for regression
record post-incident actions with owner and due date

Linux/Wayland Validation Checklist

run ./scripts/linux-display-smoke.sh
verify portal backend availability
verify PipeWire health
verify runtime backend behavior on KDE and GNOME lanes
capture and track failure matrix outcomes

If Linux failures recur, use the Linux Production Playbook as the primary escalation document.
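Tracking the failure matrix over several runs makes "recurring" concrete. The sketch below assumes each validation run is recorded as a mapping from a (lane, check) pair to a pass/fail boolean; the record shape, the check labels, and the threshold of two failures are illustrative assumptions.

```python
from collections import Counter


def recurring_failures(runs, min_failures=2):
    """Return sorted (lane, check) cells that failed in at least
    `min_failures` of the recorded runs."""
    counts = Counter()
    for run in runs:
        for cell, passed in run.items():
            if not passed:
                counts[cell] += 1
    return sorted(cell for cell, n in counts.items() if n >= min_failures)


# Hypothetical example: two runs across the KDE and GNOME lanes.
runs = [
    {("KDE", "pipewire"): False, ("GNOME", "portal"): True},
    {("KDE", "pipewire"): False, ("GNOME", "portal"): False},
]
print(recurring_failures(runs))
```

Cells returned by this function are candidates for escalation. A single failure in one run stays below the default threshold.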

Change Approval Checklist

Before rolling out high-impact changes:

rollback plan documented
compatibility assumptions verified
canary scope and success criteria defined
observability and alert thresholds prepared
owner assigned for live rollout window

Relay Drain and Recover Runbook

Drain a relay safely:

set relay state to Draining using the master admin API
confirm no new relay assignments are issued for that relay
wait until active sessions drop to zero (or maintenance threshold)
stop or restart the relay instance
return relay to Active or Probation once health checks pass
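The drain sequence above can be sketched as a small orchestration function. The admin API calls are not specified on this page, so they are passed in as callables; every parameter name here is hypothetical, and the check that no new assignments are issued is left to the operator.

```python
import time


def drain_relay(set_state, active_sessions, restart, healthy,
                threshold=0, poll_seconds=5.0, timeout_seconds=600.0,
                sleep=time.sleep, clock=time.monotonic):
    """Drain, restart, and return a relay to service.

    set_state, active_sessions, restart, and healthy are caller-supplied
    callables wrapping whatever admin API and health checks are in use.
    """
    set_state("Draining")
    deadline = clock() + timeout_seconds
    # Wait for sessions to fall to the threshold, giving up after the timeout.
    while active_sessions() > threshold:
        if clock() >= deadline:
            raise TimeoutError("sessions did not drain before the timeout")
        sleep(poll_seconds)
    restart()
    if not healthy():
        raise RuntimeError("relay failed health checks after restart")
    # The checklist allows Active or Probation; this sketch picks Probation
    # as the more conservative choice.
    set_state("Probation")
    return "Probation"
```

Injecting `sleep` and `clock` keeps the function easy to exercise in tests without real waiting. A timeout leaves the relay in Draining so an operator can decide how to proceed.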

Recover a quarantined relay:

verify heartbeat freshness and load behavior
validate relay key configuration (WAVRY_RELAY_MASTER_PUBLIC_KEY)
run packet-path smoke checks (lease + forward)
set relay state back to Probation first, then Active after stability window
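The state changes named in this runbook can be checked before they are applied. The transition table below is inferred only from the steps on this page and is an assumption; the master may permit other transitions, and this sketch does not model how a relay becomes quarantined.

```python
# Transitions inferred from the drain and recover steps in this runbook.
ALLOWED_TRANSITIONS = {
    "Active": {"Draining"},
    "Draining": {"Active", "Probation"},
    "Quarantined": {"Probation"},
    "Probation": {"Active"},
}


def plan_is_valid(start, steps):
    """Return True when every step in `steps` is an allowed transition,
    starting from the state `start`."""
    state = start
    for next_state in steps:
        if next_state not in ALLOWED_TRANSITIONS.get(state, set()):
            return False
        state = next_state
    return True
```

With this table, recovering a quarantined relay through Probation and then Active is accepted, while jumping straight from Quarantined to Active is rejected.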

Release Channel Checklist

confirm target version follows policy (stable, -canary, or -unstable for public tags)
confirm prerelease tags use only allowed suffixes (-canary or -unstable)
validate release artifact names are platform/arch labeled
verify SHA256SUMS.txt and release manifest are present
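The suffix policy lends itself to an automated check. In the sketch below, only the allowed suffixes come from the checklist; the overall tag shape (an optional `v` followed by MAJOR.MINOR.PATCH) and the function name are assumptions.

```python
import re

# Assumed tag shape: optional "v", MAJOR.MINOR.PATCH, optional "-suffix".
_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")
ALLOWED_PRERELEASE = {"canary", "unstable"}


def release_channel(tag):
    """Return 'stable', 'canary', or 'unstable' for a version tag.

    Raises ValueError for tags that do not match the assumed shape or that
    carry a prerelease suffix outside the allowed set.
    """
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"not a recognised version tag: {tag!r}")
    suffix = match.group(4)
    if suffix is None:
        return "stable"
    if suffix not in ALLOWED_PRERELEASE:
        raise ValueError(f"prerelease suffix not allowed: {suffix!r}")
    return suffix
```

A tag such as `v1.2.3` maps to stable and `v1.2.3-canary` to canary, while `v1.2.3-beta` is rejected. Running a check like this in CI before tagging catches policy violations early.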

Master Signing Key Rotation Runbook

provision new signing key material and choose a new WAVRY_MASTER_KEY_ID
restart the master with the new key and key ID
verify /.well-known/wavry-id publishes the expected key id
verify relay registration response includes master_key_id
validate new lease issuance and relay acceptance (/ready on both master and relay)
monitor reject rates (key id mismatch, wrong relay, invalid signature) during rollout window
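The two key ID verification steps can be combined into one comparison once both responses are fetched and parsed. The `master_key_id` field is named in this runbook; the field name used for the `/.well-known/wavry-id` document is not stated here, so `key_id` below is a hypothetical default that should be replaced with the real one.

```python
def key_id_mismatches(expected_key_id, well_known, registration,
                      well_known_field="key_id"):
    """Return a list of human-readable problems; empty means both parsed
    payloads carry the expected key ID.

    well_known_field is an assumed name for the key ID entry in the
    well-known document.
    """
    problems = []
    published = well_known.get(well_known_field)
    if published != expected_key_id:
        problems.append(
            f"well-known publishes {published!r}, expected {expected_key_id!r}")
    registered = registration.get("master_key_id")
    if registered != expected_key_id:
        problems.append(
            f"registration returned {registered!r}, expected {expected_key_id!r}")
    return problems
```

Both arguments are plain dictionaries, so the function is independent of how the responses are fetched. A non-empty result during the rollout window is a signal to pause before relays start rejecting leases.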
