Control Plane Deep Dive

This document explains how Wavry control-plane services work in production, how they fail, and how to operate them safely.

Scope

The control plane includes:

wavry-gateway: authentication, session APIs, signaling WebSocket, relay-session broker
wavry-master: relay registry, lease issuance, relay selection
wavry-relay: encrypted UDP forwarding fallback path

The data plane includes:

wavry-server and wavry-client encrypted media/input transport

Control-plane services should be deployed as Docker services in production. Native release binaries for gateway and relay are intentionally not shipped.

Architectural Responsibilities

Service	Primary responsibility	Must never do
Gateway	user/session auth, signaling rendezvous, relay metadata broker	decrypt media payloads
Master	relay registration, health state, signed lease issuance	forward media packets
Relay	encrypted UDP packet forwarding between peers	inspect/decrypt media/application payloads

Trust Model

Clients and hosts authenticate through the gateway's session and token mechanisms.
Master signs relay leases with bounded validity.
Relay validates lease claims and forwards encrypted packets only.
End-to-end media/input confidentiality is maintained by RIFT crypto layers, not by relay trust.

Session Setup Sequence

Client and host bind/authenticate via gateway signaling.
Session metadata and direct-candidate data are exchanged.
If no direct path is available, the master selects a relay and issues a lease for it.
Relay validates lease and session identity constraints.
Media/input packets flow directly between peers or are forwarded by the relay.
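
The fallback decision in the last three steps can be sketched as a small function. This is an illustrative sketch only: `MediaPath`, `select_path`, and both parameters are hypothetical names, not part of the Wavry codebase, and the real decision presumably involves candidate probing and lease validation that are omitted here.

```rust
/// Outcome of path selection for one session (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum MediaPath {
    /// Peers exchange packets directly.
    Direct,
    /// Packets are forwarded by the relay named in the lease.
    Relayed { relay_id: String },
    /// No direct path and no lease: session setup cannot proceed.
    Unavailable,
}

/// Prefers the direct path and falls back to a relay only when a lease exists.
/// `relay_lease` carries the relay id from a master-issued lease, if any.
pub fn select_path(direct_reachable: bool, relay_lease: Option<&str>) -> MediaPath {
    if direct_reachable {
        return MediaPath::Direct;
    }
    match relay_lease {
        Some(relay_id) => MediaPath::Relayed {
            relay_id: relay_id.to_string(),
        },
        None => MediaPath::Unavailable,
    }
}
```

Returning an explicit `Unavailable` variant, rather than silently retrying, matches the fail-fast behavior this document expects when the master cannot issue a lease.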

Relay Lease Lifecycle

Relay lease lifecycle should be treated as a strict state machine:

Issued: signed by master with bounded TTL.
Presented: peer presents lease to relay.
Accepted: relay verifies signature, key id, relay id, time window.
Active: relay forwards packets while lease is valid and session remains healthy.
Renewed or Expired: renewal extends valid window; expiration terminates forwarding eligibility.
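
One way to make the "strict state machine" concrete is an explicit transition function that rejects every move not listed above. The sketch below is an assumption about how this could be modeled, not the relay's actual implementation; the event names (`Present`, `VerifyOk`, `StartForwarding`, `Renew`, `Expire`) are hypothetical, and rejection handling for failed verification is left out.

```rust
/// Lease states, named after the lifecycle steps listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeaseState {
    Issued,
    Presented,
    Accepted,
    Active,
    Expired,
}

/// Events that move a lease between states (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeaseEvent {
    Present,
    VerifyOk,
    StartForwarding,
    Renew,
    Expire,
}

/// Returns the next state, or `None` when the transition is not allowed.
pub fn next_state(state: LeaseState, event: LeaseEvent) -> Option<LeaseState> {
    use LeaseEvent::*;
    use LeaseState::*;
    match (state, event) {
        (Issued, Present) => Some(Presented),
        (Presented, VerifyOk) => Some(Accepted),
        (Accepted, StartForwarding) => Some(Active),
        // Renewal extends the valid window; the lease stays active.
        (Active, Renew) => Some(Active),
        // Expired is terminal: nothing leaves it, not even a renewal.
        (Expired, _) => None,
        // Any non-expired lease can expire.
        (_, Expire) => Some(Expired),
        // Everything else (skipping a step, renewing before Active) is rejected.
        _ => None,
    }
}
```

Treating `Expired` as terminal means an expired lease must be reissued by the master rather than revived at the relay.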

Operationally important constraints:

reject leases whose `nbf` (not-before) claim lies further in the future than the allowed clock skew
reject expired leases immediately
reject leases bound to wrong relay id
reject replayed/duplicated lease-present packets
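
The four constraints above can be sketched as a single check that runs after signature verification. Everything here is illustrative: the struct layout, the `present_nonce` field used for replay detection, and the `RejectReason` names are assumptions, and signature and key-id verification are deliberately omitted.

```rust
use std::collections::HashSet;

/// Why a lease was rejected (hypothetical names, useful as metric labels).
#[derive(Debug, PartialEq, Eq)]
pub enum RejectReason {
    NotYetValid,
    Expired,
    WrongRelay,
    Replayed,
}

/// Claims assumed to be carried by an already signature-verified lease.
pub struct LeaseClaims {
    /// Hypothetical per-presentation nonce used to detect replays.
    pub present_nonce: u64,
    pub relay_id: String,
    /// Not-before and expiry, in Unix seconds.
    pub nbf: u64,
    pub exp: u64,
}

pub struct LeaseValidator {
    relay_id: String,
    max_skew_secs: u64,
    seen_nonces: HashSet<u64>,
}

impl LeaseValidator {
    pub fn new(relay_id: &str, max_skew_secs: u64) -> Self {
        Self {
            relay_id: relay_id.to_string(),
            max_skew_secs,
            seen_nonces: HashSet::new(),
        }
    }

    /// Applies the time, relay-binding, and replay checks in that order.
    pub fn check(&mut self, claims: &LeaseClaims, now: u64) -> Result<(), RejectReason> {
        // A future nbf is tolerated only within the allowed clock skew.
        if claims.nbf > now.saturating_add(self.max_skew_secs) {
            return Err(RejectReason::NotYetValid);
        }
        // No skew allowance on expiry: expired leases are rejected immediately.
        if claims.exp <= now {
            return Err(RejectReason::Expired);
        }
        if claims.relay_id != self.relay_id {
            return Err(RejectReason::WrongRelay);
        }
        // Record the nonce only for otherwise valid leases; a repeat is a replay.
        if !self.seen_nonces.insert(claims.present_nonce) {
            return Err(RejectReason::Replayed);
        }
        Ok(())
    }
}
```

A production version would also need to bound the nonce set, for example by evicting entries once their lease has expired. Returning a distinct reason per failure makes it straightforward to count rejections by cause.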

Failure Modes and Expected Behavior

Failure mode	Expected behavior	Operator action
Gateway unhealthy	new session setup fails fast	restore gateway health, preserve DB state
Master unhealthy	relay assignment and lease issuance fail	fail over or restore master; verify signing key availability
Relay unhealthy	affected sessions degrade/disconnect	drain relay, remove from active selection, replace instance
Bad master key rotation	lease rejections spike	verify `kid` and the relay's public key configuration, then roll the key plan forward or back
High relay load	dropped/rejected sessions increase	scale relay pool, enforce load shedding and rate controls
NAT churn/rebind spikes	peer address changes rise	ensure NAT rebinding handling paths remain enabled and tested

Scaling Strategy

Control-plane scaling and runtime scaling should be decoupled:

Scale gateway on auth/API/signaling pressure.
Scale master on lease/registry pressure.
Scale relay on UDP forwarding pressure and region coverage.

Recommended minimums:

multi-instance gateway behind stable ingress
relay pools segmented by region
relay state/health tracked continuously by master
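
To illustrate how region segmentation and continuous health tracking could feed relay selection, the sketch below picks the least-loaded relay in a region among those with a fresh heartbeat. It is not the master's actual selection algorithm: `RelayEntry`, its fields, and the least-sessions policy are all assumptions made for the example.

```rust
/// One relay as seen by the registry (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct RelayEntry {
    pub relay_id: String,
    pub region: String,
    /// Unix seconds of the last heartbeat received.
    pub last_heartbeat_secs: u64,
    pub active_sessions: u32,
    /// Set while an operator is draining the relay.
    pub draining: bool,
}

/// Picks the relay with the fewest active sessions in `region`, skipping
/// draining relays and relays whose heartbeat is older than the allowed age.
pub fn select_relay<'a>(
    relays: &'a [RelayEntry],
    region: &str,
    now_secs: u64,
    max_heartbeat_age_secs: u64,
) -> Option<&'a RelayEntry> {
    relays
        .iter()
        .filter(|r| r.region == region && !r.draining)
        .filter(|r| now_secs.saturating_sub(r.last_heartbeat_secs) <= max_heartbeat_age_secs)
        .min_by_key(|r| r.active_sessions)
}
```

Excluding draining relays at selection time is what lets an operator drain a relay without new sessions landing on it.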

Docker-Only Deployment Policy

Production policy:

Deploy gateway/relay via Docker images only.
Pin explicit image tags for production.
Avoid floating tags in long-lived environments.
Keep relay health endpoints private unless required by control tooling.

See Docker Control Plane for deployment commands and base environment variables.

Security Hardening Baseline

Keep insecure dev flags disabled in production:
- WAVRY_RELAY_ALLOW_INSECURE_DEV=0
- avoid insecure runtime overrides
Set and rotate strong admin/auth tokens.
Restrict public binds unless explicitly intended.
Keep signing key material out of repository and image layers.
Audit auth failures, rate-limit triggers, and admin actions.
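
A startup preflight is one way to enforce the first item mechanically. The sketch below is hypothetical: this document does not specify how the relay parses `WAVRY_RELAY_ALLOW_INSECURE_DEV`, so the example conservatively treats anything other than unset, empty, or `0` as "enabled". The function names are illustrative.

```rust
/// Interprets the raw value of WAVRY_RELAY_ALLOW_INSECURE_DEV.
/// Assumption: unset, empty, or "0" means disabled; any other value is
/// treated as enabled so that typos fail closed.
pub fn insecure_dev_enabled(raw: Option<&str>) -> bool {
    match raw.map(str::trim) {
        None | Some("") | Some("0") => false,
        Some(_) => true,
    }
}

/// Refuses to continue when the insecure dev flag is on in production.
pub fn production_preflight(raw: Option<&str>) -> Result<(), String> {
    if insecure_dev_enabled(raw) {
        Err(format!(
            "WAVRY_RELAY_ALLOW_INSECURE_DEV={:?} is not allowed in production",
            raw.unwrap_or("")
        ))
    } else {
        Ok(())
    }
}
```

A caller could pass `std::env::var("WAVRY_RELAY_ALLOW_INSECURE_DEV").ok().as_deref()` as the argument and abort startup on `Err`.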

Observability Baseline

Track at minimum:

gateway auth success/failure rates
signaling bind failures and timeouts
relay register/heartbeat freshness
lease issue/reject rates by reason
relay packet forward/drop/rate-limit counters

Alert examples:

gateway health check failing for more than 2 minutes
relay registration inventory drops below expected baseline
lease rejections spike with signature or key-id mismatch as the reason
sudden collapse in the direct-to-relay session ratio after a rollout
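
The first alert is a "failing continuously for longer than a threshold" rule. In practice this would normally live in the monitoring system's rule language; the sketch below only illustrates the logic, and `HealthAlert` and its methods are hypothetical names. Timestamps are passed in as a `Duration` since an arbitrary start point so the logic stays deterministic.

```rust
use std::time::Duration;

/// Fires once a target has been unhealthy for longer than `threshold`.
pub struct HealthAlert {
    threshold: Duration,
    /// Time of the first failed probe in the current failure streak.
    failing_since: Option<Duration>,
}

impl HealthAlert {
    pub fn new(threshold: Duration) -> Self {
        Self {
            threshold,
            failing_since: None,
        }
    }

    /// Records one probe result taken at `now` and returns true when the
    /// alert should be firing.
    pub fn observe(&mut self, now: Duration, healthy: bool) -> bool {
        if healthy {
            // Any healthy probe ends the streak and clears the alert.
            self.failing_since = None;
            return false;
        }
        // Remember when the streak started, then compare its length.
        let start = *self.failing_since.get_or_insert(now);
        now.saturating_sub(start) > self.threshold
    }
}
```

With `HealthAlert::new(Duration::from_secs(120))`, a single failed probe does not page anyone; only a streak longer than two minutes does, which filters out brief restarts.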

Incident Runbook Entry Points

When the control plane is degraded:

Identify impacted layer: gateway, master, relay, or network perimeter.
Contain the blast radius first: drain the failing relay or isolate the failing gateway instance.
Restore session setup path before quality tuning.
Validate with smoke flows and health endpoints.
Record root cause and enforce follow-up actions.

Primary companion docs:

Deployment Validation Checklist

Before promoting a control-plane change:

cargo fmt --all
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace --locked
gateway auth smoke passes
master/relay smoke passes
relay drain/recover flow verified
logs and metrics confirm expected behavior after canary

Practical Notes

Keep relay and auth flow documentation close to code changes.
Keep runbooks executable by operators who are not the original implementers.
Treat lease/token behavior as release-critical: small mistakes can cause wide session impact.
