From 69537095743a2f4831ba3f53388a32ac92d799eb Mon Sep 17 00:00:00 2001
From: Axodouble
Date: Fri, 15 May 2026 04:05:30 +0000
Subject: [PATCH] AI assisted documentation

---
 README.md                          |  17 ++
 docs/README.md                     |  53 ++++++
 docs/architecture.md               | 196 +++++++++++++++++++++
 docs/configuration.md              | 273 +++++++++++++++++++++++++++++
 docs/deployment/docker.md          | 198 +++++++++++++++++++++
 docs/deployment/public-internet.md | 180 +++++++++++++++++++
 docs/deployment/systemd.md         | 250 ++++++++++++++++++++++++++
 docs/deployment/tailscale.md       | 181 +++++++++++++++++++
 docs/installation.md               | 104 +++++++++++
 docs/operations.md                 | 225 ++++++++++++++++++++++++
 docs/security.md                   | 153 ++++++++++++++++
 docs/troubleshooting.md            | 199 +++++++++++++++++++++
 12 files changed, 2029 insertions(+)
 create mode 100644 docs/README.md
 create mode 100644 docs/architecture.md
 create mode 100644 docs/configuration.md
 create mode 100644 docs/deployment/docker.md
 create mode 100644 docs/deployment/public-internet.md
 create mode 100644 docs/deployment/systemd.md
 create mode 100644 docs/deployment/tailscale.md
 create mode 100644 docs/installation.md
 create mode 100644 docs/operations.md
 create mode 100644 docs/security.md
 create mode 100644 docs/troubleshooting.md

diff --git a/README.md b/README.md
index 5c8d8ec..e4ff6bb 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,23 @@
definition, can't tell you when it's the one that's down. `qu` solves
both: run it on a few cheap hosts in different networks and they vote
on truth. If one of them loses its uplink, the rest keep alerting.

## Documentation

This README is the quick-start. For production use, the longer guides
live under [`docs/`](docs/README.md):

| If you want to…                                       | Read                                                                      |
| ----------------------------------------------------- | ------------------------------------------------------------------------- |
| understand the consensus / replication model          | [docs/architecture.md](docs/architecture.md)                              |
| reference every field in `node.yaml` / `cluster.yaml` | [docs/configuration.md](docs/configuration.md)                            |
| deploy on Linux with systemd hardening                | [docs/deployment/systemd.md](docs/deployment/systemd.md)                  |
| deploy with Docker / docker-compose                   | [docs/deployment/docker.md](docs/deployment/docker.md)                    |
| deploy over Tailscale or WireGuard                    | [docs/deployment/tailscale.md](docs/deployment/tailscale.md)              |
| expose `qu` on the open internet safely               | [docs/deployment/public-internet.md](docs/deployment/public-internet.md)  |
| upgrade, back up, or recover from failures            | [docs/operations.md](docs/operations.md)                                  |
| understand the trust model and rotate identities      | [docs/security.md](docs/security.md)                                      |
| diagnose a misbehaving cluster                        | [docs/troubleshooting.md](docs/troubleshooting.md)                        |

## Architecture

diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..4e972ca
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,53 @@
# QUptime documentation

Production-oriented documentation for `qu`, a small distributed uptime
monitor that votes on the health of HTTP/TCP/ICMP targets across a
cluster of cooperating nodes.

The top-level `README.md` is the marketing pitch and quick-start. The
pages here go deeper and are organised by what you're trying to do.

## Getting set up

- [Installation](installation.md) — pre-built binaries, building from
  source, verifying release artifacts, what the install script does.
- [Configuration](configuration.md) — `node.yaml`, `cluster.yaml`,
  `trust.yaml`, environment variables, file layout, defaults.

## Running it

- [Architecture](architecture.md) — how nodes form quorum, how a master
  is elected, how cluster state replicates, what happens during a
  partition, and exactly which guarantees the design gives you.
- [Operations](operations.md) — day-2 tasks: upgrades, backups,
  recovery from a lost node, recovery from a lost quorum, monitoring
  `qu` itself.
- [Security](security.md) — the mTLS / TOFU trust model, what the
  cluster secret protects, how to rotate keys, what to put on a public
  network and what not to.
- [Troubleshooting](troubleshooting.md) — common failure modes with
  the log lines you'll see and the fix.

## Deployment recipes

Pick the one that matches your environment. They share most of the
operational guidance — what differs is how `qu` is packaged and how
the inter-node link is secured at the network layer.

- [systemd on bare metal / VM](deployment/systemd.md) — single static
  binary, hardened unit file, `CAP_NET_RAW` for ICMP.
- [Docker / docker-compose](deployment/docker.md) — official image,
  single-node and multi-node compose files, persistent volumes.
- [Tailscale / WireGuard overlay](deployment/tailscale.md) — nodes in
  separate networks with no public ingress; cluster traffic stays on
  the tailnet.
- [Public-internet exposure](deployment/public-internet.md) — when
  you have no overlay and `:9901` is reachable from the open
  internet: firewalling, rate-limiting, secret hygiene.

## A note on stability

The wire protocol (`internal/transport`) and the on-disk format
(`cluster.yaml`, `node.yaml`, `trust.yaml`) are considered stable
within a minor version. Breaking changes will bump the major version
and ship with a migration note.

diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..84f6a5f
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,196 @@
# Architecture

This page is the long-form companion to the diagram in the top-level
README. Read it if you need to reason about partitions, recovery,
upgrade ordering, or the consistency guarantees of `qu`.

## Components

A running `qu serve` is one process containing five long-lived
goroutines plus the listeners:

| Component      | Package              | Role                                                                    |
| -------------- | -------------------- | ----------------------------------------------------------------------- |
| Transport      | `internal/transport` | mTLS listener + dialer, length-prefixed JSON-RPC framing.                |
| Quorum manager | `internal/quorum`    | 1 Hz heartbeats, liveness tracking, deterministic master election.       |
| Replicator     | `internal/replicate` | Master-routed mutations, version-gated broadcast and pull.               |
| Scheduler      | `internal/checks`    | One goroutine per check; runs HTTP/TCP/ICMP probes on each node.         |
| Aggregator     | `internal/checks`    | Master-only. Folds per-node probe results into a cluster-wide verdict.   |
| Alert dispatch | `internal/alerts`    | Master-only. Renders templates and ships SMTP / Discord notifications.   |
| Control socket | `internal/daemon`    | Local-only unix socket; the CLI and TUI talk to the daemon through it.   |

Every node runs every component. Whether the master-only ones actually
*do* anything depends on the result of master election.

## Trust and transport

Inter-node traffic is TLS 1.3 with mutual authentication. There is **no
central CA**. Each node generates a self-signed RSA cert at `qu init`
and the SPKI fingerprint of that cert is what other nodes pin against.

Two layers gate access:

1. **TLS layer** accepts any client cert. This avoids a chicken-and-egg
   during bootstrap — a brand-new node has no entry in anyone's trust
   store yet, so a strict TLS check would refuse the very first
   handshake.
2. **RPC dispatcher** rejects every method except `Join` for callers
   whose presented fingerprint is not in `trust.yaml`. So an untrusted
   peer can knock on the door but cannot ask questions.

`Join` itself is gated by the **cluster secret** — a pre-shared base64
string generated at `qu init` on the first node. Without it, an
attacker who can reach `:9901` cannot enrol themselves into the
cluster.

The local CLI talks to the daemon over a unix socket with `0600`
permissions; filesystem ACLs are the only authentication and no TLS is
used on that channel.

## The replicated state machine

`cluster.yaml` is the single replicated source of truth. It holds three
editable lists — `peers`, `checks`, `alerts` — plus three
server-controlled fields:

```yaml
version: 7              # monotonically increasing
updated_at: 2026-05-15T...
updated_by: <node-id>   # master that committed this version
peers: [...]
checks: [...]
alerts: [...]
```

### How mutations flow

1. The CLI (or the manual-edit watcher; see below) issues a mutation
   on the local daemon's control socket.
2. The daemon's replicator looks at the current quorum view:
   - If there is no quorum, the mutation fails loudly with
     `no quorum: refusing mutation`.
   - If this node is the master, apply locally and broadcast.
   - Otherwise, ship the mutation to the master via the
     `ProposeMutation` RPC and wait for the result.
3. The master holds the cluster lock, applies the mutation, bumps
   `version`, writes `cluster.yaml` atomically, and broadcasts the new
   snapshot to every peer via `ApplyClusterCfg`.
4. Each follower's `Replace` accepts the snapshot **only if**
   `incoming.Version > local.Version`. Older or equal versions are
   dropped silently.

The mutation kinds are enumerated in `internal/transport/messages.go`:
`add_check`, `remove_check`, `add_alert`, `remove_alert`, `add_peer`,
`remove_peer`, `replace_config`.

### Manual edits to `cluster.yaml`

Operators can `sudoedit /etc/quptime/cluster.yaml` on any node. Every
2 seconds the daemon hashes the file. When the on-disk hash diverges
from the last hash the daemon wrote, the new content is parsed and
forwarded to the master as a `replace_config` mutation. So a hand-edit
on a follower still ends up on the master, version-bumped, and
broadcast everywhere.

If the parse fails (invalid YAML), the daemon logs and pins the bad
hash so it doesn't loop. The operator's next valid save unblocks it.

## Quorum and master election

Every node sends a heartbeat to every peer once per second. A peer is
**live** if a heartbeat (sent or received) was observed within the
last 4 seconds — comfortably more than three missed beats, so a
one-tick blip does not unseat the master.

**Quorum** is met when `len(live_peers) >= floor(N/2) + 1` where `N`
is the total peer count in `cluster.yaml`. Below quorum, the cluster
refuses every mutation; existing checks continue probing locally but no
state transitions are committed (the master is the only one who
aggregates, and there is no master).

**Master election** is deterministic with no negotiation step: among
the live members, the master is the one with the lexicographically
smallest `NodeID`.
Every node that observes the same live set picks the +same master — so there is no split-brain window even during a partial +partition. + +The `term` integer in `qu status` is bumped every time the elected +master changes (including transitions to and from "no master"). Use it +to spot flappy clusters. + +## Catch-up when a node reconnects + +This is the scenario most people ask about: node C is offline, the +master commits config version 7, node C comes back online. What +happens? + +1. Node C's tick loop fires heartbeats every second regardless of its + previous state. There is no backoff, no give-up. +2. Each heartbeat carries the sender's `Version`. Each response carries + the responder's `Version`. +3. The first time C sees a peer reporting a higher version than its + own, the version-observer fires and calls + `replicator.PullFrom(peerID, addr)`. +4. `PullFrom` does a `GetClusterCfg` RPC against that peer and feeds + the snapshot through `Replace`, which writes `cluster.yaml` + atomically and refreshes the on-disk hash so the manual-edit + watcher doesn't re-fire. +5. Within ~1 heartbeat C is byte-for-byte identical to the master. + +The same path catches a stale node up when the partition heals on the +minority side: the minority side cannot mutate, so when it rejoins it +strictly has the older version, and the pull fires. + +There is one corner case worth knowing about: the pull only fires when +`peer_version > local_version`. Two nodes at the same version with +different content would silently diverge — but the design forbids +that (only the master mutates, and the master is the only one bumping +the version) unless somebody hand-edits `cluster.yaml` and also +manually sets `version:`. Don't do that. + +## Why a check flips state + +The aggregator runs on the master only. Followers' probe results are +shipped to the master via the `ReportResult` RPC; the master's own +probe results are submitted directly. + +For each check, the aggregator keeps the latest result per node within +a freshness window (3× the check interval, minimum 30s). On each +incoming submission it counts OK vs not-OK across the fresh results: + +- 0 fresh reports → `unknown` +- more OK than not-OK → `up` +- more not-OK than OK → `down` +- tie → `up` (a tie at one report means one node says yes and one says + no; biasing toward `up` avoids false alerts when nodes disagree + transiently). + +A state flip is **not** committed immediately. Hysteresis requires the +candidate state to hold for **two consecutive aggregate evaluations** +before the state transition fires and the alert dispatcher is called. +Set in `internal/checks/aggregator.go` as the `HysteresisCount` +constant — change it there if you want a hair-trigger or a slower +alert. + +If the master changes, the new master starts the per-check state from +`unknown` and rebuilds it as fresh results arrive. The first few +seconds after a re-election can therefore show `unknown` even for +checks that were `up` a moment ago. + +## What `qu` does *not* do + +These omissions are intentional in v1 and useful to know up front: + +- **No persistent history.** Only the current aggregate state lives in + memory. There are no graphs, no SLA reports. Add a sidecar (Prometheus + exporter, SQLite logger) if you need them. +- **No automatic key rotation.** Re-init a node and re-trust if you + need to roll its identity. See [security.md](security.md). +- **No multi-tenant isolation.** One cluster = one set of checks = + one alert tree. 
- **No web UI.** Operator surface is `qu` (CLI), `qu tui`, and direct
  edits to `cluster.yaml`.
- **No automatic peer eviction on prolonged downtime.** A dead peer
  stays in `cluster.yaml` until an operator runs `qu node remove`,
  because that decision affects the quorum size and shouldn't happen
  silently.

diff --git a/docs/configuration.md b/docs/configuration.md
new file mode 100644
index 0000000..750635f
--- /dev/null
+++ b/docs/configuration.md
@@ -0,0 +1,273 @@
# Configuration

This page is the canonical reference for the on-disk files, the
environment variables, and every field that `qu` reads. It's
deliberately tedious — when something doesn't behave the way you
expect, this is where the answer lives.

## File layout

When running as **root** (the typical case under systemd):

```
/etc/quptime/
├── node.yaml        identity, never replicated
├── cluster.yaml     replicated state
├── trust.yaml       local fingerprint trust store
└── keys/
    ├── private.pem  RSA private key (0600)
    ├── public.pem   RSA public key
    └── cert.pem     self-signed X.509 cert

/var/run/quptime/quptime.sock   control socket (0600)
```

When running as a **non-root** user (the typical case for `go run` or a
desktop test):

```
~/.config/quptime/...                  same shape as /etc/quptime
$XDG_RUNTIME_DIR/quptime/quptime.sock  control socket
```

Override the data directory with `QUPTIME_DIR=/some/path qu serve`.
Override the socket path with `QUPTIME_SOCKET=/run/foo.sock`.

## Environment variables

| Variable          | Purpose                                                                                                                     |
| ----------------- | --------------------------------------------------------------------------------------------------------------------------- |
| `QUPTIME_DIR`     | Data directory. Defaults to `/etc/quptime` (root) or `$XDG_CONFIG_HOME/quptime`.                                            |
| `QUPTIME_SOCKET`  | Path to the CLI ↔ daemon unix socket. Defaults to `/var/run/quptime/quptime.sock` (root) or `$XDG_RUNTIME_DIR/quptime/…`.   |
| `XDG_CONFIG_HOME` | Honored when running as non-root and `QUPTIME_DIR` is unset.                                                                 |
| `XDG_RUNTIME_DIR` | Honored when running as non-root and `QUPTIME_SOCKET` is unset.                                                              |

The daemon does not read any other environment variables. SMTP, Discord,
and HTTP probe targets are configured exclusively in `cluster.yaml`.

## `node.yaml` — local identity

Never replicated. One file per host. Generated by `qu init`.

```yaml
node_id: 7f3a5b9e-...              # UUIDv4, immutable after init
bind_addr: 0.0.0.0                 # listen address for :9901
bind_port: 9901                    # listen port
advertise: alpha.example.com:9901  # how peers reach us; may differ from bind
cluster_secret: 4hZqK8vT9...       # base64; required to Join, never replicated
```

### Field reference

- `node_id` — UUIDv4 generated at `qu init`. Used by every peer to
  refer to this node across IP changes and restarts. Do not edit.
- `bind_addr` — Address the daemon listens on. `0.0.0.0` is the
  default. Set it to the overlay address if you only want to expose
  the daemon through an overlay (Tailscale, WireGuard) — see
  [deployment/tailscale.md](deployment/tailscale.md).
- `bind_port` — Defaults to `9901`. Change here if 9901 is taken; the
  cluster does not require port-uniformity, peers just need to know
  what to dial via the `advertise` field.
- `advertise` — Host:port other nodes use to reach this one. Must be
  routable from every peer. Falls back to `bind_addr:bind_port` if
  unset, which is rarely what you want behind NAT.
- `cluster_secret` — Pre-shared base64 string. Required on every
  `Join` RPC; constant-time comparison on the receiver.
  Generate on the first node, distribute out-of-band, keep out of
  version control.

### How `qu init` populates this file

```sh
qu init \
  --advertise alpha.example.com:9901 \
  --bind 0.0.0.0 \
  --port 9901 \
  --secret '<secret>'
```

Idempotent in one direction only: if `node.yaml` exists, `qu init`
refuses to overwrite. To re-init, delete the data directory entirely.

## `cluster.yaml` — replicated state

This is the file that every node converges on. The master is the only
one allowed to bump `version`; followers `Replace` it whole each time
they receive a higher-versioned snapshot.

```yaml
version: 12
updated_at: 2026-05-15T14:01:00Z
updated_by: 7f3a5b9e-...
peers:
  - node_id: 7f3a5b9e-...
    advertise: alpha.example.com:9901
    fingerprint: SHA256:abcd...
    cert_pem: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
checks:
  - id: 0006a1...
    name: homepage
    type: http
    target: https://example.com
    interval: 30s
    timeout: 10s
    expect_status: 200
    alert_ids: [oncall]
    suppress_alert_ids: []
alerts:
  - id: f001ab...
    name: oncall
    type: discord
    default: true
    discord_webhook: https://discord.com/api/webhooks/...
    body_template: |
      :rotating_light: {{.Check.Name}} is {{.Verb}}
```

### Top-level fields

| Field        | Owner    | Notes                                                                   |
| ------------ | -------- | ------------------------------------------------------------------------ |
| `version`    | master   | Monotonic. Followers reject snapshots whose version is ≤ their local.    |
| `updated_at` | master   | UTC RFC3339. Cosmetic — humans use it, no logic depends on it.           |
| `updated_by` | master   | NodeID of the committing master.                                          |
| `peers`      | editable | Cluster members. Edits go through `add_peer` / `remove_peer` mutations.   |
| `checks`     | editable | Monitored targets.                                                        |
| `alerts`     | editable | Notifier destinations.                                                    |

### `peers[]`

```yaml
- node_id: 7f3a5b9e-...   # immutable, the peer's own UUID
  advertise: host:port    # how anyone dials this peer
  fingerprint: SHA256:... # SPKI fingerprint of the peer's cert
  cert_pem: |             # full PEM so other peers can mTLS without a separate invite
    -----BEGIN CERTIFICATE-----
    ...
```

The `cert_pem` field is what enables N-node clusters without N×(N-1)
manual invites: when peer X is added via the master, every other node
that receives the new `cluster.yaml` learns X's cert at the same time
and adds it to the local trust store. See
`internal/daemon/daemon.go:syncTrustFromCluster`.

### `checks[]`

```yaml
- id: 0006a1...           # UUIDv4, generated when the check is created
  name: homepage          # human-friendly, must be unique within cluster
  type: http              # http | tcp | icmp
  target: https://example.com
  interval: 30s           # Go duration syntax: 5s, 1m30s, 2h
  timeout: 10s            # default 10s
  expect_status: 200      # http only; 0 = accept anything < 400
  body_match: "OK"        # http only; substring match on response body
  alert_ids: [oncall]     # alerts attached explicitly
  suppress_alert_ids: []  # opt out of specific default alerts
```

Defaults:

- `interval`: 30s
- `timeout`: 10s
- `expect_status`: 0 → any status below 400 is accepted (matching the
  field comment above); otherwise the configured status must match
  exactly.

ICMP checks default to **unprivileged UDP-mode pings** so the daemon
does not need root. For raw ICMP, grant the capability — see
[deployment/systemd.md](deployment/systemd.md).

### `alerts[]`

Two notifier kinds, distinguished by `type`:

```yaml
# Discord
- id: f001ab...
  name: oncall
  type: discord
  default: true       # attach to every check automatically
  discord_webhook: https://...
  body_template: |    # optional Go text/template override
    {{.Check.Name}} is {{.Verb}}

# SMTP
- id: f002cd...
  name: ops
  type: smtp
  smtp_host: smtp.example.com
  smtp_port: 587
  smtp_user: mailbot
  smtp_password: '...'
  smtp_from: monitor@example.com
  smtp_to: [ops@example.com]
  smtp_starttls: true
  subject_template: '[{{.Verb}}] {{.Check.Name}}'
  body_template: |
    Check {{.Check.Name}} ({{.Check.Target}}) is now {{.Verb}}.
```

If `default: true`, the alert fires for every check unless the check
lists the alert's ID or name in `suppress_alert_ids`. Otherwise the
alert only fires for checks that name it in `alert_ids`.

Templates are Go `text/template`. The full variable list is in the
top-level README under "Custom alert messages" — `qu alert add smtp
--help` and `qu alert add discord --help` print the same table.

### Suppression precedence

For each check, the dispatcher computes the effective alert list as:

```
( explicit alert_ids ∪ alerts with default=true ) \ suppress_alert_ids
```

de-duplicated by alert ID. So a check can both opt in to specific
alerts and opt out of specific defaults.

## `trust.yaml` — local trust store

A flat list of fingerprints this node accepts. One entry per peer,
populated by `qu node add` (or pulled in automatically when a peer's
cert arrives via the replicated `cluster.yaml`).

```yaml
entries:
  - node_id: 7f3a5b9e-...
    address: alpha.example.com:9901
    fingerprint: SHA256:...
    cert_pem: |
      -----BEGIN CERTIFICATE-----
      ...
```

Never edit this by hand. Use `qu trust list` and `qu trust remove`.

## Key material

`keys/private.pem` is the only secret on disk besides
`node.yaml.cluster_secret`. It's chmod 0600 by default; preserve that.
The public cert at `keys/cert.pem` is what gets fingerprinted and
shipped in `cluster.yaml.peers[].cert_pem`.

There is **no automatic key rotation**. Rolling a node's identity
means wiping its data directory, running `qu init` again, and
re-adding it from another node as a fresh peer.

## Tunables that don't live in YAML

A few values are compiled constants. Change them in source and rebuild
if you need different behaviour.

| Constant                                                        | Default | What it does                                                   |
| --------------------------------------------------------------- | ------- | --------------------------------------------------------------- |
| `quorum.DefaultHeartbeatInterval`                                | `1s`    | How often each node heartbeats every peer.                       |
| `quorum.DefaultDeadAfter`                                        | `4s`    | A peer is dead if no heartbeat is seen within this window.       |
| `checks.HysteresisCount`                                         | `2`     | Consecutive aggregate evaluations needed before a state flip.    |
| `checks.ReconcileInterval`                                       | `5s`    | How often the scheduler reconciles its workers vs `checks[]`.    |
| `daemon.manualEditPollInterval` (`internal/daemon/watcher.go`)   | `2s`    | How often the daemon hashes `cluster.yaml` for hand edits.       |

diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md
new file mode 100644
index 0000000..7f22607
--- /dev/null
+++ b/docs/deployment/docker.md
@@ -0,0 +1,198 @@
# Deployment: Docker / docker-compose

The published image is a 14 MB distroless static container with the
`qu` binary as the entrypoint. It runs as root by default so the
daemon can bind privileged ports and open ICMP sockets; override with
`--user` if your host doesn't need that.
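
If you do drop root, two things need attention: the data volume must be
writable by the UID you pick, and unprivileged ICMP needs the
`ping_group_range` sysctl on the host (see [systemd.md](systemd.md)).
A minimal sketch; the UID here is arbitrary, not something the image
defines:

```sh
# Hypothetical unprivileged run. UDP-mode pings require the host sysctl:
#   sysctl -w net.ipv4.ping_group_range="0 2147483647"
docker run --rm --user 65532:65532 \
  -v quptime-data:/etc/quptime \
  -p 9901:9901 \
  git.cer.sh/axodouble/quptime:v0.1.0 serve
```
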
## Image references

```
git.cer.sh/axodouble/quptime:master        # tip of main, multi-arch
git.cer.sh/axodouble/quptime:v0.1.0        # tagged release
git.cer.sh/axodouble/quptime:v0.1.0-amd64  # single-arch (if you must pin)
```

The image embeds `QUPTIME_DIR=/etc/quptime` and declares it a volume —
treat it as the only piece of state worth persisting.

## Single-node, single-container compose

For a development cluster or a single-node smoke test:

```yaml
# compose.yaml
services:
  quptime:
    image: git.cer.sh/axodouble/quptime:v0.1.0
    container_name: quptime
    restart: unless-stopped
    ports:
      - "9901:9901"
    volumes:
      - quptime-data:/etc/quptime
    # ICMP UDP-mode pings need a permissive sysctl on the host:
    #   sysctl net.ipv4.ping_group_range="0 2147483647"
    # Or grant CAP_NET_RAW (more accurate, raw ICMP).
    cap_add:
      - NET_RAW

volumes:
  quptime-data:
```

You must **`qu init` before the daemon will start**. With this compose
file:

```sh
docker compose run --rm quptime init --advertise <host>:9901
docker compose up -d
docker compose exec quptime qu status
```

`<host>` must be reachable from every other node — the loopback
address inside the container is useless to peers.

## Three-node compose on a single host

For local testing of the full quorum machinery without three machines:

```yaml
# compose.yaml
x-quptime: &quptime
  image: git.cer.sh/axodouble/quptime:v0.1.0
  restart: unless-stopped
  cap_add:
    - NET_RAW

services:
  alpha:
    <<: *quptime
    container_name: alpha
    ports: ["9901:9901"]
    volumes: ["alpha-data:/etc/quptime"]

  bravo:
    <<: *quptime
    container_name: bravo
    ports: ["9902:9901"]
    volumes: ["bravo-data:/etc/quptime"]

  charlie:
    <<: *quptime
    container_name: charlie
    ports: ["9903:9901"]
    volumes: ["charlie-data:/etc/quptime"]

volumes:
  alpha-data:
  bravo-data:
  charlie-data:
```

Bootstrap:

```sh
# First node: prints the secret to stdout.
docker compose run --rm alpha init --advertise alpha:9901
# Capture the secret from the printed output, or read it back from the
# volume on the host — the distroless image ships no shell or cat.
# <project> is your compose project name (usually the directory name).
SECRET=$(sudo grep cluster_secret \
  "$(docker volume inspect -f '{{ .Mountpoint }}' <project>_alpha-data)/node.yaml" \
  | awk '{print $2}')

docker compose run --rm bravo init --advertise bravo:9901 --secret "$SECRET"
docker compose run --rm charlie init --advertise charlie:9901 --secret "$SECRET"

docker compose up -d

# Invite from alpha. The hostnames resolve over the compose network.
docker compose exec alpha qu node add bravo:9901
sleep 3  # wait for heartbeats before the next add
docker compose exec alpha qu node add charlie:9901

docker compose exec alpha qu status
```

For a cluster on three separate hosts, replicate the compose file on
each box with different `advertise` addresses (the public hostname or
the overlay IP) and bootstrap the same way.

## Multi-host compose

The natural unit is one compose file per host, each running one
`qu` container. The minimum-viable file per host:

```yaml
# /etc/qu-stack/compose.yaml
services:
  quptime:
    image: git.cer.sh/axodouble/quptime:v0.1.0
    container_name: quptime
    restart: unless-stopped
    ports:
      - "9901:9901"
    volumes:
      - /srv/quptime/data:/etc/quptime
    cap_add:
      - NET_RAW
```

Persistence is a bind-mount under `/srv/quptime/data` so backups and
upgrades hit a known path. See [operations.md](../operations.md) for
the backup recipe.

Inter-host traffic on TCP/9901 must be reachable.
If the boxes don't share a private network, prefer the
[Tailscale recipe](tailscale.md) over exposing 9901 directly — see
[public-internet.md](public-internet.md) for the threat model if you
must expose it.

## Behind a reverse proxy

**Don't.** `qu` is mTLS-pinned at the application layer, so a TLS-
terminating proxy would force the daemon to trust whatever cert the
proxy presents — defeating fingerprint pinning. If you need a single
public address per node, use a Layer 4 TCP proxy (`nginx stream`,
HAProxy `mode tcp`, or a plain firewall NAT) that forwards bytes
without touching them.

## Image internals

Build locally if you want to inspect what you're running (`--load` can
only import a single platform into the local daemon, so pass one
`--platform` at a time if you cross-build):

```sh
docker buildx build \
  --build-arg VERSION=$(git describe --tags --always) \
  --file docker/Dockerfile \
  --tag quptime:dev \
  --load \
  .
```

The Dockerfile (see `docker/Dockerfile`) is two stages: a `golang:1.24-alpine`
builder that cross-compiles with `-trimpath -ldflags "-s -w"`, and a
`gcr.io/distroless/static-debian12` runtime. No shell, no package
manager, no SSH; you cannot `docker exec -it sh` into it. Use
`docker exec quptime qu ...` for everything.

## Healthcheck

The container exits non-zero if the daemon crashes, so the default
`restart: unless-stopped` policy is enough for liveness. A more
useful readiness check reuses the `qu` binary already in the image:

```yaml
healthcheck:
  test: ["CMD", "/usr/local/bin/qu", "status"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s
```

`qu status` exits 0 when the daemon socket is reachable and the
control RPC succeeds — it does **not** fail on quorum loss. That's
intentional: restarting a quorum-less node won't bring quorum back,
and a healthcheck that flaps a follower in and out of `unhealthy`
state every time the master is briefly unreachable is worse than no
check. If you want a stricter readiness signal, pipe `qu status`
through `grep -q 'quorum true'`.

diff --git a/docs/deployment/public-internet.md b/docs/deployment/public-internet.md
new file mode 100644
index 0000000..a7fd80d
--- /dev/null
+++ b/docs/deployment/public-internet.md
@@ -0,0 +1,180 @@
# Deployment: public-internet exposure

If your nodes do not share a private network and you can't put an
overlay between them (see [tailscale.md](tailscale.md)), this is the
recipe for exposing TCP/9901 directly to the open internet without
losing sleep.

The short version: `qu` is designed for this — every inbound call is
mTLS-pinned at the application layer and gated by the cluster secret
— but defence in depth is cheap and you should take it.

## Threat model in one paragraph

Anyone on the internet can establish a TLS connection to `:9901`
because the daemon must accept handshakes from currently-untrusted
peers (otherwise no node could ever join). The RPC dispatcher then
rejects every method except `Join` for callers whose fingerprint
isn't in `trust.yaml`. `Join` itself is gated by the **cluster
secret**, compared in constant time. So the realistic attack surface
is:

1. The TLS 1.3 stack accepting handshakes from arbitrary peers
   (demonstrated below).
2. The `Join` handler's secret check and downstream cert ingestion.
3. The blast radius of a leaked cluster secret (an attacker who has
   it can enrol themselves as a peer and propose mutations, which is
   game over).
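
Surface (1) is easy to observe from any host. The sketch below uses
`openssl` as a stand-in for an untrusted client; the throwaway cert is
purely illustrative, since the TLS layer accepts any client cert and
everything beyond `Join` is refused one layer up:

```sh
# Mint a throwaway identity and handshake with a node. Expect the TLS
# session to establish; expect every RPC except Join to be refused.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=probe" \
  -keyout /tmp/probe.key -out /tmp/probe.crt
openssl s_client -connect alpha.example.com:9901 -tls1_3 \
  -cert /tmp/probe.crt -key /tmp/probe.key </dev/null
```
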
+ +What can't trivially happen: + +- A random attacker observing or modifying cluster traffic — TLS 1.3 + with fingerprint pinning sees to that. +- A random attacker calling any method other than `Join` — the RPC + dispatcher refuses. + +What you should still do: + +- Treat `node.yaml.cluster_secret` like an SSH host key. Out-of-band + distribution only. Never in git, never in CI logs, never in chat. +- Rate-limit and IP-allowlist where you can. The `Join` handler does + not currently rate-limit at the application layer, so a determined + attacker could try secrets at TLS-handshake rate. +- Run on a non-default port if your operations workflow allows it. + Doesn't add security, but reduces background internet noise in the + logs and makes IDS / WAF rules cleaner. + +## Firewall + +### nftables (recommended) + +A drop-in `/etc/nftables.d/quptime.nft`: + +```nft +table inet filter { + set quptime_peers { + type ipv4_addr + elements = { 198.51.100.10, 198.51.100.11, 198.51.100.12 } + } + + chain quptime_input { + # Drop everything that didn't come from a known peer. + ip saddr @quptime_peers tcp dport 9901 accept + tcp dport 9901 log prefix "quptime-drop: " level info drop + } + + chain input { + type filter hook input priority 0; policy drop; + ct state established,related accept + iif lo accept + jump quptime_input + # ... your other rules + } +} +``` + +The allowlist is the highest-ROI mitigation by far — if you maintain +fixed IPs for your monitor nodes, use this and move on. + +### ufw + +```sh +sudo ufw allow from 198.51.100.10 to any port 9901 proto tcp +sudo ufw allow from 198.51.100.11 to any port 9901 proto tcp +sudo ufw allow from 198.51.100.12 to any port 9901 proto tcp +``` + +### Dynamic peer IPs + +If peer IPs aren't fixed (e.g., one node is on a home connection with +a rotating address), you have three options ranked by preference: + +1. Use an overlay instead — see [tailscale.md](tailscale.md). This is + the right answer. +2. DNS-based allowlisting (`ipset`-from-DNS or a small reconciler that + re-resolves an allowlist hostname every minute). Beware: a + compromised DNS resolver becomes a compromise of the allowlist. +3. Drop the allowlist and rely solely on the cluster secret + mTLS. + This is what `qu` is designed to survive; just be sure the secret + actually has the entropy `qu init` generated for it (32 random + bytes, base64-encoded). + +## Rate-limiting failed handshakes + +`qu` does not currently rate-limit `Join` attempts at the application +layer. You can do it at the firewall, which catches both connect +floods and slow brute-force: + +```nft +table inet filter { + chain quptime_input { + tcp dport 9901 ct state new \ + meter quptime_ratemeter { ip saddr limit rate over 10/second } \ + log prefix "quptime-rate: " drop + tcp dport 9901 accept + } +} +``` + +Or `fail2ban` with a tiny custom filter that watches `journalctl -u +quptime` for repeated `peer rejected join` lines: + +```ini +# /etc/fail2ban/filter.d/quptime.conf +[Definition] +failregex = ^.*quptime:.*peer rejected join.*from .*$ +``` + +```ini +# /etc/fail2ban/jail.d/quptime.local +[quptime] +enabled = true +filter = quptime +backend = systemd +journalmatch = _SYSTEMD_UNIT=quptime.service +maxretry = 3 +findtime = 600 +bantime = 86400 +``` + +Note: the daemon doesn't currently log the *peer address* on rejected +joins. The log filter above is illustrative; check what your version +actually emits before relying on it. 
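
`fail2ban-regex` can do that dry-run against the live journal; zero
matches means the `failregex` above doesn't correspond to what this
build actually logs:

```sh
# Test the filter against the journal before enabling the jail.
fail2ban-regex systemd-journal /etc/fail2ban/filter.d/quptime.conf
```
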
+ +## Secret hygiene + +The single most important thing on a public-internet deployment: + +- **Generate the secret on the first node.** `qu init` with no + `--secret` produces 32 random bytes from `crypto/rand`, base64- + encoded. Don't replace that with something memorable. +- **Transport out of band.** Paste it into your secret manager + immediately; share via 1Password / Vault / encrypted email. +- **Rotate if anyone with access has left.** Rotation isn't a CLI + command; do it the brute-force way: `qu init` a fresh cluster on + new ports, re-add every check via `cluster.yaml` export, swap DNS. +- **One secret per cluster.** Do not reuse the secret across staging + and prod, or across customers if you run several clusters. + +## Non-default ports + +```sh +# Each node, in node.yaml — or pass --port on init. +qu init --advertise alpha.example.com:51234 --port 51234 +``` + +Open the corresponding firewall rule, restart the daemon. The +cluster doesn't require uniform ports across nodes; each peer's +`advertise` field tells everyone else what to dial. + +## What you should monitor on a public deployment + +- `term` from `qu status` — if it's ticking up frequently the master + is flapping, which probably means at least one peer's network is + unstable. Could be benign, could be a probe attempt. +- The firewall drop counter on the `quptime-drop` rule above. +- The number of TLS handshakes on `:9901`. A spike in handshakes that + don't progress to a successful RPC is the signature of a brute-force + on the cluster secret. + +For the operational side — backups, upgrades, recovery — see +[operations.md](../operations.md). diff --git a/docs/deployment/systemd.md b/docs/deployment/systemd.md new file mode 100644 index 0000000..f08a466 --- /dev/null +++ b/docs/deployment/systemd.md @@ -0,0 +1,250 @@ +# Deployment: systemd on bare metal / VM + +The canonical way to run `qu` on a Linux host. Single static binary, +managed by systemd, with a hardened unit file. Most production users +should start here. + +## Audience and assumptions + +- You have root (or `sudo`) on the host. +- You have at least three hosts that can reach each other on TCP/9901. + (Three is the minimum for a useful quorum; fewer is fine for + development but a 2-node cluster offers no consensus protection.) +- The hosts have a way to authenticate each other — direct IP or a + resolvable hostname is fine. For overlay networks see + [tailscale.md](tailscale.md). + +## Install the binary + +See [installation.md](../installation.md). The official `install.sh` +script writes a *minimal* unit file that's fine for development. For +production replace it with the hardened version below. + +## Create a dedicated user + +Running as a dedicated unprivileged user is best practice, but ICMP +support adds a wrinkle — see the next section. + +```sh +sudo useradd --system --no-create-home --shell /usr/sbin/nologin quptime +sudo install -d -o quptime -g quptime -m 0750 /etc/quptime +sudo install -d -o quptime -g quptime -m 0750 /var/run/quptime +``` + +## ICMP capabilities + +ICMP probes have two implementations: + +1. **Unprivileged UDP pings** — Linux's `dgram` ICMP socket. Works on + any modern kernel without elevated privileges, but only if + `net.ipv4.ping_group_range` includes the daemon's GID. This is the + default in `qu`. +2. **Raw ICMP** — requires `CAP_NET_RAW`, more accurate latency + numbers and works for IPv6 from arbitrary kernels. + +The simplest path: stick with unprivileged pings and widen +`ping_group_range`. 
Set the sysctl so it persists across reboots:

```sh
# /etc/sysctl.d/10-quptime.conf
net.ipv4.ping_group_range = 0 2147483647
```

```sh
sudo sysctl --system
```

If you need raw ICMP instead, grant the capability on the binary:

```sh
sudo setcap cap_net_raw=+ep /usr/local/bin/qu
```

Note that the file capability is lost every time the `qu` binary is
replaced — bake the `setcap` call into your deploy script, or re-run
it after each package update.

## Hardened unit file

Drop this in `/etc/systemd/system/quptime.service`:

```ini
[Unit]
Description=QUptime distributed uptime monitor
Documentation=https://git.cer.sh/axodouble/quptime
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/qu serve
Restart=always
RestartSec=5s

User=quptime
Group=quptime

# Where state lives. RuntimeDirectory creates /var/run/quptime/ each
# boot owned by User:Group with mode 0750.
Environment=QUPTIME_DIR=/etc/quptime
RuntimeDirectory=quptime
RuntimeDirectoryMode=0750
ReadWritePaths=/etc/quptime /var/run/quptime

# Hardening. Comment out individual directives if a probe needs
# something we've revoked.
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
ProtectClock=true
ProtectHostname=true
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
LockPersonality=true
MemoryDenyWriteExecute=true

# Network access is required (we're a network monitor). Keep address
# families minimal — AF_NETLINK is needed for some libc lookups.
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6 AF_NETLINK

# If you need raw ICMP, *also* uncomment:
# AmbientCapabilities=CAP_NET_RAW
# CapabilityBoundingSet=CAP_NET_RAW
# Otherwise drop all capabilities:
CapabilityBoundingSet=

[Install]
WantedBy=multi-user.target
```

Reload systemd and enable:

```sh
sudo systemctl daemon-reload
sudo systemctl enable quptime.service
```

## Initialise the node

**Don't start the service yet** — `qu init` must run first, and it
must run as the `quptime` user so it creates files with the right
ownership.

On the **first** host (it will print a secret; copy it):

```sh
sudo -u quptime QUPTIME_DIR=/etc/quptime \
  qu init --advertise alpha.example.com:9901
```

On every **other** host (paste the secret):

```sh
sudo -u quptime QUPTIME_DIR=/etc/quptime \
  qu init --advertise bravo.example.com:9901 --secret '<secret>'

sudo -u quptime QUPTIME_DIR=/etc/quptime \
  qu init --advertise charlie.example.com:9901 --secret '<secret>'
```

## Open the firewall

`qu` needs TCP/9901 reachable between cluster members. Adjust to your
firewall:

```sh
# ufw
sudo ufw allow from <peer-ip> to any port 9901 proto tcp

# firewalld
sudo firewall-cmd --permanent --zone=internal \
  --add-rich-rule='rule family=ipv4 source address=<peer-ip> port port=9901 protocol=tcp accept'
sudo firewall-cmd --reload

# nftables (drop-in)
table inet filter {
  chain input {
    ip saddr { 10.0.0.10, 10.0.0.11, 10.0.0.12 } tcp dport 9901 accept
  }
}
```

For exposing 9901 to the open internet see
[public-internet.md](public-internet.md).
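
Before starting the daemons, it's worth confirming the port is
actually open between peers. A plain TCP connect from each host
catches a firewall mistake earlier than a failed `qu node add` would:

```sh
# From alpha; repeat in every direction. Success only proves the TCP
# path is open; the mTLS and trust checks happen later, at the RPC layer.
nc -vz bravo.example.com 9901
nc -vz charlie.example.com 9901
```
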
## Start the daemon

```sh
sudo systemctl start quptime
sudo systemctl status quptime
journalctl -u quptime -f
```

## Invite peers

From one node (typically `alpha`):

```sh
sudo -u quptime qu node add bravo.example.com:9901
# Pause a few seconds so heartbeats reach the new peer before the next add —
# otherwise the "needs ≥2 live to mutate" check rejects the second invite.
sudo -u quptime qu node add charlie.example.com:9901
```

`qu node add` prints each remote's fingerprint and asks for SSH-style
confirmation. Verify it matches what you see over an out-of-band
channel (the remote operator can show their fingerprint with
`sudo -u quptime qu status` or by reading `trust.yaml`).

## Verify

```sh
sudo -u quptime qu status
```

Expect to see all three peers `live=true` and one of them as
`master`.

## Log scraping

`journalctl -u quptime` is the canonical log stream. Notable lines:

| Pattern                                                        | Meaning                                                      |
| --------------------------------------------------------------- | -------------------------------------------------------------- |
| `listening on ... as node ...`                                   | Daemon up.                                                      |
| `manual-edit: cluster.yaml changed externally — replicating…`    | An operator edited `cluster.yaml` directly.                     |
| `manual-edit: parse cluster.yaml: ...`                           | Invalid YAML on disk; the operator must fix and re-save.        |
| `report to master ...: <error>`                                  | A follower couldn't ship a probe result to the master.          |
| `replicate: pull from ...: <error>`                              | A follower couldn't pull a higher-version config snapshot.      |

## Sample reload / restart drill

After editing the unit file:

```sh
sudo systemctl daemon-reload
sudo systemctl restart quptime
```

After editing `cluster.yaml` by hand:

```sh
sudoedit /etc/quptime/cluster.yaml
# No restart needed — the watcher picks it up within 2s and pushes to master.
```

After upgrading the binary:

```sh
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu  # if you use raw ICMP
sudo systemctl restart quptime
```

Doing rolling upgrades? See [operations.md](../operations.md).

diff --git a/docs/deployment/tailscale.md b/docs/deployment/tailscale.md
new file mode 100644
index 0000000..1b6ae43
--- /dev/null
+++ b/docs/deployment/tailscale.md
@@ -0,0 +1,181 @@
# Deployment: Tailscale / WireGuard overlay

When your nodes live in different networks — different VPS providers,
different physical sites, a mix of home and cloud — exposing TCP/9901
to the open internet is a poor idea. An overlay network gives every
node a stable private IP regardless of NAT, and `qu` only needs to
listen on that overlay address.

This page focuses on Tailscale because the repo ships an example
compose for it, but everything generalises to WireGuard, Nebula, or a
self-hosted Headscale.

## The big idea

```
+--- host A (VPS, no public ICMP) ----+
| tailscale ←→ overlay ip 100.64.1.1  |
| qu listening on 100.64.1.1:9901     |
+-------------------------------------+
          │ mTLS over overlay
          ▼
+--- host B (homelab behind NAT) -----+
| tailscale ←→ overlay ip 100.64.1.2  |
| qu listening on 100.64.1.2:9901     |
+-------------------------------------+
```

`bind_addr` is set to the tailscale IP, the host's public interface
has no port 9901 open, and the cluster secret + mTLS handshake gate
the link inside the tunnel.

## Compose recipe

The repo ships [`docker/docker-compose-tailscale.yml`](../../docker/docker-compose-tailscale.yml).
The relevant trick is `network_mode: "service:tailscale"` — the
`quptime` container shares the network namespace of the `tailscale`
sidecar so it sees the tailnet as its own interface.

```yaml
services:
  tailscale:
    image: tailscale/tailscale:latest
    container_name: tailscale
    cap_add: [NET_ADMIN]
    environment:
      - TS_AUTHKEY=${TAILSCALE_AUTHKEY}  # provision via .env
      - TS_HOSTNAME=quptime-${HOST}      # name visible in admin
    volumes:
      - /dev/net/tun:/dev/net/tun
      - tailscale:/var/lib/tailscale
    restart: unless-stopped

  quptime:
    image: git.cer.sh/axodouble/quptime:v0.1.0
    container_name: quptime
    volumes:
      - quptime:/etc/quptime
    network_mode: "service:tailscale"
    depends_on: [tailscale]
    cap_add: [NET_RAW]
    # No restart directive yet — needs `qu init` first.

volumes:
  tailscale:
  quptime:
```

### One-time bootstrap

Each host runs the same script with different `HOST` and `TAILSCALE_AUTHKEY`:

```sh
# .env
HOST=alpha
TAILSCALE_AUTHKEY=tskey-auth-xxxxxxxx
```

Start Tailscale alone first so it gets an IP:

```sh
docker compose up -d tailscale
sleep 5
TSIP=$(docker compose exec tailscale tailscale ip -4)
echo "this node's tailnet IP: $TSIP"
```

On the **first** host, init without `--secret`:

```sh
docker compose run --rm quptime init --advertise "$TSIP:9901"
# Grab the printed secret; pipe through your password manager.
```

On every **other** host, paste the secret:

```sh
docker compose run --rm quptime init \
  --advertise "$TSIP:9901" \
  --secret "$CLUSTER_SECRET"
```

Then bring up `qu` on every node and invite from the first:

```sh
# Each host
docker compose up -d quptime

# From alpha
docker compose exec quptime qu node add 100.64.1.2:9901
sleep 3
docker compose exec quptime qu node add 100.64.1.3:9901

docker compose exec quptime qu status
```

## Tailscale ACLs

Belt and braces — even though mTLS pins identities, lock down the
tailnet itself so only the `qu` nodes can reach each other's :9901.
In the Tailscale admin console:

```jsonc
{
  "tagOwners": { "tag:qu-node": ["group:ops"] },
  "acls": [
    {
      "action": "accept",
      "src": ["tag:qu-node"],
      "dst": ["tag:qu-node:9901"]
    }
    // ...your other rules
  ]
}
```

Then tag every `qu` node in its auth key:

```yaml
environment:
  - TS_AUTHKEY=${TAILSCALE_AUTHKEY}?ephemeral=false&tags=tag:qu-node
```

## WireGuard / Nebula / Headscale equivalents

The recipe generalises:

1. Provision the overlay interface on each host with a stable
   private IP (the tunnel's own address).
2. `qu init --advertise <overlay-ip>:9901`.
3. Set `bind_addr: <overlay-ip>` in `node.yaml` so the daemon does
   **not** also listen on the public interface.
4. Open `:9901` only on the overlay interface in your firewall — for
   nftables that's something like `iifname "wg0" tcp dport 9901
   accept`.

The cluster secret and mTLS fingerprints still apply; the overlay just
removes the open-internet attack surface.

## Why prefer overlay over public exposure

- Single failure domain at the network layer: an attacker who finds an
  exploit in your overlay client (rare; Tailscale and WireGuard are
  small surfaces) still hits the application-layer pinning before any
  cluster-level operation.
- The cluster secret can be lower-entropy when it's already
  unreachable from outside. (You should still treat it as a real
  secret; "defence in depth" only works if every layer is real.)
- ICMP probes from a homelab to a target on the public internet are
  trivial through NAT, but ICMP *into* a homelab usually isn't.
  Running `qu` on a tailnet means peers can heartbeat each other
  regardless of NAT direction.

## Trade-offs

- One more thing to monitor. If your tailnet is down, your monitor is
  down. Counter-measure: run *another* tiny `qu` cluster (or a single
  node) on the public internet that watches the overlay's coordinator
  health.
- Probe latency includes the overlay's hop. Tailscale's WireGuard path
  is fast (<1 ms LAN, single-digit ms WAN) so this rarely matters, but
  if you're alerting on tight latency thresholds, account for it.

diff --git a/docs/installation.md b/docs/installation.md
new file mode 100644
index 0000000..71ac850
--- /dev/null
+++ b/docs/installation.md
@@ -0,0 +1,104 @@
# Installation

`qu` ships as a single static Linux binary. Pick whichever method
matches how you manage software on the host.

> Choosing a deployment recipe instead? Jump to
> [systemd](deployment/systemd.md), [Docker](deployment/docker.md),
> [Tailscale](deployment/tailscale.md), or
> [public-internet](deployment/public-internet.md).

## Pre-built binary (recommended)

Releases are published to the [Gitea releases
page](https://git.cer.sh/axodouble/quptime/releases) with a
`SHA256SUMS` file. Two architectures are built: `linux-amd64` and
`linux-arm64`.

```sh
# Always pin to a tag — `latest` resolves on the server side.
TAG=v0.1.0
ARCH=amd64  # or arm64

# Keep the release filename so it matches its SHA256SUMS entry.
curl -fSLO \
  "https://git.cer.sh/axodouble/quptime/releases/download/${TAG}/qu-${TAG}-linux-${ARCH}"
curl -fSLO \
  "https://git.cer.sh/axodouble/quptime/releases/download/${TAG}/SHA256SUMS"

# Verify before installing.
sha256sum --check --ignore-missing SHA256SUMS

install -m 0755 "qu-${TAG}-linux-${ARCH}" /usr/local/bin/qu
```

## One-line install script

The repo ships an `install.sh` that handles the download, checksum,
shell-completion installation, and a default systemd unit file. Run it
under `sudo` so it can write to `/usr/local/bin` and
`/etc/systemd/system`.

```sh
curl -fsSL https://git.cer.sh/Axodouble/QUptime/raw/branch/master/install.sh | sudo bash
```

What it does:

1. Looks up the latest release via the Gitea API.
2. Downloads the binary to `/usr/local/bin/qu`.
3. Installs bash / zsh / fish completion if a target directory exists.
4. Writes `/etc/systemd/system/qu-serve.service` and enables it (but
   does **not** start it — you need to run `qu init` first).

The unit it writes is minimal. For a production unit with hardening,
see the [systemd deployment guide](deployment/systemd.md).

## Build from source

Requires Go 1.24.2 or newer.

```sh
git clone https://git.cer.sh/axodouble/quptime.git
cd quptime
go build -ldflags "-X main.version=$(git describe --tags --always)" -o qu ./cmd/qu

./qu --version
```

Static binary, no cgo. `CGO_ENABLED=0` is the default on a clean Go
install; if you've enabled cgo globally, set it explicitly:

```sh
CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" -o qu ./cmd/qu
```

## Docker image

A multi-arch (`amd64` + `arm64`) image is published to the Gitea
registry on every tag and every push to `master`:

```
git.cer.sh/axodouble/quptime:master  # tip of main
git.cer.sh/axodouble/quptime:v0.1.0  # tagged release
```

See the [Docker deployment guide](deployment/docker.md) for compose
files and volume layout.
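
Since the image's entrypoint is the `qu` binary itself, a quick pull
and version check confirms the tag is what you expect:

```sh
docker pull git.cer.sh/axodouble/quptime:v0.1.0
docker run --rm git.cer.sh/axodouble/quptime:v0.1.0 --version
```
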
## Verifying the install

```sh
qu --version
qu --help
```

If completions installed, `qu <TAB>` will list subcommands. After
`qu init` you can run `qu status` to confirm the daemon is reachable
over its control socket.

## Next steps

- [Configure the node and the cluster](configuration.md).
- Pick a deployment recipe under [docs/deployment/](deployment/).
- Walk through the [architecture](architecture.md) so the operational
  guarantees are clear before you commit to a topology.

diff --git a/docs/operations.md b/docs/operations.md
new file mode 100644
index 0000000..185c4db
--- /dev/null
+++ b/docs/operations.md
@@ -0,0 +1,225 @@
# Operations

Day-2 tasks: keeping `qu` healthy, upgrading without dropping checks,
backing up state, recovering from failures. Pair this with
[troubleshooting.md](troubleshooting.md) for "the cluster is on fire,
what now" specifics.

## Upgrades

### Rolling upgrade (zero alert loss)

`qu` is built to tolerate one node being absent at a time as long as
quorum still holds. The simple recipe for a 3-node cluster:

```sh
# On each node in turn:
sudo systemctl stop quptime
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu  # if you use raw ICMP
sudo systemctl start quptime

# Wait for the node to rejoin before moving on:
sudo -u quptime qu status  # should show quorum true, all peers live
```

The first node you upgrade may briefly be a follower with a *higher*
binary version than the master. That's fine as long as no on-disk
format changes; the wire protocol and `cluster.yaml` schema are
stable within a minor version, so minor / patch upgrades freely
interleave.

For major-version upgrades that change the on-disk format, the release
notes will spell out the migration. As of v0 there have been none.

### Downgrades

A node that downgrades to an older binary will refuse to start if
`cluster.yaml` contains fields the older version doesn't know. To
roll back across a schema change, either:

- Take the cluster offline and downgrade all nodes simultaneously.
- Restore a `cluster.yaml` from before the schema change on every node
  before starting the downgraded binary.

Within a single minor version, downgrade is symmetrical with upgrade.

### What can go wrong

- **Restarting two nodes at once in a 3-node cluster** loses quorum.
  No mutations succeed, no alerts fire. Quorum returns the moment
  the second node is back.
- **A node that has been offline for a long time** comes back with a
  stale `cluster.yaml`. It will pull the master's higher version
  within ~1 heartbeat. Don't pre-emptively delete its `cluster.yaml`
  — let the catch-up path handle it.

## Backups

Three files matter, in descending order of "pain if lost":

| File               | Why back it up                                                     |
| ------------------ | -------------------------------------------------------------------- |
| `node.yaml`        | Holds the cluster secret. Lose it and the node can't rejoin.          |
| `keys/private.pem` | Lose it and you must `qu init` a fresh identity and re-trust.         |
| `cluster.yaml`     | Resyncs from any other live peer, so per-node backup is optional.     |

### Per-host backup

```sh
#!/bin/sh
# /etc/cron.daily/quptime-backup
set -eu
dst=/var/backups/quptime/$(date +%Y%m%d)
mkdir -p "$dst"
cp -a /etc/quptime/node.yaml "$dst/"
cp -a /etc/quptime/keys "$dst/keys"
cp -a /etc/quptime/cluster.yaml "$dst/cluster.yaml"
chmod -R go-rwx "$dst"
```

### Cluster-wide backup

The cluster state (`peers`, `checks`, `alerts`) is identical across
every node. Back up one healthy node's `cluster.yaml` and you have
the canonical copy. To restore:

```sh
# Stop the daemon.
sudo systemctl stop quptime

# Drop in the backup. Reset the version to 0 so the running cluster's
# higher version supersedes whatever you're holding — otherwise this
# node will broadcast a stale snapshot and confuse everyone.
sudo cp backup-cluster.yaml /etc/quptime/cluster.yaml
sudo sed -i 's/^version:.*/version: 0/' /etc/quptime/cluster.yaml

sudo systemctl start quptime
# Within seconds the version-observer pulls the live version from a peer.
```

If you're restoring **the entire cluster** (every node lost), the
"reset version to 0" trick doesn't apply — there's no peer with a
higher version. Pick the highest-version backup, restore that file
across every node verbatim, and start the daemons. The cluster will
elect a master and continue.

## Replacing a dead node

A node has died permanently. You want to add a fresh box with the
same role.

1. On a surviving node, evict the dead one:

   ```sh
   sudo -u quptime qu node remove <node-id>
   ```

   This drops it from `cluster.yaml` and removes its trust entry. The
   live set's size shrinks by one — verify quorum still holds.

2. On the new host, install `qu` and `qu init` against the existing
   cluster secret:

   ```sh
   sudo -u quptime qu init \
     --advertise delta.example.com:9901 \
     --secret '<secret>'
   sudo systemctl start quptime
   ```

3. From a surviving node, invite the new one:

   ```sh
   sudo -u quptime qu node add delta.example.com:9901
   ```

The dead node's checks and alerts are unaffected — they live in the
replicated `cluster.yaml`, not the dead node's identity.

## Recovering from lost quorum

You've lost more than half the cluster simultaneously. The remaining
nodes refuse to mutate (correct behaviour: they have no way to know
whether the missing nodes are dead or partitioned).

Options:

- **Bring the missing nodes back.** Always the right first move if it's
  possible. The cluster recovers automatically once enough nodes are
  live.
- **Shrink the cluster.** If you've genuinely lost the missing nodes
  permanently and can't bring them back, you need to manually edit
  `cluster.yaml` on every surviving node to remove the dead peers,
  then restart. Be very deliberate:

  ```sh
  # On each surviving node:
  sudo systemctl stop quptime
  sudoedit /etc/quptime/cluster.yaml  # delete the dead peers[] entries,
                                      # bump version to something higher
  sudo systemctl start quptime
  ```

  Make sure every surviving node has identical `cluster.yaml` content
  before restarting any of them. If they don't, you'll get conflicting
  views of who's in the cluster and elections will flap.

- **Start over.** For small clusters this is often faster than the
  manual surgery above: `rm -rf /etc/quptime` everywhere, then
  bootstrap from scratch. You'll lose your checks and alerts unless
  you saved a copy of `cluster.yaml` elsewhere.

## Monitoring `qu` itself

`qu` watches your services. Who watches `qu`?

### From within the cluster

`qu status` is the single source of truth.
The fields to watch:

| Field        | Healthy                | Suspicious                                           |
| ------------ | ---------------------- | ---------------------------------------------------- |
| `quorum`     | `true`                 | `false` — no mutations, no alerts.                   |
| `master`     | a NodeID               | `(none — ...)` — quorum lost or election in flight.  |
| `term`       | slow growth            | rapid growth → master flapping, network unstable.    |
| `config ver` | identical across nodes | divergence → a node is stuck pulling.                |

A simple cron sentinel on each node:

```sh
# cron has no line continuations; keep the whole entry on one line.
*/5 * * * * /usr/local/bin/qu status >/dev/null 2>&1 || curl -fsSL -X POST -d "qu down on $(hostname)" https://alert.example.com/oncall
```

### From outside the cluster

`qu` does not currently expose a Prometheus / OpenMetrics endpoint.
The recommended pattern is to run a *separate*, tiny monitoring path
that doesn't depend on `qu`. Even verifying that a TLS handshake
completes on each node's `:9901` (with `curl -k` or
`openssl s_client`) catches process death. Note that the handshake
can still succeed while the daemon is wedged, so this detects a dead
process, not a hung one.

To produce structured metrics, write a sidecar that parses `qu status`
output and exports counters. The CLI emits stable, machine-greppable
output specifically so this is straightforward.

## Operational checklist before you go to bed

After standing up a new cluster, work through:

- [ ] All nodes show `quorum true` in `qu status`.
- [ ] All nodes show identical `config ver`.
- [ ] All nodes show the same `master`.
- [ ] `journalctl -u quptime --since "10 min ago"` has no
      `propose to master:` or `replicate: pull from:` errors.
- [ ] `qu alert test <name>` reaches your inbox / Discord channel for
      every configured alert.
- [ ] At least one check points at an intentionally bogus target that
      you flip back and forth, verifying the full state-transition →
      dispatch path end-to-end.
- [ ] Backups of `node.yaml` + `keys/` + `cluster.yaml` are landing in
      your backup destination.
- [ ] The firewall allow-list (if any) lists every peer's IP.
- [ ] You've stored the cluster secret somewhere that survives the
      first operator leaving.
diff --git a/docs/security.md b/docs/security.md
new file mode 100644
index 0000000..6399bd3
--- /dev/null
+++ b/docs/security.md
@@ -0,0 +1,153 @@
# Security

The trust model in one page. Read this before deciding where to put
`qu` and who can talk to it.

## What `qu` is trying to defend against

- **Eavesdropping on cluster traffic.** Defended: TLS 1.3 only,
  fingerprint-pinned per peer.
- **MITM on the cluster's inter-node link.** Defended: TLS 1.3 with
  out-of-band fingerprint verification at `qu node add`.
- **A random internet host enrolling itself as a peer.** Defended:
  pre-shared cluster secret on every `Join`.
- **A compromised peer issuing forged cluster-config mutations.** Not
  defended. A peer trusted enough to be in `cluster.yaml.peers` can
  propose mutations through the master. Treat membership as a
  privilege.
- **A compromised peer becoming master.** Election is deterministic on
  the smallest live `NodeID`, so a compromised peer can become master
  if its `NodeID` sorts first. The master can rewrite `cluster.yaml`
  arbitrarily. This is the worst-case blast radius from one compromised
  node.
- **DoS by handshake flood.** Not directly defended at the application
  layer. The TLS stack accepts anyone's handshake; rate-limiting
  belongs at the firewall — see
  [public-internet.md](deployment/public-internet.md) and the sketch
  below.
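A minimal nftables sketch of that rate limit (the `inet filter input`
table and chain names are assumptions; fold the rule into your real
ruleset):

```sh
# Drop new connections to the cluster port beyond a modest rate;
# established peer connections are unaffected.
nft add rule inet filter input tcp dport 9901 ct state new limit rate over 10/minute counter drop
```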
+ +## The three secrets on disk + +| Secret | What it is | Loss impact | +| -------------------------- | ----------------------------------------- | -------------------------------------------- | +| `keys/private.pem` | RSA private key, this node's identity. | Anyone with it can impersonate this node. | +| `node.yaml.cluster_secret` | Pre-shared base64 string. | Anyone with it can `Join` the cluster. | +| `trust.yaml.entries[].cert_pem` | Other peers' public certs (not secrets, but they enable mTLS). | Loss only forces re-trust. | + +The first two are real secrets and live under `0600` permissions in +the data directory. Back them up; never commit them; never paste them +in chat. + +## TLS handshake step by step + +For every inter-node call: + +1. Caller dials peer on its `advertise` address. +2. TLS 1.3 handshake. Both sides present their self-signed leaf cert. +3. The caller's `VerifyPeerCertificate` (set in + `internal/transport/tls.go`) computes the SPKI fingerprint of the + server's cert and compares it against `trust.yaml`. If the caller + knows which `NodeID` it expected, a strict verifier ensures the + fingerprint matches *that specific* entry — not just any trusted + peer. +4. The server's TLS layer accepts any client cert (`RequireAnyClientCert`, + `InsecureSkipVerify: true`) because trust is enforced one layer up. +5. The RPC dispatcher reads the client's cert, computes its + fingerprint, and looks it up in the server's `trust.yaml`. If no + entry exists, only the `Join` method is permitted. +6. `Join` performs a constant-time comparison of the inbound + `ClusterSecret` against `node.yaml.cluster_secret`. Mismatch → + refusal. + +So: + +- An adversary who gets your **public** cert can't impersonate you. +- An adversary who gets your **fingerprint** can't impersonate you. +- An adversary who gets your **private key** *can* impersonate you to + any peer that trusts your fingerprint. + +## The TOFU step + +`qu node add ` runs a one-shot insecure dial against the +target (the only place `InsecureBootstrapConfig` is used in the +codebase, see `internal/transport/tls.go:91`). It fetches the +remote's cert, prints the fingerprint, and asks for confirmation. + +This is **identical** to SSH's first-connection prompt. The operator +must verify the fingerprint out of band — by running `qu status` on +the remote side, or by reading `keys/cert.pem` directly, or via a +known-good distribution channel. + +If you skip verification, you trust the network at that moment. If +the network was MITM'd at exactly that moment, you trust the +attacker. After the prompt, the cert is pinned and the window closes. + +## Cluster secret rotation + +There is no built-in command to rotate the cluster secret. The hard +part isn't generating a new one — it's distributing it consistently +across every node. The pragmatic recipe: + +1. Generate a new secret on one node and copy it to every other node. +2. Update `node.yaml.cluster_secret` on every node (manual edit). +3. Restart each daemon one at a time, verifying quorum returns + between restarts. + +Rotation only protects future `Join` calls, not anything else. If you +suspect the old secret has been seen by an adversary, also assume any +peer that was added during the leaked window is compromised, and +re-init those peers from scratch. 
+ +## Identity rotation + +To roll a node's RSA keypair (e.g., the private key was on a laptop +that got stolen): + +```sh +# On the compromised node: +sudo systemctl stop quptime +sudo rm -rf /etc/quptime +sudo -u quptime qu init \ + --advertise this-host.example.com:9901 \ + --secret '' +sudo systemctl start quptime + +# On a surviving healthy node: +sudo -u quptime qu node remove # evict the old identity +sudo -u quptime qu node add this-host.example.com:9901 +``` + +The new `node_id` is a fresh UUID; the old one is gone for good. Any +historical references to it (e.g., the `updated_by` field on past +versions of `cluster.yaml`) are cosmetic. + +## What the local control socket protects + +`$XDG_RUNTIME_DIR/quptime/quptime.sock` (or `/var/run/quptime/...`) is +the channel the CLI uses to talk to the local daemon. It's `0600` +permissioned and authenticated solely by filesystem ACLs — no TLS, no +secrets in the protocol. + +Anyone who can `read+write` the socket can: + +- Propose cluster mutations (will be relayed to the master). +- Read full cluster state including `cluster.yaml`. +- Trigger test alerts. + +So: don't put the daemon's user in a group that other unprivileged +users share. The default systemd setup with a dedicated `quptime` +user gets this right. + +## Hardening checklist + +- [ ] Dedicated `quptime` system user. +- [ ] Data directory owned by that user, mode 0750. +- [ ] `keys/private.pem` mode 0600. +- [ ] `node.yaml` mode 0600. +- [ ] systemd unit uses `ProtectSystem=strict`, `NoNewPrivileges=true`, + and the rest of the hardening directives in + [systemd.md](deployment/systemd.md). +- [ ] If `:9901` is internet-reachable, firewall allow-list to peer + IPs or use an overlay — see [public-internet.md](deployment/public-internet.md) + and [tailscale.md](deployment/tailscale.md). +- [ ] Cluster secret generated by `qu init` (not chosen by a human), + stored in your secret manager. +- [ ] Backups of `keys/` and `node.yaml` are encrypted at rest. diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..c0e6e6d --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,199 @@ +# Troubleshooting + +The cluster is misbehaving. This page is organised by symptom. Each +entry pairs the user-visible signal with the log line(s) you'll see +in `journalctl -u quptime` and the fix. + +## `qu status` shows `quorum false` + +**What it means.** Fewer than ⌈N/2⌉+1 peers are live. + +**Diagnose.** Look at the PEERS table. The `LIVE` column tells you +which peers this node has stopped hearing from. + +- If only this node is "live" and everyone else is not → this node is + network-isolated. Test: `nc -zv `. Fix: network / + firewall. +- If multiple nodes show false → more than one peer is down. Look at + the other peers' status outputs to triangulate. +- If everyone is live but `quorum false` still → check + `cluster.yaml.peers` length vs. live count; you may have phantom + peer entries left over from a removed-but-not-evicted node. Fix: + `qu node remove ` from any live node. + +## `qu status` shows `master (none — ...)` + +**What it means.** Either no quorum (see above) or election is in +flight. The latter clears within ~1 heartbeat. + +If `term` is incrementing rapidly (`watch qu status`), the master is +flapping. Causes: + +- The currently-elected master is unreachable from some peers but + reachable from others, partial-partition style. Look for log lines + on the suspected master about peers it can't reach. 
+- Heartbeat timeouts (default 4s) are too tight for your inter-node + link. Rebuild with a higher `DefaultDeadAfter` if you need it. + +## A check is stuck in `unknown` + +**What it means.** The aggregator has no fresh reports for that check. + +Possible causes: + +- No node is actually running the probe yet. Probes start ~`interval/10` + after `qu serve` boots and reconcile every 5s. Wait 10s and + re-check. +- Nodes are submitting results but they're stale (older than 3× + interval). Probably means probes are timing out without reporting. +- This is a follower's view; the aggregator runs on the master only. + Check `qu status` on the master to see the canonical view. + +## Alerts not firing + +Walk this list in order; one of them will catch it: + +1. **Is there quorum?** Aggregator runs on master only. No master → + no transitions → no alerts. +2. **Is the alert attached to the check?** `qu status` shows the + effective alert list per check. Empty → no alert. Confirm with + `qu alert list` that the alert exists and (if relying on default + attachment) has `default: true`. +3. **Is the alert suppressed on this check?** Check + `suppress_alert_ids` in `cluster.yaml`. +4. **Test the alert path directly:** + + ```sh + sudo -u quptime qu alert test + ``` + + This bypasses the aggregator and renders a synthetic transition. + If `alert test` doesn't deliver, the problem is the notifier + config or the template — see below. If `alert test` works but real + transitions don't, the aggregator isn't observing the transition. +5. **Has the check actually transitioned?** Aggregator commits a flip + only after **two consecutive** evaluations agree. A bouncing + target may never satisfy the hysteresis. Lower the check interval + or increase reliability of the target. + +## Discord webhook returns 4xx + +The dispatcher logs the HTTP body. Common causes: + +- Webhook revoked / channel deleted → 404. Re-issue and update + `discord_webhook`. +- Body too large → 400. Long templates that pull `Snapshot.Detail` + with multi-line errors can blow past Discord's 2000-char limit. + Shorten the template or trim the variable. +- Rate-limited → 429. Reduce alert frequency or stop suppressing + hysteresis. + +## SMTP refuses the message + +Check the daemon log for `smtp:` lines. Most common: + +- `530 5.7.0 Must issue a STARTTLS command first` → set + `smtp_starttls: true` on the alert. +- `535 Authentication failed` → wrong `smtp_user` / `smtp_password`. +- Connection refused / timeout → firewall between `qu` and the SMTP + relay. Verify with `openssl s_client -starttls smtp -connect host:587`. + +## Manual edit to `cluster.yaml` was ignored + +Symptoms: you edited the file, saved, nothing happened. + +Look for one of these log lines: + +- `manual-edit: parse cluster.yaml: — ignoring` → YAML is + invalid. The daemon pins the bad hash and waits for the next valid + save. Run the file through `yq` or `python -c "import yaml,sys; + yaml.safe_load(open(sys.argv[1]))" cluster.yaml` to diagnose. +- `manual-edit: cluster.yaml changed externally — replicating via + master` followed by `manual-edit: forward to master: no quorum` → + cluster has no quorum, can't accept the edit. Restore quorum first. +- *No log line at all* → the on-disk content didn't change in a way + that matters. The watcher compares only `peers`, `checks`, and + `alerts`; whitespace and comment edits are accepted silently. + +## Two nodes disagree on `config ver` + +The follower with the lower version should pull within one heartbeat. 
If after ~5 seconds the gap persists:

- The follower might not have an `advertise` address for the
  higher-versioned peer. The version observer needs one to pull.
  Check `cluster.yaml.peers` for both sides' `advertise` fields.
- The follower's TLS handshake against the higher-versioned peer is
  failing — look for `replicate: pull from <peer>: <error>` lines.
- The peer with the higher version is announcing it correctly but the
  follower is rejecting the `ApplyClusterCfg` broadcasts because of
  its own decode error — look for transport-layer errors instead.

## "needs ≥2 live to mutate" rejection during bootstrap

You ran two `qu node add` commands back-to-back and the second one
failed. The first add doesn't take effect until the new peer sends
its first heartbeat (≤ 1 second); during that window the cluster has
size 2 and quorum size 2, so a *second* peer add from a 1-live
cluster looks like "mutate without quorum."

Fix: pause ~3 seconds between adds. The README and the systemd guide
both call this out.

## Daemon refuses to start

```
load node.yaml: open ...: no such file or directory
```

Run `qu init` before `qu serve`. The daemon does not auto-init —
silently generating identities and secrets would be a worse failure
mode than crashing.

```
node.yaml has empty node_id — run `qu init` first
```

Same fix.

```
listen tcp :9901: bind: address already in use
```

Another process owns the port. Run `ss -tlnp | grep :9901` to find it.

```
load private key: ...
```

Permissions on `keys/private.pem` are wrong — they should be 0600,
owned by the daemon user. Fix and restart.

## Probes look much slower than expected

ICMP first:

- The default ICMP mode is **unprivileged UDP-mode ping**, not raw
  ICMP. UDP ping is a bit slower and may hit different kernel paths.
  For reference latency, grant `CAP_NET_RAW`.

HTTP / TCP:

- `interval` and `timeout` are the only knobs in `cluster.yaml`. The
  check runs synchronously per worker; if your target takes 9 s to
  respond and your timeout is 10 s, the next probe doesn't start
  until those ~9 s have elapsed. Increase concurrency by adding more
  fast-interval checks against the same target, not by lowering the
  timeout (which will just produce false `down` results).

## I want to start over

```sh
sudo systemctl stop quptime
sudo rm -rf /etc/quptime
sudo -u quptime qu init --advertise <host:port>
sudo systemctl start quptime
```

The data directory is the only state. Wipe it and you're back to a
fresh node.
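After the re-init, a quick sanity pass. These are the same commands
used throughout this guide; what "healthy" output looks like depends
on your version:

```sh
systemctl status quptime --no-pager              # daemon is running
sudo -u quptime qu status                        # fresh node_id, single-node cluster
sudo journalctl -u quptime --since "2 min ago"   # no load/listen errors
```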