@@ -27,6 +27,23 @@ definition, can't tell you when it's the one that's down. `qu` solves
both: run it on a few cheap hosts in different networks and they vote
on truth. If one of them loses its uplink, the rest keep alerting.

## Documentation

This README is the quick-start. For production use, the longer guides
live under [`docs/`](docs/README.md):

| If you want to… | Read |
| ----------------------------------------------------- | ------------------------------------------------------------------ |
| understand the consensus / replication model | [docs/architecture.md](docs/architecture.md) |
| reference every field in `node.yaml` / `cluster.yaml` | [docs/configuration.md](docs/configuration.md) |
| deploy on Linux with systemd hardening | [docs/deployment/systemd.md](docs/deployment/systemd.md) |
| deploy with Docker / docker-compose | [docs/deployment/docker.md](docs/deployment/docker.md) |
| deploy over Tailscale or WireGuard | [docs/deployment/tailscale.md](docs/deployment/tailscale.md) |
| expose `qu` on the open internet safely | [docs/deployment/public-internet.md](docs/deployment/public-internet.md) |
| upgrade, back up, or recover from failures | [docs/operations.md](docs/operations.md) |
| understand the trust model and rotate identities | [docs/security.md](docs/security.md) |
| diagnose a misbehaving cluster | [docs/troubleshooting.md](docs/troubleshooting.md) |

## Architecture
@@ -0,0 +1,53 @@
# QUptime documentation

Production-oriented documentation for `qu`, a small distributed uptime
monitor that votes on the health of HTTP/TCP/ICMP targets across a
cluster of cooperating nodes.

The top-level `README.md` is the marketing pitch and quick-start. The
pages here go deeper and are organised by what you're trying to do.

## Getting set up

- [Installation](installation.md) — pre-built binaries, building from
  source, verifying release artifacts, what the install script does.
- [Configuration](configuration.md) — `node.yaml`, `cluster.yaml`,
  `trust.yaml`, environment variables, file layout, defaults.

## Running it

- [Architecture](architecture.md) — how nodes form quorum, how a master
  is elected, how cluster state replicates, what happens during a
  partition, and exactly which guarantees the design gives you.
- [Operations](operations.md) — day-2 tasks: upgrades, backups,
  recovery from a lost node, recovery from a lost quorum, monitoring
  `qu` itself.
- [Security](security.md) — the mTLS / TOFU trust model, what the
  cluster secret protects, how to rotate keys, what to put on a public
  network and what not to.
- [Troubleshooting](troubleshooting.md) — common failure modes with
  the log lines you'll see and the fix.

## Deployment recipes

Pick the one that matches your environment. They share most of the
operational guidance — what differs is how `qu` is packaged and how
the inter-node link is secured at the network layer.

- [systemd on bare metal / VM](deployment/systemd.md) — single static
  binary, hardened unit file, `CAP_NET_RAW` for ICMP.
- [Docker / docker-compose](deployment/docker.md) — official image,
  single-node and multi-node compose files, persistent volumes.
- [Tailscale / WireGuard overlay](deployment/tailscale.md) — nodes in
  separate networks with no public ingress; cluster traffic stays on
  the tailnet.
- [Public-internet exposure](deployment/public-internet.md) — when
  you have no overlay and `:9901` is reachable from the open
  internet: firewalling, rate-limiting, secret hygiene.

## A note on stability

The wire protocol (`internal/transport`) and the on-disk format
(`cluster.yaml`, `node.yaml`, `trust.yaml`) are considered stable
within a minor version. Breaking changes will bump the major version
and ship with a migration note.
@@ -0,0 +1,196 @@
# Architecture

This page is the long-form companion to the diagram in the top-level
README. Read it if you need to reason about partitions, recovery,
upgrade ordering, or the consistency guarantees of `qu`.

## Components

A running `qu serve` is one process containing five long-lived
goroutines plus the listeners:

| Component | Package | Role |
| --------------- | ------------------------ | ------------------------------------------------------------------------ |
| Transport | `internal/transport` | mTLS listener + dialer, length-prefixed JSON-RPC framing. |
| Quorum manager | `internal/quorum` | 1 Hz heartbeats, liveness tracking, deterministic master election. |
| Replicator | `internal/replicate` | Master-routed mutations, version-gated broadcast and pull. |
| Scheduler | `internal/checks` | One goroutine per check; runs HTTP/TCP/ICMP probes on each node. |
| Aggregator | `internal/checks` | Master-only. Folds per-node probe results into a cluster-wide verdict. |
| Alert dispatch | `internal/alerts` | Master-only. Renders templates and ships SMTP / Discord notifications. |
| Control socket | `internal/daemon` | Local-only unix socket; the CLI and TUI talk to the daemon through it. |

Every node runs every component. Whether the master-only ones actually
*do* anything depends on the result of master election.

## Trust and transport

Inter-node traffic is TLS 1.3 with mutual authentication. There is **no
central CA**. Each node generates a self-signed RSA cert at `qu init`
and the SPKI fingerprint of that cert is what other nodes pin against.

Two layers gate access:

1. **TLS layer** accepts any client cert. This avoids a chicken-and-egg
   problem during bootstrap — a brand-new node has no entry in anyone's
   trust store yet, so a strict TLS check would refuse the very first
   handshake.
2. **RPC dispatcher** rejects every method except `Join` for callers
   whose presented fingerprint is not in `trust.yaml`. So an untrusted
   peer can knock on the door but cannot ask questions.

`Join` itself is gated by the **cluster secret** — a pre-shared base64
string generated at `qu init` on the first node. Without it, an
attacker who can reach `:9901` cannot enrol themselves into the
cluster.
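
Roughly, the two gates compose like this. A minimal sketch with illustrative type and method names (`TrustStore`, `Dispatcher`, `Dispatch`); `qu`'s real dispatcher will differ in detail. The constant-time comparison mirrors what the configuration reference says about the `Join` secret check.

```go
package sketch

import (
	"crypto/subtle"
	"errors"
)

// TrustStore and Dispatcher are illustrative stand-ins, not qu's real types.
type TrustStore struct{ fingerprints map[string]bool }

func (t *TrustStore) Known(fp string) bool { return t.fingerprints[fp] }

type Dispatcher struct {
	trust         *TrustStore
	clusterSecret []byte
}

// Dispatch applies the two gates described above: unknown fingerprints may
// only call Join, and Join itself must present the pre-shared cluster secret.
func (d *Dispatcher) Dispatch(callerFP, method string, secret []byte) error {
	if !d.trust.Known(callerFP) && method != "Join" {
		return errors.New("untrusted peer: only Join is allowed")
	}
	if method == "Join" {
		// Constant-time comparison so a wrong guess leaks no timing signal.
		if subtle.ConstantTimeCompare(secret, d.clusterSecret) != 1 {
			return errors.New("join rejected: bad cluster secret")
		}
	}
	return nil // hand off to the real RPC handler here
}
```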

The local CLI talks to the daemon over a unix socket with `0600`
permissions; filesystem ACLs are the only authentication and no TLS is
used on that channel.

## The replicated state machine

`cluster.yaml` is the single replicated source of truth. It holds three
editable lists — `peers`, `checks`, `alerts` — plus three
server-controlled fields:

```yaml
version: 7              # monotonically increasing
updated_at: 2026-05-15T...
updated_by: <node-id>   # master that committed this version
peers: [...]
checks: [...]
alerts: [...]
```

### How mutations flow

1. The CLI (or the manual-edit watcher; see below) issues a mutation
   on the local daemon's control socket.
2. The daemon's replicator looks at the current quorum view:
   - If there is no quorum, the mutation fails loudly with
     `no quorum: refusing mutation`.
   - If this node is the master, apply locally and broadcast.
   - Otherwise, ship the mutation to the master via the
     `ProposeMutation` RPC and wait for the result.
3. The master holds the cluster lock, applies the mutation, bumps
   `version`, writes `cluster.yaml` atomically, and broadcasts the new
   snapshot to every peer via `ApplyClusterCfg`.
4. Each follower's `Replace` accepts the snapshot **only if**
   `incoming.Version > local.Version`. Older or equal versions are
   dropped silently.

The mutation kinds are enumerated in `internal/transport/messages.go`:
`add_check`, `remove_check`, `add_alert`, `remove_alert`, `add_peer`,
`remove_peer`, `replace_config`.
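
A compact sketch of the two rules that matter most here: the quorum-gated proposal from step 2 and the version gate from step 4. The `Snapshot` and `Replicator` types and signatures are illustrative, not the actual API of `internal/replicate`.

```go
package sketch

import (
	"errors"
	"sync"
)

// Snapshot stands in for the replicated cluster config.
type Snapshot struct {
	Version int
	// peers, checks, alerts ...
}

type Replicator struct {
	mu      sync.Mutex
	current Snapshot
}

// Propose sketches step 2: refuse without quorum, apply locally on the
// master, otherwise forward to the master (the ProposeMutation RPC).
func (r *Replicator) Propose(haveQuorum, isMaster bool, apply, forward func() error) error {
	if !haveQuorum {
		return errors.New("no quorum: refusing mutation")
	}
	if isMaster {
		return apply()
	}
	return forward()
}

// Replace is the follower-side gate from step 4: only strictly newer
// snapshots are accepted; older or equal versions are dropped silently.
func (r *Replicator) Replace(incoming Snapshot) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if incoming.Version <= r.current.Version {
		return false // stale or duplicate broadcast
	}
	r.current = incoming
	// the real daemon also writes cluster.yaml atomically here
	return true
}
```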

### Manual edits to `cluster.yaml`

Operators can `sudoedit /etc/quptime/cluster.yaml` on any node. Every
2 seconds the daemon hashes the file. When the on-disk hash diverges
from the last hash the daemon wrote, the new content is parsed and
forwarded to the master as a `replace_config` mutation. So a hand-edit
on a follower still ends up on the master, version-bumped, and
broadcast everywhere.

If the parse fails (invalid YAML), the daemon logs and pins the bad
hash so it doesn't loop. The operator's next valid save unblocks it.
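
A sketch of that polling loop under the assumptions above (a 2-second tick, SHA-256 of the file, a pinned hash for content that fails to apply); the function and parameter names are made up for illustration.

```go
package sketch

import (
	"crypto/sha256"
	"os"
	"time"
)

// Poll hashes cluster.yaml every two seconds and hands externally changed
// content to propose (a replace_config mutation in the real daemon).
func Poll(path string, lastWritten [32]byte, propose func([]byte) error) {
	var badHash [32]byte // pinned hash of a file that failed to parse
	for range time.Tick(2 * time.Second) {
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		h := sha256.Sum256(data)
		if h == lastWritten || h == badHash {
			continue // either we wrote it ourselves or it already failed
		}
		if err := propose(data); err != nil {
			badHash = h // pin the bad hash so an invalid edit doesn't loop
			continue
		}
		lastWritten = h
	}
}
```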

## Quorum and master election

Every node sends a heartbeat to every peer once per second. A peer is
**live** if a heartbeat (sent or received) was observed within the
last 4 seconds — comfortably more than three heartbeat intervals, so a
one-tick blip does not unseat the master.

**Quorum** is met when `len(live_peers) >= floor(N/2) + 1` where `N`
is the total peer count in `cluster.yaml`. Below quorum, the cluster
refuses every mutation; existing checks continue probing locally but no
state transitions are committed (the master is the only one who
aggregates, and there is no master).

**Master election** is deterministic with no negotiation step: among
the live members, the master is the one with the lexicographically
smallest `NodeID`. Every node that observes the same live set picks the
same master — so there is no split-brain window even during a partial
partition.
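
Both rules fit in a few lines. A sketch with illustrative names; the real constants and types live in `internal/quorum`.

```go
package sketch

import "sort"

// HasQuorum implements len(live_peers) >= floor(N/2) + 1.
func HasQuorum(liveCount, totalPeers int) bool {
	return liveCount >= totalPeers/2+1
}

// ElectMaster picks the lexicographically smallest live NodeID. Every node
// that observes the same live set computes the same answer, with no
// negotiation round.
func ElectMaster(live []string) (masterID string, ok bool) {
	if len(live) == 0 {
		return "", false
	}
	sort.Strings(live)
	return live[0], true
}
```

With three peers in `cluster.yaml`, quorum needs two of them live; with five, three.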

The `term` integer in `qu status` is bumped every time the elected
master changes (including transitions to and from "no master"). Use it
to spot flappy clusters.

## Catch-up when a node reconnects

This is the scenario most people ask about: node C is offline, the
master commits config version 7, node C comes back online. What
happens?

1. Node C's tick loop fires heartbeats every second regardless of its
   previous state. There is no backoff, no give-up.
2. Each heartbeat carries the sender's `Version`. Each response carries
   the responder's `Version`.
3. The first time C sees a peer reporting a higher version than its
   own, the version-observer fires and calls
   `replicator.PullFrom(peerID, addr)`.
4. `PullFrom` does a `GetClusterCfg` RPC against that peer and feeds
   the snapshot through `Replace`, which writes `cluster.yaml`
   atomically and refreshes the on-disk hash so the manual-edit
   watcher doesn't re-fire.
5. Within ~1 heartbeat C is byte-for-byte identical to the master.

The same path catches a stale node up when the partition heals on the
minority side: the minority side cannot mutate, so when it rejoins it
strictly has the older version, and the pull fires.

There is one corner case worth knowing about: the pull only fires when
`peer_version > local_version`. Two nodes at the same version with
different content would silently diverge — but the design forbids
that (only the master mutates, and the master is the only one bumping
the version) unless somebody hand-edits `cluster.yaml` and also
manually sets `version:`. Don't do that.

## Why a check flips state

The aggregator runs on the master only. Followers' probe results are
shipped to the master via the `ReportResult` RPC; the master's own
probe results are submitted directly.

For each check, the aggregator keeps the latest result per node within
a freshness window (3× the check interval, minimum 30s). On each
incoming submission it counts OK vs not-OK across the fresh results:

- 0 fresh reports → `unknown`
- more OK than not-OK → `up`
- more not-OK than OK → `down`
- tie → `up` (a 1-1 tie means one node says yes and one says
  no; biasing toward `up` avoids false alerts when nodes disagree
  transiently).

A state flip is **not** committed immediately. Hysteresis requires the
candidate state to hold for **two consecutive aggregate evaluations**
before the state transition fires and the alert dispatcher is called.
Set in `internal/checks/aggregator.go` as the `HysteresisCount`
constant — change it there if you want a hair-trigger or a slower
alert.
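
Putting the voting rules and the hysteresis together, a sketch of the aggregation step. The type names and the exact shape of the state are assumptions; the real logic is in `internal/checks/aggregator.go`.

```go
package sketch

import "time"

// Result is the freshest probe outcome the master keeps per node per check.
type Result struct {
	OK   bool
	Seen time.Time
}

// Aggregate folds the per-node results into a verdict using the rules above:
// unknown with no fresh reports, majority wins, ties lean up.
func Aggregate(perNode map[string]Result, interval time.Duration, now time.Time) string {
	window := 3 * interval
	if window < 30*time.Second {
		window = 30 * time.Second
	}
	ok, notOK := 0, 0
	for _, r := range perNode {
		if now.Sub(r.Seen) > window {
			continue // stale result, outside the freshness window
		}
		if r.OK {
			ok++
		} else {
			notOK++
		}
	}
	switch {
	case ok+notOK == 0:
		return "unknown"
	case notOK > ok:
		return "down"
	default:
		return "up" // covers both a clear majority of OKs and a tie
	}
}

// ShouldFlip: a candidate verdict must repeat for two consecutive
// evaluations (HysteresisCount in the real code) before a flip is committed.
func ShouldFlip(current, candidate string, streak int) bool {
	return candidate != current && streak >= 2
}
```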

If the master changes, the new master starts the per-check state from
`unknown` and rebuilds it as fresh results arrive. The first few
seconds after a re-election can therefore show `unknown` even for
checks that were `up` a moment ago.

## What `qu` does *not* do

These omissions are intentional in v1 and useful to know up front:

- **No persistent history.** Only the current aggregate state lives in
  memory. There are no graphs, no SLA reports. Add a sidecar (Prometheus
  exporter, SQLite logger) if you need them.
- **No automatic key rotation.** Re-init a node and re-trust if you
  need to roll its identity. See [security.md](security.md).
- **No multi-tenant isolation.** One cluster = one set of checks =
  one alert tree.
- **No web UI.** Operator surface is `qu` (CLI), `qu tui`, and direct
  edits to `cluster.yaml`.
- **No automatic peer eviction on prolonged downtime.** A dead peer
  stays in `cluster.yaml` until an operator runs `qu node remove`,
  because that decision affects the quorum size and shouldn't happen
  silently.
@@ -0,0 +1,273 @@
# Configuration

This page is the canonical reference for the on-disk files, the
environment variables, and every field that `qu` reads. It's
deliberately tedious — when something doesn't behave the way you
expect, this is where the answer lives.

## File layout

When running as **root** (the typical case under systemd):

```
/etc/quptime/
├── node.yaml        identity, never replicated
├── cluster.yaml     replicated state
├── trust.yaml       local fingerprint trust store
└── keys/
    ├── private.pem  RSA private key (0600)
    ├── public.pem   RSA public key
    └── cert.pem     self-signed X.509 cert

/var/run/quptime/quptime.sock   control socket (0600)
```

When running as a **non-root** user (the typical case for `go run` or a
desktop test):

```
~/.config/quptime/...                   same shape as /etc/quptime
$XDG_RUNTIME_DIR/quptime/quptime.sock   control socket
```

Override the data directory with `QUPTIME_DIR=/some/path qu serve`.
Override the socket path with `QUPTIME_SOCKET=/run/foo.sock`.

## Environment variables

| Variable | Purpose |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `QUPTIME_DIR` | Data directory. Defaults to `/etc/quptime` (root) or `$XDG_CONFIG_HOME/quptime`. |
| `QUPTIME_SOCKET` | Path to the CLI ↔ daemon unix socket. Defaults to `/var/run/quptime/quptime.sock` (root) or `$XDG_RUNTIME_DIR/quptime/…`. |
| `XDG_CONFIG_HOME` | Honored when running as non-root and `QUPTIME_DIR` is unset. |
| `XDG_RUNTIME_DIR` | Honored when running as non-root and `QUPTIME_SOCKET` is unset. |

The daemon does not read any other environment variables. SMTP, Discord,
and HTTP probe targets are configured exclusively in `cluster.yaml`.
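
The resolution order above can be summarised in a few lines of Go. This is a paraphrase of the documented defaults, not the actual helper in the codebase.

```go
package sketch

import (
	"os"
	"path/filepath"
)

// DataDir resolves the data directory the way the table above describes:
// explicit QUPTIME_DIR wins, root falls back to /etc/quptime, everyone else
// gets an XDG config path.
func DataDir() string {
	if dir := os.Getenv("QUPTIME_DIR"); dir != "" {
		return dir // explicit override always wins
	}
	if os.Geteuid() == 0 {
		return "/etc/quptime" // root: the systemd layout
	}
	base := os.Getenv("XDG_CONFIG_HOME")
	if base == "" {
		home, _ := os.UserHomeDir()
		base = filepath.Join(home, ".config")
	}
	return filepath.Join(base, "quptime")
}
```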

## `node.yaml` — local identity

Never replicated. One file per host. Generated by `qu init`.

```yaml
node_id: 7f3a5b9e-...               # UUIDv4, immutable after init
bind_addr: 0.0.0.0                  # listen address for :9901
bind_port: 9901                     # listen port
advertise: alpha.example.com:9901   # how peers reach us; may differ from bind
cluster_secret: 4hZqK8vT9...        # base64; required to Join, never replicated
```

### Field reference

- `node_id` — UUIDv4 generated at `qu init`. Used by every peer to
  refer to this node across IP changes and restarts. Do not edit.
- `bind_addr` — Address the daemon listens on. `0.0.0.0` is the
  default. Set to `127.0.0.1` if you only want to expose the daemon
  through an overlay (Tailscale, WireGuard) — see
  [deployment/tailscale.md](deployment/tailscale.md).
- `bind_port` — Defaults to `9901`. Change here if 9901 is taken; the
  cluster does not require port-uniformity, peers just need to know
  what to dial via the `advertise` field.
- `advertise` — Host:port other nodes use to reach this one. Must be
  routable from every peer. Falls back to `bind_addr:bind_port` if
  unset, which is rarely what you want behind NAT.
- `cluster_secret` — Pre-shared base64 string. Required on every
  `Join` RPC; constant-time comparison on the receiver. Generate on
  the first node, distribute out-of-band, keep out of version
  control.

### How `qu init` populates this file

```sh
qu init \
  --advertise alpha.example.com:9901 \
  --bind 0.0.0.0 \
  --port 9901 \
  --secret '<paste from first node, or omit on the first node>'
```

Idempotent in one direction only: if `node.yaml` exists, `qu init`
refuses to overwrite. To re-init, delete the data directory entirely.

## `cluster.yaml` — replicated state

This is the file that every node converges on. The master is the only
one allowed to bump `version`; followers `Replace` it whole each time
they receive a higher-versioned snapshot.

```yaml
version: 12
updated_at: 2026-05-15T14:01:00Z
updated_by: 7f3a5b9e-...
peers:
  - node_id: 7f3a5b9e-...
    advertise: alpha.example.com:9901
    fingerprint: SHA256:abcd...
    cert_pem: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
checks:
  - id: 0006a1...
    name: homepage
    type: http
    target: https://example.com
    interval: 30s
    timeout: 10s
    expect_status: 200
    alert_ids: [oncall]
    suppress_alert_ids: []
alerts:
  - id: f001ab...
    name: oncall
    type: discord
    default: true
    discord_webhook: https://discord.com/api/webhooks/...
    body_template: |
      :rotating_light: {{.Check.Name}} is {{.Verb}}
```

### Top-level fields

| Field | Owner | Notes |
| ------------ | -------- | ---------------------------------------------------------------------------------- |
| `version` | master | Monotonic. Followers reject snapshots whose version is ≤ their local. |
| `updated_at` | master | UTC RFC3339. Cosmetic — humans use it, no logic depends on it. |
| `updated_by` | master | NodeID of the committing master. |
| `peers` | editable | Cluster members. Edits go through `add_peer` / `remove_peer` mutations. |
| `checks` | editable | Monitored targets. |
| `alerts` | editable | Notifier destinations. |

### `peers[]`

```yaml
- node_id: 7f3a5b9e-...    # immutable, the peer's own UUID
  advertise: host:port     # how anyone dials this peer
  fingerprint: SHA256:...  # SPKI fingerprint of the peer's cert
  cert_pem: |              # full PEM so other peers can mTLS without a separate invite
    -----BEGIN CERTIFICATE-----
    ...
```

The `cert_pem` field is what enables N-node clusters without N×(N-1)
manual invites: when peer X is added via the master, every other node
that receives the new `cluster.yaml` learns X's cert at the same time
and adds it to the local trust store. See
`internal/daemon/daemon.go:syncTrustFromCluster`.

### `checks[]`

```yaml
- id: 0006a1...             # UUIDv4, generated when the check is created
  name: homepage            # human-friendly, must be unique within cluster
  type: http                # http | tcp | icmp
  target: https://example.com
  interval: 30s             # Go duration syntax: 5s, 1m30s, 2h
  timeout: 10s              # default 10s
  expect_status: 200        # http only; 0 = accept anything < 400
  body_match: "OK"          # http only; substring match on response body
  alert_ids: [oncall]       # alerts attached explicitly
  suppress_alert_ids: []    # opt out of specific default alerts
```

Defaults:

- `interval`: 30s
- `timeout`: 10s
- `expect_status`: 0 → any 2xx is OK; otherwise the configured status
  must match exactly.

ICMP checks default to **unprivileged UDP-mode pings** so the daemon
does not need root. For raw ICMP, grant the capability — see
[deployment/systemd.md](deployment/systemd.md).

### `alerts[]`

Two notifier kinds, distinguished by `type`:

```yaml
# Discord
- id: f001ab...
  name: oncall
  type: discord
  default: true             # attach to every check automatically
  discord_webhook: https://...
  body_template: |          # optional Go text/template override
    {{.Check.Name}} is {{.Verb}}

# SMTP
- id: f002cd...
  name: ops
  type: smtp
  smtp_host: smtp.example.com
  smtp_port: 587
  smtp_user: mailbot
  smtp_password: '...'
  smtp_from: monitor@example.com
  smtp_to: [ops@example.com]
  smtp_starttls: true
  subject_template: '[{{.Verb}}] {{.Check.Name}}'
  body_template: |
    Check {{.Check.Name}} ({{.Check.Target}}) is now {{.Verb}}.
```

If `default: true`, the alert fires for every check unless the check
lists the alert's ID or name in `suppress_alert_ids`. Otherwise the
alert only fires for checks that name it in `alert_ids`.

Templates are Go `text/template`. The full variable list is in the
top-level README under "Custom alert messages" — `qu alert add smtp
--help` and `qu alert add discord --help` print the same table.
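
If you want to preview a template outside the daemon, plain `text/template` is enough. The `Context` struct below is inferred from the variables shown in the examples (`{{.Check.Name}}`, `{{.Check.Target}}`, `{{.Verb}}`) and is not the daemon's real type; consult `qu alert add --help` for the authoritative field list.

```go
package sketch

import (
	"strings"
	"text/template"
)

// Context holds the fields the example templates reference.
type Context struct {
	Check struct {
		Name   string
		Target string
	}
	Verb string // "up" or "down"
}

// Render executes a body_template / subject_template string the same way
// any Go text/template is executed.
func Render(tmpl string, ctx Context) (string, error) {
	t, err := template.New("alert").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, ctx); err != nil {
		return "", err
	}
	return sb.String(), nil
}
```

For a context with `Check.Name` set to `homepage` and `Verb` set to `down`, `Render("{{.Check.Name}} is {{.Verb}}", ctx)` returns `homepage is down`.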

### Suppression precedence

For each check, the dispatcher computes the effective alert list as:

```
( explicit alert_ids ∪ alerts with default=true ) \ suppress_alert_ids
```

de-duplicated by alert ID. So a check can both opt in to specific
alerts and opt out of specific defaults.
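
The same rule as a small Go function, for readers who prefer code to set notation; the names are illustrative.

```go
package sketch

// EffectiveAlerts computes (explicit ∪ defaults) \ suppressed,
// de-duplicated by alert ID, exactly as described above.
func EffectiveAlerts(explicit, defaults, suppressed []string) []string {
	drop := make(map[string]bool, len(suppressed))
	for _, id := range suppressed {
		drop[id] = true
	}
	seen := make(map[string]bool)
	var out []string
	for _, id := range append(append([]string{}, explicit...), defaults...) {
		if drop[id] || seen[id] {
			continue
		}
		seen[id] = true
		out = append(out, id)
	}
	return out
}
```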

## `trust.yaml` — local trust store

A flat list of fingerprints this node accepts. One entry per peer,
populated by `qu node add` (or pulled in automatically when a peer's
cert arrives via the replicated `cluster.yaml`).

```yaml
entries:
  - node_id: 7f3a5b9e-...
    address: alpha.example.com:9901
    fingerprint: SHA256:...
    cert_pem: |
      -----BEGIN CERTIFICATE-----
      ...
```

Never edit this by hand. Use `qu trust list` and `qu trust remove`.

## Key material

`keys/private.pem` is the only secret on disk besides
`node.yaml.cluster_secret`. It's chmod 0600 by default; preserve that.
The public cert at `keys/cert.pem` is what gets fingerprinted and
shipped in `cluster.yaml.peers[].cert_pem`.
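
If you want to recompute a fingerprint yourself (for example, to check `trust.yaml` against a cert someone sent you), the usual recipe is SHA-256 over the certificate's SubjectPublicKeyInfo. Whether `qu` encodes the digest as base64 (shown here) or hex is an assumption; compare the output against an entry in your own `trust.yaml` before relying on it.

```go
package sketch

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/pem"
	"errors"
	"fmt"
)

// SPKIFingerprint hashes the SubjectPublicKeyInfo of a PEM-encoded cert.
// The "SHA256:<base64>" output format is an assumption, not qu's guarantee.
func SPKIFingerprint(certPEM []byte) (string, error) {
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return "", errors.New("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	return fmt.Sprintf("SHA256:%s", base64.StdEncoding.EncodeToString(sum[:])), nil
}
```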

There is **no automatic key rotation**. Rolling a node's identity
means wiping its data directory, running `qu init` again, and
re-adding it from another node as a fresh peer.

## Tunables that don't live in YAML

A few values are compiled constants. Change them in source and rebuild
if you need different behaviour.

| Constant | Default | What it does |
| ----------------------------------------------------- | ------- | ------------------------------------------------------------- |
| `quorum.DefaultHeartbeatInterval` | `1s` | How often each node heartbeats every peer. |
| `quorum.DefaultDeadAfter` | `4s` | A peer is dead if no heartbeat is seen within this window. |
| `checks.HysteresisCount` | `2` | Consecutive aggregate evaluations needed before a state flip. |
| `checks.ReconcileInterval` | `5s` | How often the scheduler reconciles its workers vs `checks[]`. |
| `daemon.manualEditPollInterval` (`internal/daemon/watcher.go`) | `2s` | How often the daemon hashes `cluster.yaml` for hand edits. |
@@ -0,0 +1,198 @@
# Deployment: Docker / docker-compose

The published image is a 14 MB distroless static container with the
`qu` binary as the entrypoint. It runs as root by default so the
daemon can bind privileged ports and open ICMP sockets; override with
`--user` if your host doesn't need that.

## Image references

```
git.cer.sh/axodouble/quptime:master        # tip of main, multi-arch
git.cer.sh/axodouble/quptime:v0.1.0        # tagged release
git.cer.sh/axodouble/quptime:v0.1.0-amd64  # single-arch (if you must pin)
```

The image embeds `QUPTIME_DIR=/etc/quptime` and declares it a volume —
treat it as the only piece of state worth persisting.

## Single-node, single-container compose

For a development cluster or a single-node smoke test:

```yaml
# compose.yaml
services:
  quptime:
    image: git.cer.sh/axodouble/quptime:v0.1.0
    container_name: quptime
    restart: unless-stopped
    ports:
      - "9901:9901"
    volumes:
      - quptime-data:/etc/quptime
    # ICMP UDP-mode pings need a permissive sysctl on the host:
    #   sysctl net.ipv4.ping_group_range="0 2147483647"
    # Or grant CAP_NET_RAW (more accurate, raw ICMP).
    cap_add:
      - NET_RAW

volumes:
  quptime-data:
```

You must **`qu init` before the daemon will start**. With this compose
file:

```sh
docker compose run --rm quptime init --advertise <host-ip>:9901
docker compose up -d
docker compose exec quptime qu status
```

`<host-ip>` must be reachable from every other node — the loopback
address inside the container is useless to peers.

## Three-node compose on a single host

For local testing of the full quorum machinery without three machines:

```yaml
# compose.yaml
x-quptime: &quptime
  image: git.cer.sh/axodouble/quptime:v0.1.0
  restart: unless-stopped
  cap_add:
    - NET_RAW

services:
  alpha:
    <<: *quptime
    container_name: alpha
    ports: ["9901:9901"]
    volumes: ["alpha-data:/etc/quptime"]

  bravo:
    <<: *quptime
    container_name: bravo
    ports: ["9902:9901"]
    volumes: ["bravo-data:/etc/quptime"]

  charlie:
    <<: *quptime
    container_name: charlie
    ports: ["9903:9901"]
    volumes: ["charlie-data:/etc/quptime"]

volumes:
  alpha-data:
  bravo-data:
  charlie-data:
```

Bootstrap:

```sh
# First node: prints the secret to stdout.
docker compose run --rm alpha init --advertise alpha:9901
# Capture the secret (or read it back from alpha-data).
SECRET=$(docker compose exec alpha cat /etc/quptime/node.yaml | grep cluster_secret | awk '{print $2}')

docker compose run --rm bravo init --advertise bravo:9901 --secret "$SECRET"
docker compose run --rm charlie init --advertise charlie:9901 --secret "$SECRET"

docker compose up -d

# Invite from alpha. The hostnames resolve over the compose network.
docker compose exec alpha qu node add bravo:9901
sleep 3  # wait for heartbeats before the next add
docker compose exec alpha qu node add charlie:9901

docker compose exec alpha qu status
```

For a cluster on three separate hosts, replicate the compose file on
each box with different `advertise` addresses (the public hostname or
the overlay IP) and bootstrap the same way.

## Multi-host compose

The natural unit is one compose file per host, each running one
`qu` container. The minimum-viable file per host:

```yaml
# /etc/qu-stack/compose.yaml
services:
  quptime:
    image: git.cer.sh/axodouble/quptime:v0.1.0
    container_name: quptime
    restart: unless-stopped
    ports:
      - "9901:9901"
    volumes:
      - /srv/quptime/data:/etc/quptime
    cap_add:
      - NET_RAW
```

Persistence is a bind-mount under `/srv/quptime/data` so backups and
upgrades hit a known path. See [operations.md](../operations.md) for
the backup recipe.

Inter-host traffic on TCP/9901 must be reachable. If the boxes don't
share a private network, prefer the
[Tailscale recipe](tailscale.md) over exposing 9901 directly — see
[public-internet.md](public-internet.md) for the threat model if you
must expose it.

## Behind a reverse proxy

**Don't.** `qu` is mTLS-pinned at the application layer, so a TLS-
terminating proxy would force the daemon to trust whatever cert the
proxy presents — defeating fingerprint pinning. If you need a single
public address per node, use a Layer 4 TCP proxy (`nginx stream`,
HAProxy `mode tcp`, or a plain firewall NAT) that forwards bytes
without touching them.
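
For illustration only: a Layer 4 forwarder really is nothing more than copying bytes in both directions, so the TLS session still terminates at the `qu` daemon. Use `nginx stream` or HAProxy in production rather than this toy, and treat the addresses as placeholders.

```go
// Toy L4 forwarder: accepts on :9901 and shovels bytes to an upstream qu
// node without inspecting or re-terminating TLS.
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	listen, upstream := ":9901", "10.0.0.10:9901" // placeholder addresses
	l, err := net.Listen("tcp", listen)
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := l.Accept()
		if err != nil {
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			srv, err := net.Dial("tcp", upstream)
			if err != nil {
				return
			}
			defer srv.Close()
			go io.Copy(srv, c) // client -> qu, byte for byte
			io.Copy(c, srv)    // qu -> client; mTLS stays end-to-end
		}(client)
	}
}
```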

## Image internals

Build locally if you want to inspect what you're running:

```sh
docker buildx build \
  --build-arg VERSION=$(git describe --tags --always) \
  --platform linux/amd64,linux/arm64 \
  --file docker/Dockerfile \
  --tag quptime:dev \
  --load \
  .
```

The Dockerfile (see `docker/Dockerfile`) is two stages: a `golang:1.24-alpine`
builder that cross-compiles with `-trimpath -ldflags "-s -w"`, and a
`gcr.io/distroless/static-debian12` runtime. No shell, no package
manager, no SSH; you cannot `docker exec -it sh` into it. Use
`docker exec quptime qu ...` for everything.

## Healthcheck

The container exits non-zero if the daemon crashes, so the default
`restart: unless-stopped` policy is enough for liveness. A more
useful readiness check invokes the `qu` binary itself as the
healthcheck command:

```yaml
healthcheck:
  test: ["CMD", "/usr/local/bin/qu", "status"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s
```

`qu status` exits 0 when the daemon socket is reachable and the
control RPC succeeds — it does **not** fail on quorum loss. That's
intentional: restarting a quorum-less node won't bring quorum back,
and a healthcheck that flaps a follower in and out of `unhealthy`
state every time the master is briefly unreachable is worse than no
check. If you want a stricter readiness signal, pipe `qu status`
through `grep -q 'quorum true'`.
@@ -0,0 +1,180 @@
# Deployment: public-internet exposure

If your nodes do not share a private network and you can't put an
overlay between them (see [tailscale.md](tailscale.md)), this is the
recipe for exposing TCP/9901 directly to the open internet without
losing sleep.

The short version: `qu` is designed for this — every inbound call is
mTLS-pinned at the application layer and gated by the cluster secret
— but defence in depth is cheap and you should take it.

## Threat model in one paragraph

Anyone on the internet can establish a TLS connection to `:9901`
because the daemon must accept handshakes from currently-untrusted
peers (otherwise no node could ever join). The RPC dispatcher then
rejects every method except `Join` for callers whose fingerprint
isn't in `trust.yaml`. `Join` itself is gated by the **cluster
secret**, compared in constant time. So the realistic attack surface
is:

1. The TLS 1.3 stack accepting handshakes from arbitrary peers.
2. The `Join` handler's secret check and downstream cert ingestion.
3. The blast radius of a leaked cluster secret (an attacker who has
   it can enrol themselves as a peer and propose mutations, which is
   game over).

What can't trivially happen:

- A random attacker observing or modifying cluster traffic — TLS 1.3
  with fingerprint pinning sees to that.
- A random attacker calling any method other than `Join` — the RPC
  dispatcher refuses.

What you should still do:

- Treat `node.yaml.cluster_secret` like an SSH host key. Out-of-band
  distribution only. Never in git, never in CI logs, never in chat.
- Rate-limit and IP-allowlist where you can. The `Join` handler does
  not currently rate-limit at the application layer, so a determined
  attacker could try secrets at TLS-handshake rate.
- Run on a non-default port if your operations workflow allows it.
  Doesn't add security, but reduces background internet noise in the
  logs and makes IDS / WAF rules cleaner.

## Firewall

### nftables (recommended)

A drop-in `/etc/nftables.d/quptime.nft`:

```nft
table inet filter {
  set quptime_peers {
    type ipv4_addr
    elements = { 198.51.100.10, 198.51.100.11, 198.51.100.12 }
  }

  chain quptime_input {
    # Drop everything that didn't come from a known peer.
    ip saddr @quptime_peers tcp dport 9901 accept
    tcp dport 9901 log prefix "quptime-drop: " level info drop
  }

  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif lo accept
    jump quptime_input
    # ... your other rules
  }
}
```

The allowlist is the highest-ROI mitigation by far — if you maintain
fixed IPs for your monitor nodes, use this and move on.

### ufw

```sh
sudo ufw allow from 198.51.100.10 to any port 9901 proto tcp
sudo ufw allow from 198.51.100.11 to any port 9901 proto tcp
sudo ufw allow from 198.51.100.12 to any port 9901 proto tcp
```

### Dynamic peer IPs

If peer IPs aren't fixed (e.g., one node is on a home connection with
a rotating address), you have three options ranked by preference:

1. Use an overlay instead — see [tailscale.md](tailscale.md). This is
   the right answer.
2. DNS-based allowlisting (`ipset`-from-DNS or a small reconciler that
   re-resolves an allowlist hostname every minute). Beware: a
   compromised DNS resolver becomes a compromise of the allowlist.
3. Drop the allowlist and rely solely on the cluster secret + mTLS.
   This is what `qu` is designed to survive; just be sure the secret
   actually has the entropy `qu init` generated for it (32 random
   bytes, base64-encoded).

## Rate-limiting failed handshakes

`qu` does not currently rate-limit `Join` attempts at the application
layer. You can do it at the firewall, which catches both connect
floods and slow brute-force:

```nft
table inet filter {
  chain quptime_input {
    tcp dport 9901 ct state new \
      meter quptime_ratemeter { ip saddr limit rate over 10/second } \
      log prefix "quptime-rate: " drop
    tcp dport 9901 accept
  }
}
```

Or `fail2ban` with a tiny custom filter that watches `journalctl -u
quptime` for repeated `peer rejected join` lines:

```ini
# /etc/fail2ban/filter.d/quptime.conf
[Definition]
failregex = ^.*quptime:.*peer rejected join.*from <ADDR>.*$
```

```ini
# /etc/fail2ban/jail.d/quptime.local
[quptime]
enabled = true
filter = quptime
backend = systemd
journalmatch = _SYSTEMD_UNIT=quptime.service
maxretry = 3
findtime = 600
bantime = 86400
```

Note: the daemon doesn't currently log the *peer address* on rejected
joins. The log filter above is illustrative; check what your version
actually emits before relying on it.

## Secret hygiene

The single most important thing on a public-internet deployment:

- **Generate the secret on the first node.** `qu init` with no
  `--secret` produces 32 random bytes from `crypto/rand`,
  base64-encoded. Don't replace that with something memorable.
- **Transport out of band.** Paste it into your secret manager
  immediately; share via 1Password / Vault / encrypted email.
- **Rotate if anyone with access has left.** Rotation isn't a CLI
  command; do it the brute-force way: `qu init` a fresh cluster on
  new ports, re-add every check via `cluster.yaml` export, swap DNS.
- **One secret per cluster.** Do not reuse the secret across staging
  and prod, or across customers if you run several clusters.
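
If you ever need to mint a replacement secret by hand (say, for the rotate-by-rebuilding procedure above), the equivalent of what the docs describe `qu init` doing is a one-liner. Whether `qu` uses standard or URL-safe base64 is an assumption here, so prefer letting `qu init` generate the value when you can.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// Prints 32 bytes from crypto/rand, base64-encoded: the entropy the docs
// say a cluster secret should have.
func main() {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	fmt.Println(base64.StdEncoding.EncodeToString(buf))
}
```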

## Non-default ports

```sh
# Each node, in node.yaml — or pass --port on init.
qu init --advertise alpha.example.com:51234 --port 51234
```

Open the corresponding firewall rule, restart the daemon. The
cluster doesn't require uniform ports across nodes; each peer's
`advertise` field tells everyone else what to dial.

## What you should monitor on a public deployment

- `term` from `qu status` — if it's ticking up frequently the master
  is flapping, which probably means at least one peer's network is
  unstable. Could be benign, could be a probe attempt.
- The firewall drop counter on the `quptime-drop` rule above.
- The number of TLS handshakes on `:9901`. A spike in handshakes that
  don't progress to a successful RPC is the signature of a brute-force
  on the cluster secret.

For the operational side — backups, upgrades, recovery — see
[operations.md](../operations.md).
@@ -0,0 +1,250 @@
# Deployment: systemd on bare metal / VM

The canonical way to run `qu` on a Linux host. Single static binary,
managed by systemd, with a hardened unit file. Most production users
should start here.

## Audience and assumptions

- You have root (or `sudo`) on the host.
- You have at least three hosts that can reach each other on TCP/9901.
  (Three is the minimum for a useful quorum; fewer is fine for
  development but a 2-node cluster offers no consensus protection.)
- The hosts have a way to address each other — direct IP or a
  resolvable hostname is fine. For overlay networks see
  [tailscale.md](tailscale.md).

## Install the binary

See [installation.md](../installation.md). The official `install.sh`
script writes a *minimal* unit file that's fine for development. For
production replace it with the hardened version below.

## Create a dedicated user

Running as a dedicated unprivileged user is best practice, but ICMP
support adds a wrinkle — see the next section.

```sh
sudo useradd --system --no-create-home --shell /usr/sbin/nologin quptime
sudo install -d -o quptime -g quptime -m 0750 /etc/quptime
sudo install -d -o quptime -g quptime -m 0750 /var/run/quptime
```

## ICMP capabilities

ICMP probes have two implementations:

1. **Unprivileged UDP pings** — Linux's `dgram` ICMP socket. Works on
   any modern kernel without elevated privileges, but only if
   `net.ipv4.ping_group_range` includes the daemon's GID. This is the
   default in `qu`.
2. **Raw ICMP** — requires `CAP_NET_RAW`, gives more accurate latency
   numbers, and works for IPv6 on arbitrary kernels.
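
A quick way to see which of the two socket modes the current user and capabilities allow, using the external `golang.org/x/net/icmp` package; this is a generic probe, not `qu`'s own code.

```go
package main

import (
	"fmt"

	"golang.org/x/net/icmp"
)

// Tries both ICMP socket modes. "udp4" needs net.ipv4.ping_group_range to
// include this process's GID; "ip4:icmp" needs CAP_NET_RAW or root.
func main() {
	if c, err := icmp.ListenPacket("udp4", "0.0.0.0"); err == nil {
		c.Close()
		fmt.Println("unprivileged UDP-mode ICMP socket: OK")
	} else {
		fmt.Println("udp4:", err)
	}
	if c, err := icmp.ListenPacket("ip4:icmp", "0.0.0.0"); err == nil {
		c.Close()
		fmt.Println("raw ICMP socket: OK (CAP_NET_RAW present)")
	} else {
		fmt.Println("ip4:icmp:", err)
	}
}
```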

The simplest path: stick with unprivileged pings and widen
`ping_group_range`. Sysctl, persistent across reboots:

```sh
# /etc/sysctl.d/10-quptime.conf
net.ipv4.ping_group_range = 0 2147483647
```

```sh
sudo sysctl --system
```

If you need raw ICMP instead, grant the capability on the binary:

```sh
sudo setcap cap_net_raw=+ep /usr/local/bin/qu
```

Note that the capability is wiped by every `qu` upgrade (replacing the
binary drops file capabilities) — bake the `setcap` call into your
deploy script, or re-run it after each package update.

## Hardened unit file

Drop this in `/etc/systemd/system/quptime.service`:

```ini
[Unit]
Description=QUptime distributed uptime monitor
Documentation=https://git.cer.sh/axodouble/quptime
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/qu serve
Restart=always
RestartSec=5s

User=quptime
Group=quptime

# Where state lives. RuntimeDirectory creates /var/run/quptime/ each
# boot owned by User:Group with mode 0750.
Environment=QUPTIME_DIR=/etc/quptime
RuntimeDirectory=quptime
RuntimeDirectoryMode=0750
ReadWritePaths=/etc/quptime /var/run/quptime

# Hardening. Comment out individual directives if a probe needs
# something we've revoked.
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
ProtectClock=true
ProtectHostname=true
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
LockPersonality=true
MemoryDenyWriteExecute=true

# Network access is required (we're a network monitor). Keep address
# families minimal — AF_NETLINK is needed for some libc lookups.
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6 AF_NETLINK

# If you need raw ICMP, *also* uncomment:
# AmbientCapabilities=CAP_NET_RAW
# CapabilityBoundingSet=CAP_NET_RAW
# Otherwise drop all capabilities:
CapabilityBoundingSet=

[Install]
WantedBy=multi-user.target
```

Reload systemd and enable:

```sh
sudo systemctl daemon-reload
sudo systemctl enable quptime.service
```

## Initialise the node

**Don't start the service yet** — `qu init` must run first, and it
must run as the `quptime` user so it creates files with the right
ownership.

On the **first** host (it will print a secret; copy it):

```sh
sudo -u quptime QUPTIME_DIR=/etc/quptime \
  qu init --advertise alpha.example.com:9901
```

On every **other** host (paste the secret):

```sh
sudo -u quptime QUPTIME_DIR=/etc/quptime \
  qu init --advertise bravo.example.com:9901 --secret '<paste>'

sudo -u quptime QUPTIME_DIR=/etc/quptime \
  qu init --advertise charlie.example.com:9901 --secret '<paste>'
```

## Open the firewall

`qu` needs TCP/9901 reachable between cluster members. Adjust to your
firewall:

```sh
# ufw
sudo ufw allow from <peer-ip> to any port 9901 proto tcp

# firewalld
sudo firewall-cmd --permanent --zone=internal \
  --add-rich-rule='rule family=ipv4 source address=<peer-ip> port port=9901 protocol=tcp accept'
sudo firewall-cmd --reload

# nftables (drop-in)
table inet filter {
  chain input {
    ip saddr { 10.0.0.10, 10.0.0.11, 10.0.0.12 } tcp dport 9901 accept
  }
}
```

For exposing 9901 to the open internet see
[public-internet.md](public-internet.md).

## Start the daemon

```sh
sudo systemctl start quptime
sudo systemctl status quptime
journalctl -u quptime -f
```

## Invite peers

From one node (typically `alpha`):

```sh
sudo -u quptime qu node add bravo.example.com:9901
# Pause a few seconds so heartbeats reach the new peer before the next add —
# otherwise the "needs ≥2 live to mutate" check rejects the second invite.
sudo -u quptime qu node add charlie.example.com:9901
```

`qu node add` prints each remote's fingerprint and asks for SSH-style
confirmation. Verify the fingerprint over an out-of-band channel (the
remote operator can show theirs with
`sudo -u quptime qu status` or by reading `trust.yaml`).

## Verify

```sh
sudo -u quptime qu status
```

Expect to see all three peers `live=true` and one of them as
`master`.

## Log scraping

`journalctl -u quptime` is the canonical log stream. Notable lines:

| Pattern | Meaning |
| ------------------------------------------------------------- | --------------------------------------------------------- |
| `listening on ... as node ...` | Daemon up. |
| `manual-edit: cluster.yaml changed externally — replicating…` | An operator edited `cluster.yaml` directly. |
| `manual-edit: parse cluster.yaml: ...` | Invalid YAML on disk; the operator must fix and re-save. |
| `report to master ...: <err>` | A follower couldn't ship a probe result to the master. |
| `replicate: pull from ...: <err>` | A follower couldn't pull a higher-version config snapshot. |

## Sample reload / restart drill

After editing the unit file:

```sh
sudo systemctl daemon-reload
sudo systemctl restart quptime
```

After editing `cluster.yaml` by hand:

```sh
sudoedit /etc/quptime/cluster.yaml
# No restart needed — the watcher picks it up within 2s and pushes to master.
```

After upgrading the binary:

```sh
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu  # if you use raw ICMP
sudo systemctl restart quptime
```

Doing rolling upgrades? See [operations.md](../operations.md).
|
||||||
@@ -0,0 +1,181 @@
|
|||||||
|
# Deployment: Tailscale / WireGuard overlay
|
||||||
|
|
||||||
|
When your nodes live in different networks — different VPS providers,
|
||||||
|
different physical sites, a mix of home and cloud — exposing TCP/9901
|
||||||
|
to the open internet is a poor idea. An overlay network gives every
|
||||||
|
node a stable private IP regardless of NAT, and `qu` only needs to
|
||||||
|
listen on that overlay address.
|
||||||
|
|
||||||
|
This page focuses on Tailscale because the repo ships an example
|
||||||
|
compose for it, but everything generalises to WireGuard, Nebula, or a
|
||||||
|
self-hosted Headscale.
|
||||||
|
|
||||||
|
## The big idea
|
||||||
|
|
||||||
|
```
|
||||||
|
+--- host A (VPS, no public ICMP) ----+
|
||||||
|
| tailscale ←→ overlay ip 100.64.1.1 |
|
||||||
|
| qu listening on 100.64.1.1:9901 |
|
||||||
|
+-------------------------------------+
|
||||||
|
│ mTLS over overlay
|
||||||
|
▼
|
||||||
|
+--- host B (homelab behind NAT) -----+
|
||||||
|
| tailscale ←→ overlay ip 100.64.1.2 |
|
||||||
|
| qu listening on 100.64.1.2:9901 |
|
||||||
|
+-------------------------------------+
|
||||||
|
```
|
||||||
|
|
||||||
|
`bind_addr` is set to the tailscale IP, the host's public interface
|
||||||
|
has no port 9901 open, and the cluster secret + mTLS handshake gate
|
||||||
|
the link inside the tunnel.
|
||||||
|
|
||||||
|
## Compose recipe
|
||||||
|
|
||||||
|
The repo ships [`docker/docker-compose-tailscale.yml`](../../docker/docker-compose-tailscale.yml).
|
||||||
|
The relevant trick is `network_mode: "service:tailscale"` — the
|
||||||
|
`quptime` container shares the network namespace of the `tailscale`
|
||||||
|
sidecar so it sees the tailnet as its own interface.
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
services:
|
||||||
|
tailscale:
|
||||||
|
image: tailscale/tailscale:latest
|
||||||
|
container_name: tailscale
|
||||||
|
cap_add: [NET_ADMIN]
|
||||||
|
environment:
|
||||||
|
- TS_AUTHKEY=${TAILSCALE_AUTHKEY} # provision via .env
|
||||||
|
- TS_HOSTNAME=quptime-${HOST} # name visible in admin
|
||||||
|
volumes:
|
||||||
|
- /dev/net/tun:/dev/net/tun
|
||||||
|
- tailscale:/var/lib/tailscale
|
||||||
|
restart: unless-stopped
|
||||||
|
|
||||||
|
quptime:
|
||||||
|
image: git.cer.sh/axodouble/quptime:v0.1.0
|
||||||
|
container_name: quptime
|
||||||
|
volumes:
|
||||||
|
- quptime:/etc/quptime
|
||||||
|
network_mode: "service:tailscale"
|
||||||
|
depends_on: [tailscale]
|
||||||
|
cap_add: [NET_RAW]
|
||||||
|
# No restart directive yet — needs `qu init` first.
|
||||||
|
|
||||||
|
volumes:
|
||||||
|
tailscale:
|
||||||
|
quptime:
|
||||||
|
```
|
||||||
|
|
||||||
|
### One-time bootstrap
|
||||||
|
|
||||||
|
Each host runs the same script with different `HOST` and `TAILSCALE_AUTHKEY`:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# .env
|
||||||
|
HOST=alpha
|
||||||
|
TAILSCALE_AUTHKEY=tskey-auth-xxxxxxxx
|
||||||
|
```
|
||||||
|
|
||||||
|
Start Tailscale alone first so it gets an IP:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
docker compose up -d tailscale
|
||||||
|
sleep 5
|
||||||
|
TSIP=$(docker compose exec tailscale tailscale ip --4)
|
||||||
|
echo "this node's tailnet IP: $TSIP"
|
||||||
|
```
|
||||||
|
|
||||||
|
On the **first** host, init without `--secret`:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
docker compose run --rm quptime init --advertise "$TSIP:9901"
|
||||||
|
# Grab the printed secret; pipe through your password manager.
|
||||||
|
```
|
||||||
|
|
||||||
|
On every **other** host, paste the secret:
|
||||||
|
|
||||||
|
```sh
docker compose run --rm quptime init \
  --advertise "$TSIP:9901" \
  --secret "$CLUSTER_SECRET"
```

Then bring up `qu` on every node and invite from the first:

```sh
# Each host
docker compose up -d quptime

# From alpha
docker compose exec quptime qu node add 100.64.1.2:9901
sleep 3
docker compose exec quptime qu node add 100.64.1.3:9901

docker compose exec quptime qu status
```

## Tailscale ACLs

Belt and braces — even though mTLS pins identities, lock down the
tailnet itself so only the `qu` nodes can reach each other's :9901.
In the Tailscale admin console:

```jsonc
{
  "tagOwners": { "tag:qu-node": ["group:ops"] },
  "acls": [
    {
      "action": "accept",
      "src": ["tag:qu-node"],
      "dst": ["tag:qu-node:9901"]
    }
    // ...your other rules
  ]
}
```

Then tag every `qu` node in its auth key:

```yaml
environment:
  - TS_AUTHKEY=${TAILSCALE_AUTHKEY}?ephemeral=false&tags=tag:qu-node
```

## WireGuard / Nebula / Headscale equivalents

The recipe generalises:

1. Provision the overlay interface on each host with a stable
   private IP (the tunnel's own address).
2. `qu init --advertise <overlay-ip>:9901`.
3. Set `bind_addr: <overlay-ip>` in `node.yaml` so the daemon does
   **not** also listen on the public interface.
4. Open `:9901` only on the overlay interface in your firewall — for
   nftables that's something like `iifname "wg0" tcp dport 9901
   accept` (a sketch follows this list).

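A minimal nftables sketch of step 4, assuming the overlay interface is
`wg0` and that you already manage an `inet filter` table with an
`input` chain (adapt the names to your ruleset):

```sh
# Accept cluster traffic only when it arrives over the overlay interface,
# then drop :9901 on every other interface.
nft add rule inet filter input iifname "wg0" tcp dport 9901 accept
nft add rule inet filter input tcp dport 9901 drop
```
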
The cluster secret and mTLS fingerprints still apply; the overlay just
removes the open-internet attack surface.

## Why prefer overlay over public exposure

- The overlay is not a single point of compromise: an attacker who
  finds an exploit in your overlay client (rare; Tailscale and
  WireGuard are small surfaces) still hits the application-layer
  pinning before any cluster-level operation.
- The cluster secret can be lower-entropy when it's already
  unreachable from outside. (You should still treat it as a real
  secret; "defence in depth" only works if every layer is real.)
- ICMP probes from a homelab to a target on the public internet are
  trivial through NAT, but ICMP *into* a homelab usually isn't.
  Running `qu` on a tailnet means peers can heartbeat each other
  regardless of NAT direction.

## Trade-offs

- One more thing to monitor. If your tailnet is down, your monitor is
  down. Counter-measure: run *another* tiny `qu` cluster (or a single
  node) on the public internet that watches the overlay's coordinator
  health.
- Probe latency includes the overlay's hop. Tailscale's WireGuard is
  fast (<1 ms LAN, single-digit ms WAN) so this rarely matters, but
  if you're alerting on tight latency thresholds, account for it.

@@ -0,0 +1,104 @@

# Installation

`qu` ships as a single static Linux binary. Pick whichever method
matches how you manage software on the host.

> Choosing a deployment recipe instead? Jump to
> [systemd](deployment/systemd.md), [Docker](deployment/docker.md),
> [Tailscale](deployment/tailscale.md), or
> [public-internet](deployment/public-internet.md).

## Pre-built binary (recommended)

Releases are published to the [Gitea releases
page](https://git.cer.sh/axodouble/quptime/releases) with a
`SHA256SUMS` file. Two architectures are built: `linux-amd64` and
`linux-arm64`.

```sh
# Always pin to a tag — `latest` resolves on the server side.
TAG=v0.1.0
ARCH=amd64   # or arm64

curl -fSL -o qu \
  "https://git.cer.sh/axodouble/quptime/releases/download/${TAG}/qu-${TAG}-linux-${ARCH}"
curl -fSL -o SHA256SUMS \
  "https://git.cer.sh/axodouble/quptime/releases/download/${TAG}/SHA256SUMS"

# Verify before installing.
sha256sum --check --ignore-missing SHA256SUMS

sudo install -m 0755 qu /usr/local/bin/qu
```

## One-line install script

The repo ships an `install.sh` that handles the download, checksum,
shell-completion installation, and a default systemd unit file. Run it
under `sudo` so it can write to `/usr/local/bin` and
`/etc/systemd/system`.

```sh
curl -fsSL https://git.cer.sh/Axodouble/QUptime/raw/branch/master/install.sh | sudo bash
```

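If piping a script straight into `sudo bash` makes you uneasy, the same
URL can be fetched, read, and then run:

```sh
curl -fsSLO https://git.cer.sh/Axodouble/QUptime/raw/branch/master/install.sh
less install.sh
sudo bash install.sh
```
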
What it does:

1. Looks up the latest release via the Gitea API.
2. Downloads the binary to `/usr/local/bin/qu`.
3. Installs bash / zsh / fish completion if a target directory exists.
4. Writes `/etc/systemd/system/qu-serve.service` and enables it (but
   does **not** start it — you need to run `qu init` first).

The unit it writes is minimal. For a production unit with hardening,
see the [systemd deployment guide](deployment/systemd.md).

## Build from source

Requires Go 1.24.2 or newer.

```sh
git clone https://git.cer.sh/axodouble/quptime.git
cd quptime
go build -ldflags "-X main.version=$(git describe --tags --always)" -o qu ./cmd/qu

./qu --version
```

Static binary, no cgo. `CGO_ENABLED=0` is the default on a clean Go
install; if you've enabled cgo globally, set it explicitly:

```sh
CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" -o qu ./cmd/qu
```

## Docker image

A multi-arch (`amd64` + `arm64`) image is published to the Gitea
registry on every tag and every push to `master`:

```
git.cer.sh/axodouble/quptime:master   # tip of master
git.cer.sh/axodouble/quptime:v0.1.0   # tagged release
```

See the [Docker deployment guide](deployment/docker.md) for compose
files and volume layout.

## Verifying the install

```sh
qu --version
qu --help
```

If completions are installed, `qu <tab>` will list subcommands. After
`qu init` you can run `qu status` to confirm the daemon is reachable
over its control socket.

## Next steps

- [Configure the node and the cluster](configuration.md).
- Pick a deployment recipe under [docs/deployment/](deployment/).
- Walk through the [architecture](architecture.md) so the operational
  guarantees are clear before you commit to a topology.

@@ -0,0 +1,225 @@

# Operations

Day-2 tasks: keeping `qu` healthy, upgrading without dropping checks,
backing up state, recovering from failures. Pair this with
[troubleshooting.md](troubleshooting.md) for "the cluster is on fire,
what now" specifics.

## Upgrades

### Rolling upgrade (zero alert loss)

`qu` is built to tolerate one node being absent at a time as long as
quorum still holds. The simple recipe for a 3-node cluster:

```sh
# On each node in turn:
sudo systemctl stop quptime
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu   # if you use raw ICMP
sudo systemctl start quptime

# Wait for the node to rejoin before moving on:
sudo -u quptime qu status   # should show quorum true, all peers live
```

The first node you upgrade may briefly be a follower with a *higher*
binary version than the master. That's fine as long as there are no
on-disk format changes; the wire protocol and `cluster.yaml` schema
are stable within a minor version, so minor / patch upgrades freely
interleave.

For major-version upgrades that change the on-disk format, the release
notes will spell out the migration. As of v0 there have been none.

### Downgrades

A node that downgrades to an older binary will refuse to start if
`cluster.yaml` contains fields the older version doesn't know. To
roll back across a schema change, either:

- Take the cluster offline and downgrade all nodes simultaneously.
- Restore a `cluster.yaml` from before the schema change on every node
  before starting the downgraded binary.

Within a single minor version, downgrade is symmetrical with upgrade.

### What can go wrong

- **Restarting two nodes at once in a 3-node cluster** loses quorum.
  No mutations succeed, no alerts fire. Quorum returns the moment
  the second node is back.
- **A node that has been offline for a long time** comes back with a
  stale `cluster.yaml`. It will pull the master's higher version
  within ~1 heartbeat. Don't pre-emptively delete its `cluster.yaml`
  — let the catch-up path handle it.

## Backups

Three files matter, in descending order of "pain if lost":

| File               | Why back it up                                                    |
| ------------------ | ----------------------------------------------------------------- |
| `node.yaml`        | Holds the cluster secret. Lose it and the node can't rejoin.      |
| `keys/private.pem` | Lose it and you must `qu init` a fresh identity and re-trust.     |
| `cluster.yaml`     | Resyncs from any other live peer, so per-node backup is optional. |

### Per-host backup

```sh
#!/bin/sh
# /etc/cron.daily/quptime-backup
set -eu
dst=/var/backups/quptime/$(date +%Y%m%d)
mkdir -p "$dst"
cp -a /etc/quptime/node.yaml "$dst/"
cp -a /etc/quptime/keys "$dst/keys"
cp -a /etc/quptime/cluster.yaml "$dst/cluster.yaml"
chmod -R go-rwx "$dst"
```

### Cluster-wide backup

The cluster state (`peers`, `checks`, `alerts`) is identical across
every node. Back up one healthy node's `cluster.yaml` and you have
the canonical copy. To restore:

```sh
# Stop the daemon.
sudo systemctl stop quptime

# Drop in the backup. Reset the version to 0 so the running cluster's
# higher version supersedes whatever you're holding — otherwise this
# node will broadcast a stale snapshot and confuse everyone.
sudo cp backup-cluster.yaml /etc/quptime/cluster.yaml
sudo sed -i 's/^version:.*/version: 0/' /etc/quptime/cluster.yaml

sudo systemctl start quptime
# Within seconds the version-observer pulls the live version from a peer.
```

If you're restoring **the entire cluster** (every node lost), the
"reset version to 0" trick doesn't apply — there's no peer with a
higher version. Pick the highest-version backup, restore that file
across every node verbatim, and start the daemons. The cluster will
elect a master and continue.

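To pick the highest-version backup among the snapshots the cron job
above writes, something like this works (a sketch; it assumes the
`/var/backups/quptime/<date>/` layout and the top-level `version:`
field in `cluster.yaml`):

```sh
# Print the backed-up cluster.yaml carrying the highest version number.
grep -H '^version:' /var/backups/quptime/*/cluster.yaml | sort -t: -k3 -n | tail -1
```
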
## Replacing a dead node

A node has died permanently. You want to add a fresh box with the
same role.

1. On a surviving node, evict the dead one:

   ```sh
   sudo -u quptime qu node remove <dead-node-id>
   ```

   This drops it from `cluster.yaml` and removes its trust entry. The
   cluster shrinks by one member — verify quorum still holds.

2. On the new host, install `qu` and `qu init` against the existing
   cluster secret:

   ```sh
   sudo -u quptime qu init \
     --advertise delta.example.com:9901 \
     --secret '<existing cluster secret>'
   sudo systemctl start quptime
   ```

3. From a surviving node, invite the new one:

   ```sh
   sudo -u quptime qu node add delta.example.com:9901
   ```

The dead node's checks and alerts are unaffected — they live in the
replicated `cluster.yaml`, not the dead node's identity.

## Recovering from lost quorum

You've lost more than half the cluster simultaneously. The remaining
nodes refuse to mutate (correct behaviour: they have no way to know
whether the missing nodes are dead or partitioned).

Options:

- **Bring the missing nodes back.** Always the right first move if it's
  possible. The cluster recovers automatically once enough nodes are
  live.
- **Shrink the cluster.** If you've genuinely lost the missing nodes
  permanently and can't bring them back, you need to manually edit
  `cluster.yaml` on every surviving node to remove the dead peers,
  then restart. Be very deliberate:

  ```sh
  # On each surviving node:
  sudo systemctl stop quptime
  sudoedit /etc/quptime/cluster.yaml   # delete the dead peers[] entries
                                       # bump version to something higher
  sudo systemctl start quptime
  ```

  Make sure every surviving node has identical `cluster.yaml` content
  before restarting any of them. If they don't, you'll get conflicting
  views of who's in the cluster and elections will flap.

- **Start over.** For small clusters this is often faster than the
  manual surgery above: `rm -rf /etc/quptime` everywhere, then
  bootstrap from scratch. You'll lose your checks and alerts unless
  you saved a copy of `cluster.yaml` elsewhere.

## Monitoring `qu` itself

`qu` watches your services. Who watches `qu`?

### From within the cluster

`qu status` is the single source of truth. The fields to watch:

| Field        | Healthy                | Suspicious                                           |
| ------------ | ---------------------- | ---------------------------------------------------- |
| `quorum`     | `true`                 | `false` — no mutations, no alerts.                   |
| `master`     | a NodeID               | `(none — ...)` — quorum lost or election in flight.  |
| `term`       | slow growth            | rapid growth → master flapping, network unstable.    |
| `config ver` | identical across nodes | divergence → a node is stuck pulling.                |

A simple cron sentinel on each node:

```sh
# A crontab entry must stay on a single line.
*/5 * * * * /usr/local/bin/qu status >/dev/null 2>&1 || curl -fsSL -X POST -d "qu down on $(hostname)" https://alert.example.com/oncall
```

### From outside the cluster

`qu` does not currently expose a Prometheus / OpenMetrics endpoint.
The recommended pattern is to run a *separate* tiny monitoring path
that doesn't depend on `qu` — even a single `curl` health check on
each node's :9901 (which is TLS-only; you'll see a handshake succeed
even if the daemon's stuck) catches process death.

To produce structured metrics, write a sidecar that parses `qu status`
output and exports counters. The CLI emits stable, machine-grep-able
output specifically so this is straightforward.

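As a starting point, here is a textfile-collector sketch. It assumes
`qu status` exits non-zero when the daemon is unreachable and prints a
line containing `quorum true` when quorum holds; adjust the grep to the
real output of your version:

```sh
#!/bin/sh
# Emit node_exporter textfile metrics derived from `qu status`.
# Redirect the output to e.g. /var/lib/node_exporter/textfile/qu.prom
# from cron or a systemd timer.
set -eu
out=$(/usr/local/bin/qu status 2>/dev/null) || { echo "qu_up 0"; exit 0; }
echo "qu_up 1"
if printf '%s\n' "$out" | grep -qiE 'quorum[: ]+true'; then
  echo "qu_quorum 1"
else
  echo "qu_quorum 0"
fi
```
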
## Operational checklist before you go to bed

After standing up a new cluster, work through:

- [ ] All nodes show `quorum true` in `qu status`.
- [ ] All nodes show identical `config ver`.
- [ ] All nodes show the same `master`.
- [ ] `journalctl -u quptime --since "10 min ago"` has no
      `propose to master:` or `replicate: pull from:` errors (a grep
      for both follows this list).
- [ ] `qu alert test <name>` reaches your inbox / Discord channel for
      every configured alert.
- [ ] At least one check has an intentional failure (a bogus target)
      that you flip back and forth to verify the full state-transition
      → dispatch path end-to-end.
- [ ] Backups of `node.yaml` + `keys/` + `cluster.yaml` are landing in
      your backup destination.
- [ ] Firewall allow-list (if any) lists every peer's IP.
- [ ] You've stored the cluster secret somewhere that survives the
      first operator leaving.

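The journal check as a one-liner (the two prefixes are the error lines
named above; no output means the node is clean):

```sh
journalctl -u quptime --since "10 min ago" | grep -E "propose to master:|replicate: pull from:"
```
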
@@ -0,0 +1,153 @@

# Security

The trust model in one page. Read this before deciding where to put
`qu` and who can talk to it.

## What `qu` is trying to defend against

- **Eavesdropping on cluster traffic.** Defended: TLS 1.3 only,
  fingerprint-pinned per peer.
- **MITM on the cluster's inter-node link.** Defended: TLS 1.3 with
  out-of-band fingerprint verification at `qu node add`.
- **A random internet host enrolling itself as a peer.** Defended:
  pre-shared cluster secret on every `Join`.
- **A compromised peer issuing forged cluster-config mutations.** Not
  defended. A peer trusted enough to be in `cluster.yaml.peers` can
  propose mutations through the master. Treat membership as a
  privilege.
- **A compromised peer becoming master.** Election is deterministic on
  the smallest live `NodeID`, so a compromised peer can become master
  if its `NodeID` sorts first. The master can rewrite `cluster.yaml`
  arbitrarily. This is the worst-case blast radius from one compromised
  node.
- **DoS by handshake flood.** Not directly defended at the application
  layer. The TLS stack accepts anyone's handshake; rate-limiting belongs
  at the firewall — see [public-internet.md](deployment/public-internet.md).

## The three secrets on disk

| Secret                          | What it is                                                     | Loss impact                               |
| ------------------------------- | -------------------------------------------------------------- | ----------------------------------------- |
| `keys/private.pem`              | RSA private key, this node's identity.                         | Anyone with it can impersonate this node. |
| `node.yaml.cluster_secret`      | Pre-shared base64 string.                                      | Anyone with it can `Join` the cluster.    |
| `trust.yaml.entries[].cert_pem` | Other peers' public certs (not secrets, but they enable mTLS). | Loss only forces re-trust.                |

The first two are real secrets and live under `0600` permissions in
the data directory. Back them up; never commit them; never paste them
in chat.

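A quick way to confirm the on-disk layout matches that expectation
(paths per the default `/etc/quptime` layout; the `quptime` group name
is an assumption):

```sh
sudo ls -l /etc/quptime/node.yaml /etc/quptime/keys/private.pem
# If either is more open than 0600:
sudo chmod 0600 /etc/quptime/node.yaml /etc/quptime/keys/private.pem
sudo chown -R quptime:quptime /etc/quptime
```
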
## TLS handshake step by step

For every inter-node call:

1. Caller dials peer on its `advertise` address.
2. TLS 1.3 handshake. Both sides present their self-signed leaf cert.
3. The caller's `VerifyPeerCertificate` (set in
   `internal/transport/tls.go`) computes the SPKI fingerprint of the
   server's cert and compares it against `trust.yaml`. If the caller
   knows which `NodeID` it expected, a strict verifier ensures the
   fingerprint matches *that specific* entry — not just any trusted
   peer.
4. The server's TLS layer accepts any client cert (`RequireAnyClientCert`,
   `InsecureSkipVerify: true`) because trust is enforced one layer up.
5. The RPC dispatcher reads the client's cert, computes its
   fingerprint, and looks it up in the server's `trust.yaml`. If no
   entry exists, only the `Join` method is permitted.
6. `Join` performs a constant-time comparison of the inbound
   `ClusterSecret` against `node.yaml.cluster_secret`. Mismatch →
   refusal.

So:

- An adversary who gets your **public** cert can't impersonate you.
- An adversary who gets your **fingerprint** can't impersonate you.
- An adversary who gets your **private key** *can* impersonate you to
  any peer that trusts your fingerprint.

## The TOFU step

`qu node add <host:port>` runs a one-shot insecure dial against the
target (the only place `InsecureBootstrapConfig` is used in the
codebase, see `internal/transport/tls.go:91`). It fetches the
remote's cert, prints the fingerprint, and asks for confirmation.

This is **identical** to SSH's first-connection prompt. The operator
must verify the fingerprint out of band — by running `qu status` on
the remote side, or by reading `keys/cert.pem` directly, or via a
known-good distribution channel.

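Reading `keys/cert.pem` directly can look like this (a sketch; it
assumes the pinned value is the SHA-256 of the certificate's SPKI, as
described in the handshake section, so compare the digest itself rather
than any particular prefix or encoding):

```sh
# Run on the remote node, then compare against what `qu node add` printed.
openssl x509 -in /etc/quptime/keys/cert.pem -pubkey -noout \
  | openssl pkey -pubin -outform DER \
  | openssl dgst -sha256
```
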
If you skip verification, you trust the network at that moment. If
the network was MITM'd at exactly that moment, you trust the
attacker. After the prompt, the cert is pinned and the window closes.

## Cluster secret rotation

There is no built-in command to rotate the cluster secret. The hard
part isn't generating a new one — it's distributing it consistently
across every node. The pragmatic recipe:

1. Generate a new secret on one node and copy it to every other node
   (a sketch follows this list).
2. Update `node.yaml.cluster_secret` on every node (manual edit).
3. Restart each daemon one at a time, verifying quorum returns
   between restarts.

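Sketched out, assuming any sufficiently random base64 string is an
acceptable secret and the default `/etc/quptime` layout:

```sh
# Step 1: generate once, store it in your secret manager, distribute out of band.
openssl rand -base64 32

# Steps 2 and 3: on every node, one node at a time.
sudo systemctl stop quptime
sudoedit /etc/quptime/node.yaml        # set cluster_secret: to the new value
sudo systemctl start quptime
sudo -u quptime qu status              # confirm quorum before moving on
```
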
Rotation only protects future `Join` calls, not anything else. If you
suspect the old secret has been seen by an adversary, also assume any
peer that was added during the leaked window is compromised, and
re-init those peers from scratch.

## Identity rotation

To roll a node's RSA keypair (e.g., the private key was on a laptop
that got stolen):

```sh
# On the compromised node:
sudo systemctl stop quptime
sudo rm -rf /etc/quptime
sudo -u quptime qu init \
  --advertise this-host.example.com:9901 \
  --secret '<existing cluster secret>'
sudo systemctl start quptime

# On a surviving healthy node:
sudo -u quptime qu node remove <old-node-id>   # evict the old identity
sudo -u quptime qu node add this-host.example.com:9901
```

The new `node_id` is a fresh UUID; the old one is gone for good. Any
historical references to it (e.g., the `updated_by` field on past
versions of `cluster.yaml`) are cosmetic.

## What the local control socket protects

`$XDG_RUNTIME_DIR/quptime/quptime.sock` (or `/var/run/quptime/...`) is
the channel the CLI uses to talk to the local daemon. It's
`0600`-permissioned and authenticated solely by filesystem ACLs — no
TLS, no secrets in the protocol.

Anyone who can `read+write` the socket can:

- Propose cluster mutations (will be relayed to the master).
- Read full cluster state including `cluster.yaml`.
- Trigger test alerts.

So: don't put the daemon's user in a group that other unprivileged
users share. The default systemd setup with a dedicated `quptime`
user gets this right.

## Hardening checklist

- [ ] Dedicated `quptime` system user.
- [ ] Data directory owned by that user, mode 0750.
- [ ] `keys/private.pem` mode 0600.
- [ ] `node.yaml` mode 0600.
- [ ] systemd unit uses `ProtectSystem=strict`, `NoNewPrivileges=true`,
      and the rest of the hardening directives in
      [systemd.md](deployment/systemd.md).
- [ ] If `:9901` is internet-reachable, firewall allow-list to peer
      IPs or use an overlay — see [public-internet.md](deployment/public-internet.md)
      and [tailscale.md](deployment/tailscale.md).
- [ ] Cluster secret generated by `qu init` (not chosen by a human),
      stored in your secret manager.
- [ ] Backups of `keys/` and `node.yaml` are encrypted at rest.

@@ -0,0 +1,199 @@

# Troubleshooting

The cluster is misbehaving. This page is organised by symptom. Each
entry pairs the user-visible signal with the log line(s) you'll see
in `journalctl -u quptime` and the fix.

## `qu status` shows `quorum false`

**What it means.** Fewer than ⌊N/2⌋+1 peers are live.

**Diagnose.** Look at the PEERS table. The `LIVE` column tells you
which peers this node has stopped hearing from.

- If only this node is "live" and everyone else is not → this node is
  network-isolated. Test: `nc -zv <peer-advertise>`. Fix: network /
  firewall.
- If multiple nodes show false → more than one peer is down. Look at
  the other peers' status outputs to triangulate.
- If everyone is live but `quorum false` still → check
  `cluster.yaml.peers` length vs. live count; you may have phantom
  peer entries left over from a removed-but-not-evicted node. Fix:
  `qu node remove <ghost-node-id>` from any live node.

## `qu status` shows `master (none — ...)`

**What it means.** Either no quorum (see above) or election is in
flight. The latter clears within ~1 heartbeat.

If `term` is incrementing rapidly (`watch qu status`), the master is
flapping. Causes:

- The currently-elected master is unreachable from some peers but
  reachable from others, partial-partition style. Look for log lines
  on the suspected master about peers it can't reach.
- Heartbeat timeouts (default 4s) are too tight for your inter-node
  link. Rebuild with a higher `DefaultDeadAfter` if you need it.

## A check is stuck in `unknown`

**What it means.** The aggregator has no fresh reports for that check.

Possible causes:

- No node is actually running the probe yet. Probes start ~`interval/10`
  after `qu serve` boots and reconcile every 5s. Wait 10s and
  re-check.
- Nodes are submitting results but they're stale (older than 3×
  interval). Probably means probes are timing out without reporting.
- This is a follower's view; the aggregator runs on the master only.
  Check `qu status` on the master to see the canonical view.

## Alerts not firing

Walk this list in order; one of them will catch it:

1. **Is there quorum?** Aggregator runs on master only. No master →
   no transitions → no alerts.
2. **Is the alert attached to the check?** `qu status` shows the
   effective alert list per check. Empty → no alert. Confirm with
   `qu alert list` that the alert exists and (if relying on default
   attachment) has `default: true`.
3. **Is the alert suppressed on this check?** Check
   `suppress_alert_ids` in `cluster.yaml`.
4. **Test the alert path directly:**

   ```sh
   sudo -u quptime qu alert test <name>
   ```

   This bypasses the aggregator and renders a synthetic transition.
   If `alert test` doesn't deliver, the problem is the notifier
   config or the template — see below. If `alert test` works but real
   transitions don't, the aggregator isn't observing the transition.
5. **Has the check actually transitioned?** Aggregator commits a flip
   only after **two consecutive** evaluations agree. A bouncing
   target may never satisfy the hysteresis. Lower the check interval
   or increase reliability of the target.

## Discord webhook returns 4xx

The dispatcher logs the HTTP body. Common causes:

- Webhook revoked / channel deleted → 404. Re-issue and update
  `discord_webhook`.
- Body too large → 400. Long templates that pull `Snapshot.Detail`
  with multi-line errors can blow past Discord's 2000-char limit.
  Shorten the template or trim the variable.
- Rate-limited → 429. Reduce alert frequency or stop suppressing the
  hysteresis.

## SMTP refuses the message

Check the daemon log for `smtp:` lines. Most common:

- `530 5.7.0 Must issue a STARTTLS command first` → set
  `smtp_starttls: true` on the alert.
- `535 Authentication failed` → wrong `smtp_user` / `smtp_password`.
- Connection refused / timeout → firewall between `qu` and the SMTP
  relay. Verify with `openssl s_client -starttls smtp -connect host:587`.

## Manual edit to `cluster.yaml` was ignored

Symptoms: you edited the file, saved, nothing happened.

Look for one of these log lines:

- `manual-edit: parse cluster.yaml: <err> — ignoring` → YAML is
  invalid. The daemon pins the bad hash and waits for the next valid
  save. Run the file through `yq` or `python -c "import yaml,sys;
  yaml.safe_load(open(sys.argv[1]))" cluster.yaml` to diagnose.
- `manual-edit: cluster.yaml changed externally — replicating via
  master` followed by `manual-edit: forward to master: no quorum` →
  cluster has no quorum, can't accept the edit. Restore quorum first.
- *No log line at all* → the on-disk content didn't change in a way
  that matters. The watcher compares only `peers`, `checks`, and
  `alerts`; whitespace and comment edits are accepted silently.

## Two nodes disagree on `config ver`

The follower with the lower version should pull within one heartbeat.
If after ~5 seconds the gap persists:

- The follower might not have an `advertise` address for the higher-
  versioned peer. The version observer needs one to pull. Check
  `cluster.yaml.peers` for both sides' `advertise` fields.
- The follower's TLS handshake against the higher-versioned peer is
  failing — look for `replicate: pull from <id>: <err>` lines.
- The peer with the higher version is announcing it correctly but the
  follower is rejecting the `ApplyClusterCfg` broadcasts because of
  its own decode error — look for transport-layer errors instead.

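To see the divergence at a glance, compare the field across nodes. A
sketch, assuming SSH access with passwordless sudo on each host and
that `qu status` prints a `config ver` line (hostnames are
placeholders):

```sh
for h in alpha.example.com beta.example.com gamma.example.com; do
  printf '%s: ' "$h"
  ssh "$h" 'sudo -u quptime qu status | grep -i "config ver"'
done
```
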
## "needs ≥2 live to mutate" rejection during bootstrap
|
||||||
|
|
||||||
|
You ran two `qu node add` commands back-to-back and the second one
|
||||||
|
failed. The first add doesn't take effect until the new peer sends
|
||||||
|
its first heartbeat (≤ 1 second); during that window the cluster has
|
||||||
|
size 2 and quorum size 2, so a *second* peer add from a 1-live
|
||||||
|
cluster looks like "mutate without quorum."
|
||||||
|
|
||||||
|
Fix: pause ~3 seconds between adds. The README and the systemd guide
|
||||||
|
both call this out.
|
||||||
|
|
||||||
|
## Daemon refuses to start

```
load node.yaml: open ...: no such file or directory
```

Run `qu init` before `qu serve`. The daemon does not auto-init —
silently generating identities and secrets would be a worse failure
mode than crashing.

```
node.yaml has empty node_id — run `qu init` first
```

Same fix.

```
listen tcp :9901: bind: address already in use
```

Another process owns the port. `ss -tlnp | grep :9901` to find it.

```
load private key: ...
```

Permissions on `keys/private.pem` are wrong — should be 0600 and owned
by the daemon user. Fix and restart.

## Probes look much slower than expected

ICMP first:

- Default ICMP is **unprivileged UDP-mode pings**, not raw ICMP. UDP
  ping is a bit slower and may hit different kernel paths. For
  reference latency, grant `CAP_NET_RAW`.

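Granting it is the same one-liner the rolling-upgrade recipe in
[operations.md](operations.md) uses:

```sh
sudo setcap cap_net_raw=+ep /usr/local/bin/qu
```
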
HTTP / TCP:

- `interval` and `timeout` are the only knobs in `cluster.yaml`. The
  check is run synchronously per worker; if your target takes 9 s to
  respond and your timeout is 10 s, the next probe doesn't start
  until ~9 s have elapsed. Increase concurrency by adding more
  fast-interval checks against the same target, not by lowering
  timeout (which will just produce false `down` results).

## I want to start over

```sh
sudo systemctl stop quptime
sudo rm -rf /etc/quptime
sudo -u quptime qu init --advertise <addr>
sudo systemctl start quptime
```

The data directory is the only state. Wipe it and you're back to a
fresh node.