From 69537095743a2f4831ba3f53388a32ac92d799eb Mon Sep 17 00:00:00 2001
From: Axodouble
Date: Fri, 15 May 2026 04:05:30 +0000
Subject: [PATCH] AI assisted documentation

---
 README.md                          |  17 ++
 docs/README.md                     |  53 ++++++
 docs/architecture.md               | 196 +++++++++++++++++++++
 docs/configuration.md              | 273 +++++++++++++++++++++++++++++
 docs/deployment/docker.md          | 198 +++++++++++++++++++++
 docs/deployment/public-internet.md | 180 +++++++++++++++++++
 docs/deployment/systemd.md         | 250 ++++++++++++++++++++++++++
 docs/deployment/tailscale.md       | 181 +++++++++++++++++++
 docs/installation.md               | 104 +++++++++++
 docs/operations.md                 | 225 ++++++++++++++++++++++++
 docs/security.md                   | 153 ++++++++++++++++
 docs/troubleshooting.md            | 199 +++++++++++++++++++++
 12 files changed, 2029 insertions(+)
 create mode 100644 docs/README.md
 create mode 100644 docs/architecture.md
 create mode 100644 docs/configuration.md
 create mode 100644 docs/deployment/docker.md
 create mode 100644 docs/deployment/public-internet.md
 create mode 100644 docs/deployment/systemd.md
 create mode 100644 docs/deployment/tailscale.md
 create mode 100644 docs/installation.md
 create mode 100644 docs/operations.md
 create mode 100644 docs/security.md
 create mode 100644 docs/troubleshooting.md

diff --git a/README.md b/README.md
index 5c8d8ec..e4ff6bb 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,23 @@
definition, can't tell you when it's the one that's down. `qu` solves
both: run it on a few cheap hosts in different networks and they vote
on truth. If one of them loses its uplink, the rest keep alerting.

## Documentation

This README is the quick-start. For production use, the longer guides
live under [`docs/`](docs/README.md):

| If you want to…                                       | Read                                                                      |
| ----------------------------------------------------- | ------------------------------------------------------------------------- |
| understand the consensus / replication model          | [docs/architecture.md](docs/architecture.md)                              |
| reference every field in `node.yaml` / `cluster.yaml` | [docs/configuration.md](docs/configuration.md)                            |
| deploy on Linux with systemd hardening                | [docs/deployment/systemd.md](docs/deployment/systemd.md)                  |
| deploy with Docker / docker-compose                   | [docs/deployment/docker.md](docs/deployment/docker.md)                    |
| deploy over Tailscale or WireGuard                    | [docs/deployment/tailscale.md](docs/deployment/tailscale.md)              |
| expose `qu` on the open internet safely               | [docs/deployment/public-internet.md](docs/deployment/public-internet.md)  |
| upgrade, back up, or recover from failures            | [docs/operations.md](docs/operations.md)                                  |
| understand the trust model and rotate identities      | [docs/security.md](docs/security.md)                                      |
| diagnose a misbehaving cluster                        | [docs/troubleshooting.md](docs/troubleshooting.md)                        |

## Architecture

diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..4e972ca
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,53 @@
# QUptime documentation

Production-oriented documentation for `qu`, a small distributed uptime
monitor that votes on the health of HTTP/TCP/ICMP targets across a
cluster of cooperating nodes.

The top-level `README.md` is the marketing pitch and quick-start. The
pages here go deeper and are organised by what you're trying to do.

## Getting set up

- [Installation](installation.md) — pre-built binaries, building from
  source, verifying release artifacts, what the install script does.
- [Configuration](configuration.md) — `node.yaml`, `cluster.yaml`,
  `trust.yaml`, environment variables, file layout, defaults.

## Running it

- [Architecture](architecture.md) — how nodes form quorum, how a master
  is elected, how cluster state replicates, what happens during a
  partition, and exactly which guarantees the design gives you.
- [Operations](operations.md) — day-2 tasks: upgrades, backups,
  recovery from a lost node, recovery from a lost quorum, monitoring
  `qu` itself.
- [Security](security.md) — the mTLS / TOFU trust model, what the
  cluster secret protects, how to rotate keys, what to put on a public
  network and what not to.
- [Troubleshooting](troubleshooting.md) — common failure modes with
  the log lines you'll see and the fix.

## Deployment recipes

Pick the one that matches your environment. They share most of the
operational guidance — what differs is how `qu` is packaged and how
the inter-node link is secured at the network layer.

- [systemd on bare metal / VM](deployment/systemd.md) — single static
  binary, hardened unit file, `CAP_NET_RAW` for ICMP.
- [Docker / docker-compose](deployment/docker.md) — official image,
  single-node and multi-node compose files, persistent volumes.
- [Tailscale / WireGuard overlay](deployment/tailscale.md) — nodes in
  separate networks with no public ingress; cluster traffic stays on
  the tailnet.
- [Public-internet exposure](deployment/public-internet.md) — when
  you have no overlay and `:9901` is reachable from the open
  internet: firewalling, rate-limiting, secret hygiene.

## A note on stability

The wire protocol (`internal/transport`) and the on-disk format
(`cluster.yaml`, `node.yaml`, `trust.yaml`) are considered stable
within a minor version. Breaking changes will bump the major version
and ship with a migration note.

diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..84f6a5f
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,196 @@
# Architecture

This page is the long-form companion to the diagram in the top-level
README. Read it if you need to reason about partitions, recovery,
upgrade ordering, or the consistency guarantees of `qu`.

## Components

A running `qu serve` is one process containing five long-lived
goroutines plus the listeners:

| Component      | Package              | Role                                                                    |
| -------------- | -------------------- | ----------------------------------------------------------------------- |
| Transport      | `internal/transport` | mTLS listener + dialer, length-prefixed JSON-RPC framing.                |
| Quorum manager | `internal/quorum`    | 1 Hz heartbeats, liveness tracking, deterministic master election.       |
| Replicator     | `internal/replicate` | Master-routed mutations, version-gated broadcast and pull.               |
| Scheduler      | `internal/checks`    | One goroutine per check; runs HTTP/TCP/ICMP probes on each node.         |
| Aggregator     | `internal/checks`    | Master-only. Folds per-node probe results into a cluster-wide verdict.   |
| Alert dispatch | `internal/alerts`    | Master-only. Renders templates and ships SMTP / Discord notifications.   |
| Control socket | `internal/daemon`    | Local-only unix socket; the CLI and TUI talk to the daemon through it.   |

Every node runs every component. Whether the master-only ones actually
*do* anything depends on the result of master election.

## Trust and transport

Inter-node traffic is TLS 1.3 with mutual authentication. There is **no
central CA**. Each node generates a self-signed RSA cert at `qu init`
and the SPKI fingerprint of that cert is what other nodes pin against.

Two layers gate access:

1. **TLS layer** accepts any client cert. This avoids a chicken-and-egg
   during bootstrap — a brand-new node has no entry in anyone's trust
   store yet, so a strict TLS check would refuse the very first
   handshake.
2. **RPC dispatcher** rejects every method except `Join` for callers
   whose presented fingerprint is not in `trust.yaml`. So an untrusted
   peer can knock on the door but cannot ask questions.

`Join` itself is gated by the **cluster secret** — a pre-shared base64
string generated at `qu init` on the first node. Without it, an
attacker who can reach `:9901` cannot enrol themselves into the
cluster.

The local CLI talks to the daemon over a unix socket with `0600`
permissions; filesystem ACLs are the only authentication and no TLS is
used on that channel.

## The replicated state machine

`cluster.yaml` is the single replicated source of truth. It holds three
editable lists — `peers`, `checks`, `alerts` — plus three
server-controlled fields:

```yaml
version: 7              # monotonically increasing
updated_at: 2026-05-15T...
updated_by: <node-id>   # master that committed this version
peers: [...]
checks: [...]
alerts: [...]
```

### How mutations flow

1. The CLI (or the manual-edit watcher; see below) issues a mutation
   on the local daemon's control socket.
2. The daemon's replicator looks at the current quorum view:
   - If there is no quorum, the mutation fails loudly with
     `no quorum: refusing mutation`.
   - If this node is the master, apply locally and broadcast.
   - Otherwise, ship the mutation to the master via the
     `ProposeMutation` RPC and wait for the result.
3. The master holds the cluster lock, applies the mutation, bumps
   `version`, writes `cluster.yaml` atomically, and broadcasts the new
   snapshot to every peer via `ApplyClusterCfg`.
4. Each follower's `Replace` accepts the snapshot **only if**
   `incoming.Version > local.Version`. Older or equal versions are
   dropped silently.

The mutation kinds are enumerated in `internal/transport/messages.go`:
`add_check`, `remove_check`, `add_alert`, `remove_alert`, `add_peer`,
`remove_peer`, `replace_config`.

### Manual edits to `cluster.yaml`

Operators can `sudoedit /etc/quptime/cluster.yaml` on any node. Every
2 seconds the daemon hashes the file. When the on-disk hash diverges
from the last hash the daemon wrote, the new content is parsed and
forwarded to the master as a `replace_config` mutation. So a hand-edit
on a follower still ends up on the master, version-bumped, and
broadcast everywhere.

If the parse fails (invalid YAML), the daemon logs and pins the bad
hash so it doesn't loop. The operator's next valid save unblocks it.

## Quorum and master election

Every node sends a heartbeat to every peer once per second. A peer is
**live** if a heartbeat (sent or received) was observed within the
last 4 seconds — comfortably more than three missed beats, so a
one-tick blip does not unseat the master.

**Quorum** is met when `len(live_peers) >= floor(N/2) + 1` where `N`
is the total peer count in `cluster.yaml`. Below quorum, the cluster
refuses every mutation; existing checks continue probing locally but no
state transitions are committed (the master is the only one who
aggregates, and there is no master).

**Master election** is deterministic with no negotiation step: among
the live members, the master is the one with the lexicographically
smallest `NodeID`.
Every node that observes the same live set picks the +same master — so there is no split-brain window even during a partial +partition. + +The `term` integer in `qu status` is bumped every time the elected +master changes (including transitions to and from "no master"). Use it +to spot flappy clusters. + +## Catch-up when a node reconnects + +This is the scenario most people ask about: node C is offline, the +master commits config version 7, node C comes back online. What +happens? + +1. Node C's tick loop fires heartbeats every second regardless of its + previous state. There is no backoff, no give-up. +2. Each heartbeat carries the sender's `Version`. Each response carries + the responder's `Version`. +3. The first time C sees a peer reporting a higher version than its + own, the version-observer fires and calls + `replicator.PullFrom(peerID, addr)`. +4. `PullFrom` does a `GetClusterCfg` RPC against that peer and feeds + the snapshot through `Replace`, which writes `cluster.yaml` + atomically and refreshes the on-disk hash so the manual-edit + watcher doesn't re-fire. +5. Within ~1 heartbeat C is byte-for-byte identical to the master. + +The same path catches a stale node up when the partition heals on the +minority side: the minority side cannot mutate, so when it rejoins it +strictly has the older version, and the pull fires. + +There is one corner case worth knowing about: the pull only fires when +`peer_version > local_version`. Two nodes at the same version with +different content would silently diverge — but the design forbids +that (only the master mutates, and the master is the only one bumping +the version) unless somebody hand-edits `cluster.yaml` and also +manually sets `version:`. Don't do that. + +## Why a check flips state + +The aggregator runs on the master only. Followers' probe results are +shipped to the master via the `ReportResult` RPC; the master's own +probe results are submitted directly. + +For each check, the aggregator keeps the latest result per node within +a freshness window (3× the check interval, minimum 30s). On each +incoming submission it counts OK vs not-OK across the fresh results: + +- 0 fresh reports → `unknown` +- more OK than not-OK → `up` +- more not-OK than OK → `down` +- tie → `up` (a tie at one report means one node says yes and one says + no; biasing toward `up` avoids false alerts when nodes disagree + transiently). + +A state flip is **not** committed immediately. Hysteresis requires the +candidate state to hold for **two consecutive aggregate evaluations** +before the state transition fires and the alert dispatcher is called. +Set in `internal/checks/aggregator.go` as the `HysteresisCount` +constant — change it there if you want a hair-trigger or a slower +alert. + +If the master changes, the new master starts the per-check state from +`unknown` and rebuilds it as fresh results arrive. The first few +seconds after a re-election can therefore show `unknown` even for +checks that were `up` a moment ago. + +## What `qu` does *not* do + +These omissions are intentional in v1 and useful to know up front: + +- **No persistent history.** Only the current aggregate state lives in + memory. There are no graphs, no SLA reports. Add a sidecar (Prometheus + exporter, SQLite logger) if you need them. +- **No automatic key rotation.** Re-init a node and re-trust if you + need to roll its identity. See [security.md](security.md). +- **No multi-tenant isolation.** One cluster = one set of checks = + one alert tree. 
- **No web UI.** Operator surface is `qu` (CLI), `qu tui`, and direct
  edits to `cluster.yaml`.
- **No automatic peer eviction on prolonged downtime.** A dead peer
  stays in `cluster.yaml` until an operator runs `qu node remove`,
  because that decision affects the quorum size and shouldn't happen
  silently.

diff --git a/docs/configuration.md b/docs/configuration.md
new file mode 100644
index 0000000..750635f
--- /dev/null
+++ b/docs/configuration.md
@@ -0,0 +1,273 @@
# Configuration

This page is the canonical reference for the on-disk files, the
environment variables, and every field that `qu` reads. It's
deliberately tedious — when something doesn't behave the way you
expect, this is where the answer lives.

## File layout

When running as **root** (the typical case under systemd):

```
/etc/quptime/
├── node.yaml        identity, never replicated
├── cluster.yaml     replicated state
├── trust.yaml       local fingerprint trust store
└── keys/
    ├── private.pem  RSA private key (0600)
    ├── public.pem   RSA public key
    └── cert.pem     self-signed X.509 cert

/var/run/quptime/quptime.sock   control socket (0600)
```

When running as a **non-root** user (the typical case for `go run` or a
desktop test):

```
~/.config/quptime/...                  same shape as /etc/quptime
$XDG_RUNTIME_DIR/quptime/quptime.sock  control socket
```

Override the data directory with `QUPTIME_DIR=/some/path qu serve`.
Override the socket path with `QUPTIME_SOCKET=/run/foo.sock`.

## Environment variables

| Variable          | Purpose                                                                                                                     |
| ----------------- | --------------------------------------------------------------------------------------------------------------------------- |
| `QUPTIME_DIR`     | Data directory. Defaults to `/etc/quptime` (root) or `$XDG_CONFIG_HOME/quptime`.                                            |
| `QUPTIME_SOCKET`  | Path to the CLI ↔ daemon unix socket. Defaults to `/var/run/quptime/quptime.sock` (root) or `$XDG_RUNTIME_DIR/quptime/…`.   |
| `XDG_CONFIG_HOME` | Honored when running as non-root and `QUPTIME_DIR` is unset.                                                                 |
| `XDG_RUNTIME_DIR` | Honored when running as non-root and `QUPTIME_SOCKET` is unset.                                                              |

The daemon does not read any other environment variables. SMTP, Discord,
and HTTP probe targets are configured exclusively in `cluster.yaml`.

## `node.yaml` — local identity

Never replicated. One file per host. Generated by `qu init`.

```yaml
node_id: 7f3a5b9e-...              # UUIDv4, immutable after init
bind_addr: 0.0.0.0                 # listen address for :9901
bind_port: 9901                    # listen port
advertise: alpha.example.com:9901  # how peers reach us; may differ from bind
cluster_secret: 4hZqK8vT9...       # base64; required to Join, never replicated
```

### Field reference

- `node_id` — UUIDv4 generated at `qu init`. Used by every peer to
  refer to this node across IP changes and restarts. Do not edit.
- `bind_addr` — Address the daemon listens on. `0.0.0.0` is the
  default. Set it to the overlay address if you only want to expose
  the daemon through an overlay (Tailscale, WireGuard) — see
  [deployment/tailscale.md](deployment/tailscale.md).
- `bind_port` — Defaults to `9901`. Change here if 9901 is taken; the
  cluster does not require port-uniformity, peers just need to know
  what to dial via the `advertise` field.
- `advertise` — Host:port other nodes use to reach this one. Must be
  routable from every peer. Falls back to `bind_addr:bind_port` if
  unset, which is rarely what you want behind NAT.
- `cluster_secret` — Pre-shared base64 string. Required on every
  `Join` RPC; constant-time comparison on the receiver.
  Generate on the first node, distribute out-of-band, keep out of
  version control.

### How `qu init` populates this file

```sh
qu init \
  --advertise alpha.example.com:9901 \
  --bind 0.0.0.0 \
  --port 9901 \
  --secret '<secret>'
```

Idempotent in one direction only: if `node.yaml` exists, `qu init`
refuses to overwrite. To re-init, delete the data directory entirely.

## `cluster.yaml` — replicated state

This is the file that every node converges on. The master is the only
one allowed to bump `version`; followers `Replace` it whole each time
they receive a higher-versioned snapshot.

```yaml
version: 12
updated_at: 2026-05-15T14:01:00Z
updated_by: 7f3a5b9e-...
peers:
  - node_id: 7f3a5b9e-...
    advertise: alpha.example.com:9901
    fingerprint: SHA256:abcd...
    cert_pem: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
checks:
  - id: 0006a1...
    name: homepage
    type: http
    target: https://example.com
    interval: 30s
    timeout: 10s
    expect_status: 200
    alert_ids: [oncall]
    suppress_alert_ids: []
alerts:
  - id: f001ab...
    name: oncall
    type: discord
    default: true
    discord_webhook: https://discord.com/api/webhooks/...
    body_template: |
      :rotating_light: {{.Check.Name}} is {{.Verb}}
```

### Top-level fields

| Field        | Owner    | Notes                                                                   |
| ------------ | -------- | ------------------------------------------------------------------------ |
| `version`    | master   | Monotonic. Followers reject snapshots whose version is ≤ their local.    |
| `updated_at` | master   | UTC RFC3339. Cosmetic — humans use it, no logic depends on it.           |
| `updated_by` | master   | NodeID of the committing master.                                          |
| `peers`      | editable | Cluster members. Edits go through `add_peer` / `remove_peer` mutations.   |
| `checks`     | editable | Monitored targets.                                                        |
| `alerts`     | editable | Notifier destinations.                                                    |

### `peers[]`

```yaml
- node_id: 7f3a5b9e-...   # immutable, the peer's own UUID
  advertise: host:port    # how anyone dials this peer
  fingerprint: SHA256:... # SPKI fingerprint of the peer's cert
  cert_pem: |             # full PEM so other peers can mTLS without a separate invite
    -----BEGIN CERTIFICATE-----
    ...
```

The `cert_pem` field is what enables N-node clusters without N×(N-1)
manual invites: when peer X is added via the master, every other node
that receives the new `cluster.yaml` learns X's cert at the same time
and adds it to the local trust store. See
`internal/daemon/daemon.go:syncTrustFromCluster`.

### `checks[]`

```yaml
- id: 0006a1...           # UUIDv4, generated when the check is created
  name: homepage          # human-friendly, must be unique within cluster
  type: http              # http | tcp | icmp
  target: https://example.com
  interval: 30s           # Go duration syntax: 5s, 1m30s, 2h
  timeout: 10s            # default 10s
  expect_status: 200      # http only; 0 = accept anything < 400
  body_match: "OK"        # http only; substring match on response body
  alert_ids: [oncall]     # alerts attached explicitly
  suppress_alert_ids: []  # opt out of specific default alerts
```

Defaults:

- `interval`: 30s
- `timeout`: 10s
- `expect_status`: 0 → any status below 400 is accepted (matching the
  field comment above); otherwise the configured status must match
  exactly.

ICMP checks default to **unprivileged UDP-mode pings** so the daemon
does not need root. For raw ICMP, grant the capability — see
[deployment/systemd.md](deployment/systemd.md).

### `alerts[]`

Two notifier kinds, distinguished by `type`:

```yaml
# Discord
- id: f001ab...
  name: oncall
  type: discord
  default: true       # attach to every check automatically
  discord_webhook: https://...
  body_template: |    # optional Go text/template override
    {{.Check.Name}} is {{.Verb}}

# SMTP
- id: f002cd...
  name: ops
  type: smtp
  smtp_host: smtp.example.com
  smtp_port: 587
  smtp_user: mailbot
  smtp_password: '...'
  smtp_from: monitor@example.com
  smtp_to: [ops@example.com]
  smtp_starttls: true
  subject_template: '[{{.Verb}}] {{.Check.Name}}'
  body_template: |
    Check {{.Check.Name}} ({{.Check.Target}}) is now {{.Verb}}.
```

If `default: true`, the alert fires for every check unless the check
lists the alert's ID or name in `suppress_alert_ids`. Otherwise the
alert only fires for checks that name it in `alert_ids`.

Templates are Go `text/template`. The full variable list is in the
top-level README under "Custom alert messages" — `qu alert add smtp
--help` and `qu alert add discord --help` print the same table.

### Suppression precedence

For each check, the dispatcher computes the effective alert list as:

```
( explicit alert_ids ∪ alerts with default=true ) \ suppress_alert_ids
```

de-duplicated by alert ID. So a check can both opt in to specific
alerts and opt out of specific defaults.

## `trust.yaml` — local trust store

A flat list of fingerprints this node accepts. One entry per peer,
populated by `qu node add` (or pulled in automatically when a peer's
cert arrives via the replicated `cluster.yaml`).

```yaml
entries:
  - node_id: 7f3a5b9e-...
    address: alpha.example.com:9901
    fingerprint: SHA256:...
    cert_pem: |
      -----BEGIN CERTIFICATE-----
      ...
```

Never edit this by hand. Use `qu trust list` and `qu trust remove`.

## Key material

`keys/private.pem` is the only secret on disk besides
`node.yaml.cluster_secret`. It's chmod 0600 by default; preserve that.
The public cert at `keys/cert.pem` is what gets fingerprinted and
shipped in `cluster.yaml.peers[].cert_pem`.

There is **no automatic key rotation**. Rolling a node's identity
means wiping its data directory, running `qu init` again, and
re-adding it from another node as a fresh peer.

## Tunables that don't live in YAML

A few values are compiled constants. Change them in source and rebuild
if you need different behaviour.

| Constant                                                        | Default | What it does                                                   |
| --------------------------------------------------------------- | ------- | --------------------------------------------------------------- |
| `quorum.DefaultHeartbeatInterval`                                | `1s`    | How often each node heartbeats every peer.                       |
| `quorum.DefaultDeadAfter`                                        | `4s`    | A peer is dead if no heartbeat is seen within this window.       |
| `checks.HysteresisCount`                                         | `2`     | Consecutive aggregate evaluations needed before a state flip.    |
| `checks.ReconcileInterval`                                       | `5s`    | How often the scheduler reconciles its workers vs `checks[]`.    |
| `daemon.manualEditPollInterval` (`internal/daemon/watcher.go`)   | `2s`    | How often the daemon hashes `cluster.yaml` for hand edits.       |

diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md
new file mode 100644
index 0000000..7f22607
--- /dev/null
+++ b/docs/deployment/docker.md
@@ -0,0 +1,198 @@
# Deployment: Docker / docker-compose

The published image is a 14 MB distroless static container with the
`qu` binary as the entrypoint. It runs as root by default so the
daemon can bind privileged ports and open ICMP sockets; override with
`--user` if your host doesn't need that.
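
If you do drop root, two things need attention: the data volume must be
writable by the UID you pick, and unprivileged ICMP needs the
`ping_group_range` sysctl on the host (see [systemd.md](systemd.md)).
A minimal sketch; the UID here is arbitrary, not something the image
defines:

```sh
# Hypothetical unprivileged run. UDP-mode pings require the host sysctl:
#   sysctl -w net.ipv4.ping_group_range="0 2147483647"
docker run --rm --user 65532:65532 \
  -v quptime-data:/etc/quptime \
  -p 9901:9901 \
  git.cer.sh/axodouble/quptime:v0.1.0 serve
```
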
## Image references

```
git.cer.sh/axodouble/quptime:master        # tip of main, multi-arch
git.cer.sh/axodouble/quptime:v0.1.0        # tagged release
git.cer.sh/axodouble/quptime:v0.1.0-amd64  # single-arch (if you must pin)
```

The image embeds `QUPTIME_DIR=/etc/quptime` and declares it a volume —
treat it as the only piece of state worth persisting.

## Single-node, single-container compose

For a development cluster or a single-node smoke test:

```yaml
# compose.yaml
services:
  quptime:
    image: git.cer.sh/axodouble/quptime:v0.1.0
    container_name: quptime
    restart: unless-stopped
    ports:
      - "9901:9901"
    volumes:
      - quptime-data:/etc/quptime
    # ICMP UDP-mode pings need a permissive sysctl on the host:
    #   sysctl net.ipv4.ping_group_range="0 2147483647"
    # Or grant CAP_NET_RAW (more accurate, raw ICMP).
    cap_add:
      - NET_RAW

volumes:
  quptime-data:
```

You must **`qu init` before the daemon will start**. With this compose
file:

```sh
docker compose run --rm quptime init --advertise <host>:9901
docker compose up -d
docker compose exec quptime qu status
```

`<host>` must be reachable from every other node — the loopback
address inside the container is useless to peers.

## Three-node compose on a single host

For local testing of the full quorum machinery without three machines:

```yaml
# compose.yaml
x-quptime: &quptime
  image: git.cer.sh/axodouble/quptime:v0.1.0
  restart: unless-stopped
  cap_add:
    - NET_RAW

services:
  alpha:
    <<: *quptime
    container_name: alpha
    ports: ["9901:9901"]
    volumes: ["alpha-data:/etc/quptime"]

  bravo:
    <<: *quptime
    container_name: bravo
    ports: ["9902:9901"]
    volumes: ["bravo-data:/etc/quptime"]

  charlie:
    <<: *quptime
    container_name: charlie
    ports: ["9903:9901"]
    volumes: ["charlie-data:/etc/quptime"]

volumes:
  alpha-data:
  bravo-data:
  charlie-data:
```

Bootstrap:

```sh
# First node: prints the secret to stdout.
docker compose run --rm alpha init --advertise alpha:9901
# Capture the secret from the printed output, or read it back from the
# volume on the host — the distroless image ships no shell or cat.
# <project> is your compose project name (usually the directory name).
SECRET=$(sudo grep cluster_secret \
  "$(docker volume inspect -f '{{ .Mountpoint }}' <project>_alpha-data)/node.yaml" \
  | awk '{print $2}')

docker compose run --rm bravo init --advertise bravo:9901 --secret "$SECRET"
docker compose run --rm charlie init --advertise charlie:9901 --secret "$SECRET"

docker compose up -d

# Invite from alpha. The hostnames resolve over the compose network.
docker compose exec alpha qu node add bravo:9901
sleep 3  # wait for heartbeats before the next add
docker compose exec alpha qu node add charlie:9901

docker compose exec alpha qu status
```

For a cluster on three separate hosts, replicate the compose file on
each box with different `advertise` addresses (the public hostname or
the overlay IP) and bootstrap the same way.

## Multi-host compose

The natural unit is one compose file per host, each running one
`qu` container. The minimum-viable file per host:

```yaml
# /etc/qu-stack/compose.yaml
services:
  quptime:
    image: git.cer.sh/axodouble/quptime:v0.1.0
    container_name: quptime
    restart: unless-stopped
    ports:
      - "9901:9901"
    volumes:
      - /srv/quptime/data:/etc/quptime
    cap_add:
      - NET_RAW
```

Persistence is a bind-mount under `/srv/quptime/data` so backups and
upgrades hit a known path. See [operations.md](../operations.md) for
the backup recipe.

Inter-host traffic on TCP/9901 must be reachable.
If the boxes don't share a private network, prefer the
[Tailscale recipe](tailscale.md) over exposing 9901 directly — see
[public-internet.md](public-internet.md) for the threat model if you
must expose it.

## Behind a reverse proxy

**Don't.** `qu` is mTLS-pinned at the application layer, so a TLS-
terminating proxy would force the daemon to trust whatever cert the
proxy presents — defeating fingerprint pinning. If you need a single
public address per node, use a Layer 4 TCP proxy (`nginx stream`,
HAProxy `mode tcp`, or a plain firewall NAT) that forwards bytes
without touching them.

## Image internals

Build locally if you want to inspect what you're running (`--load` can
only import a single platform into the local daemon, so pass one
`--platform` at a time if you cross-build):

```sh
docker buildx build \
  --build-arg VERSION=$(git describe --tags --always) \
  --file docker/Dockerfile \
  --tag quptime:dev \
  --load \
  .
```

The Dockerfile (see `docker/Dockerfile`) is two stages: a `golang:1.24-alpine`
builder that cross-compiles with `-trimpath -ldflags "-s -w"`, and a
`gcr.io/distroless/static-debian12` runtime. No shell, no package
manager, no SSH; you cannot `docker exec -it sh` into it. Use
`docker exec quptime qu ...` for everything.

## Healthcheck

The container exits non-zero if the daemon crashes, so the default
`restart: unless-stopped` policy is enough for liveness. A more
useful readiness check reuses the `qu` binary already in the image:

```yaml
healthcheck:
  test: ["CMD", "/usr/local/bin/qu", "status"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s
```

`qu status` exits 0 when the daemon socket is reachable and the
control RPC succeeds — it does **not** fail on quorum loss. That's
intentional: restarting a quorum-less node won't bring quorum back,
and a healthcheck that flaps a follower in and out of `unhealthy`
state every time the master is briefly unreachable is worse than no
check. If you want a stricter readiness signal, pipe `qu status`
through `grep -q 'quorum true'`.

diff --git a/docs/deployment/public-internet.md b/docs/deployment/public-internet.md
new file mode 100644
index 0000000..a7fd80d
--- /dev/null
+++ b/docs/deployment/public-internet.md
@@ -0,0 +1,180 @@
# Deployment: public-internet exposure

If your nodes do not share a private network and you can't put an
overlay between them (see [tailscale.md](tailscale.md)), this is the
recipe for exposing TCP/9901 directly to the open internet without
losing sleep.

The short version: `qu` is designed for this — every inbound call is
mTLS-pinned at the application layer and gated by the cluster secret
— but defence in depth is cheap and you should take it.

## Threat model in one paragraph

Anyone on the internet can establish a TLS connection to `:9901`
because the daemon must accept handshakes from currently-untrusted
peers (otherwise no node could ever join). The RPC dispatcher then
rejects every method except `Join` for callers whose fingerprint
isn't in `trust.yaml`. `Join` itself is gated by the **cluster
secret**, compared in constant time. So the realistic attack surface
is:

1. The TLS 1.3 stack accepting handshakes from arbitrary peers
   (demonstrated below).
2. The `Join` handler's secret check and downstream cert ingestion.
3. The blast radius of a leaked cluster secret (an attacker who has
   it can enrol themselves as a peer and propose mutations, which is
   game over).
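
Surface (1) is easy to observe from any host. The sketch below uses
`openssl` as a stand-in for an untrusted client; the throwaway cert is
purely illustrative, since the TLS layer accepts any client cert and
everything beyond `Join` is refused one layer up:

```sh
# Mint a throwaway identity and handshake with a node. Expect the TLS
# session to establish; expect every RPC except Join to be refused.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=probe" \
  -keyout /tmp/probe.key -out /tmp/probe.crt
openssl s_client -connect alpha.example.com:9901 -tls1_3 \
  -cert /tmp/probe.crt -key /tmp/probe.key </dev/null
```
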
+ +What can't trivially happen: + +- A random attacker observing or modifying cluster traffic — TLS 1.3 + with fingerprint pinning sees to that. +- A random attacker calling any method other than `Join` — the RPC + dispatcher refuses. + +What you should still do: + +- Treat `node.yaml.cluster_secret` like an SSH host key. Out-of-band + distribution only. Never in git, never in CI logs, never in chat. +- Rate-limit and IP-allowlist where you can. The `Join` handler does + not currently rate-limit at the application layer, so a determined + attacker could try secrets at TLS-handshake rate. +- Run on a non-default port if your operations workflow allows it. + Doesn't add security, but reduces background internet noise in the + logs and makes IDS / WAF rules cleaner. + +## Firewall + +### nftables (recommended) + +A drop-in `/etc/nftables.d/quptime.nft`: + +```nft +table inet filter { + set quptime_peers { + type ipv4_addr + elements = { 198.51.100.10, 198.51.100.11, 198.51.100.12 } + } + + chain quptime_input { + # Drop everything that didn't come from a known peer. + ip saddr @quptime_peers tcp dport 9901 accept + tcp dport 9901 log prefix "quptime-drop: " level info drop + } + + chain input { + type filter hook input priority 0; policy drop; + ct state established,related accept + iif lo accept + jump quptime_input + # ... your other rules + } +} +``` + +The allowlist is the highest-ROI mitigation by far — if you maintain +fixed IPs for your monitor nodes, use this and move on. + +### ufw + +```sh +sudo ufw allow from 198.51.100.10 to any port 9901 proto tcp +sudo ufw allow from 198.51.100.11 to any port 9901 proto tcp +sudo ufw allow from 198.51.100.12 to any port 9901 proto tcp +``` + +### Dynamic peer IPs + +If peer IPs aren't fixed (e.g., one node is on a home connection with +a rotating address), you have three options ranked by preference: + +1. Use an overlay instead — see [tailscale.md](tailscale.md). This is + the right answer. +2. DNS-based allowlisting (`ipset`-from-DNS or a small reconciler that + re-resolves an allowlist hostname every minute). Beware: a + compromised DNS resolver becomes a compromise of the allowlist. +3. Drop the allowlist and rely solely on the cluster secret + mTLS. + This is what `qu` is designed to survive; just be sure the secret + actually has the entropy `qu init` generated for it (32 random + bytes, base64-encoded). + +## Rate-limiting failed handshakes + +`qu` does not currently rate-limit `Join` attempts at the application +layer. You can do it at the firewall, which catches both connect +floods and slow brute-force: + +```nft +table inet filter { + chain quptime_input { + tcp dport 9901 ct state new \ + meter quptime_ratemeter { ip saddr limit rate over 10/second } \ + log prefix "quptime-rate: " drop + tcp dport 9901 accept + } +} +``` + +Or `fail2ban` with a tiny custom filter that watches `journalctl -u +quptime` for repeated `peer rejected join` lines: + +```ini +# /etc/fail2ban/filter.d/quptime.conf +[Definition] +failregex = ^.*quptime:.*peer rejected join.*from .*$ +``` + +```ini +# /etc/fail2ban/jail.d/quptime.local +[quptime] +enabled = true +filter = quptime +backend = systemd +journalmatch = _SYSTEMD_UNIT=quptime.service +maxretry = 3 +findtime = 600 +bantime = 86400 +``` + +Note: the daemon doesn't currently log the *peer address* on rejected +joins. The log filter above is illustrative; check what your version +actually emits before relying on it. 
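
`fail2ban-regex` can do that dry-run against the live journal; zero
matches means the `failregex` above doesn't correspond to what this
build actually logs:

```sh
# Test the filter against the journal before enabling the jail.
fail2ban-regex systemd-journal /etc/fail2ban/filter.d/quptime.conf
```
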
+ +## Secret hygiene + +The single most important thing on a public-internet deployment: + +- **Generate the secret on the first node.** `qu init` with no + `--secret` produces 32 random bytes from `crypto/rand`, base64- + encoded. Don't replace that with something memorable. +- **Transport out of band.** Paste it into your secret manager + immediately; share via 1Password / Vault / encrypted email. +- **Rotate if anyone with access has left.** Rotation isn't a CLI + command; do it the brute-force way: `qu init` a fresh cluster on + new ports, re-add every check via `cluster.yaml` export, swap DNS. +- **One secret per cluster.** Do not reuse the secret across staging + and prod, or across customers if you run several clusters. + +## Non-default ports + +```sh +# Each node, in node.yaml — or pass --port on init. +qu init --advertise alpha.example.com:51234 --port 51234 +``` + +Open the corresponding firewall rule, restart the daemon. The +cluster doesn't require uniform ports across nodes; each peer's +`advertise` field tells everyone else what to dial. + +## What you should monitor on a public deployment + +- `term` from `qu status` — if it's ticking up frequently the master + is flapping, which probably means at least one peer's network is + unstable. Could be benign, could be a probe attempt. +- The firewall drop counter on the `quptime-drop` rule above. +- The number of TLS handshakes on `:9901`. A spike in handshakes that + don't progress to a successful RPC is the signature of a brute-force + on the cluster secret. + +For the operational side — backups, upgrades, recovery — see +[operations.md](../operations.md). diff --git a/docs/deployment/systemd.md b/docs/deployment/systemd.md new file mode 100644 index 0000000..f08a466 --- /dev/null +++ b/docs/deployment/systemd.md @@ -0,0 +1,250 @@ +# Deployment: systemd on bare metal / VM + +The canonical way to run `qu` on a Linux host. Single static binary, +managed by systemd, with a hardened unit file. Most production users +should start here. + +## Audience and assumptions + +- You have root (or `sudo`) on the host. +- You have at least three hosts that can reach each other on TCP/9901. + (Three is the minimum for a useful quorum; fewer is fine for + development but a 2-node cluster offers no consensus protection.) +- The hosts have a way to authenticate each other — direct IP or a + resolvable hostname is fine. For overlay networks see + [tailscale.md](tailscale.md). + +## Install the binary + +See [installation.md](../installation.md). The official `install.sh` +script writes a *minimal* unit file that's fine for development. For +production replace it with the hardened version below. + +## Create a dedicated user + +Running as a dedicated unprivileged user is best practice, but ICMP +support adds a wrinkle — see the next section. + +```sh +sudo useradd --system --no-create-home --shell /usr/sbin/nologin quptime +sudo install -d -o quptime -g quptime -m 0750 /etc/quptime +sudo install -d -o quptime -g quptime -m 0750 /var/run/quptime +``` + +## ICMP capabilities + +ICMP probes have two implementations: + +1. **Unprivileged UDP pings** — Linux's `dgram` ICMP socket. Works on + any modern kernel without elevated privileges, but only if + `net.ipv4.ping_group_range` includes the daemon's GID. This is the + default in `qu`. +2. **Raw ICMP** — requires `CAP_NET_RAW`, more accurate latency + numbers and works for IPv6 from arbitrary kernels. + +The simplest path: stick with unprivileged pings and widen +`ping_group_range`. 
Set the sysctl so it persists across reboots:

```sh
# /etc/sysctl.d/10-quptime.conf
net.ipv4.ping_group_range = 0 2147483647
```

```sh
sudo sysctl --system
```

If you need raw ICMP instead, grant the capability on the binary:

```sh
sudo setcap cap_net_raw=+ep /usr/local/bin/qu
```

Note that the file capability is lost every time the `qu` binary is
replaced — bake the `setcap` call into your deploy script, or re-run
it after each package update.

## Hardened unit file

Drop this in `/etc/systemd/system/quptime.service`:

```ini
[Unit]
Description=QUptime distributed uptime monitor
Documentation=https://git.cer.sh/axodouble/quptime
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/qu serve
Restart=always
RestartSec=5s

User=quptime
Group=quptime

# Where state lives. RuntimeDirectory creates /var/run/quptime/ each
# boot owned by User:Group with mode 0750.
Environment=QUPTIME_DIR=/etc/quptime
RuntimeDirectory=quptime
RuntimeDirectoryMode=0750
ReadWritePaths=/etc/quptime /var/run/quptime

# Hardening. Comment out individual directives if a probe needs
# something we've revoked.
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
ProtectClock=true
ProtectHostname=true
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
LockPersonality=true
MemoryDenyWriteExecute=true

# Network access is required (we're a network monitor). Keep address
# families minimal — AF_NETLINK is needed for some libc lookups.
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6 AF_NETLINK

# If you need raw ICMP, *also* uncomment:
# AmbientCapabilities=CAP_NET_RAW
# CapabilityBoundingSet=CAP_NET_RAW
# Otherwise drop all capabilities:
CapabilityBoundingSet=

[Install]
WantedBy=multi-user.target
```

Reload systemd and enable:

```sh
sudo systemctl daemon-reload
sudo systemctl enable quptime.service
```

## Initialise the node

**Don't start the service yet** — `qu init` must run first, and it
must run as the `quptime` user so it creates files with the right
ownership.

On the **first** host (it will print a secret; copy it):

```sh
sudo -u quptime QUPTIME_DIR=/etc/quptime \
  qu init --advertise alpha.example.com:9901
```

On every **other** host (paste the secret):

```sh
sudo -u quptime QUPTIME_DIR=/etc/quptime \
  qu init --advertise bravo.example.com:9901 --secret '<secret>'

sudo -u quptime QUPTIME_DIR=/etc/quptime \
  qu init --advertise charlie.example.com:9901 --secret '<secret>'
```

## Open the firewall

`qu` needs TCP/9901 reachable between cluster members. Adjust to your
firewall:

```sh
# ufw
sudo ufw allow from <peer-ip> to any port 9901 proto tcp

# firewalld
sudo firewall-cmd --permanent --zone=internal \
  --add-rich-rule='rule family=ipv4 source address=<peer-ip> port port=9901 protocol=tcp accept'
sudo firewall-cmd --reload

# nftables (drop-in)
table inet filter {
  chain input {
    ip saddr { 10.0.0.10, 10.0.0.11, 10.0.0.12 } tcp dport 9901 accept
  }
}
```

For exposing 9901 to the open internet see
[public-internet.md](public-internet.md).
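
Before starting the daemons, it's worth confirming the port is
actually open between peers. A plain TCP connect from each host
catches a firewall mistake earlier than a failed `qu node add` would:

```sh
# From alpha; repeat in every direction. Success only proves the TCP
# path is open; the mTLS and trust checks happen later, at the RPC layer.
nc -vz bravo.example.com 9901
nc -vz charlie.example.com 9901
```
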
## Start the daemon

```sh
sudo systemctl start quptime
sudo systemctl status quptime
journalctl -u quptime -f
```

## Invite peers

From one node (typically `alpha`):

```sh
sudo -u quptime qu node add bravo.example.com:9901
# Pause a few seconds so heartbeats reach the new peer before the next add —
# otherwise the "needs ≥2 live to mutate" check rejects the second invite.
sudo -u quptime qu node add charlie.example.com:9901
```

`qu node add` prints each remote's fingerprint and asks for SSH-style
confirmation. Verify it matches what you see over an out-of-band
channel (the remote operator can show their fingerprint with
`sudo -u quptime qu status` or by reading `trust.yaml`).

## Verify

```sh
sudo -u quptime qu status
```

Expect to see all three peers `live=true` and one of them as
`master`.

## Log scraping

`journalctl -u quptime` is the canonical log stream. Notable lines:

| Pattern                                                        | Meaning                                                      |
| --------------------------------------------------------------- | -------------------------------------------------------------- |
| `listening on ... as node ...`                                   | Daemon up.                                                      |
| `manual-edit: cluster.yaml changed externally — replicating…`    | An operator edited `cluster.yaml` directly.                     |
| `manual-edit: parse cluster.yaml: ...`                           | Invalid YAML on disk; the operator must fix and re-save.        |
| `report to master ...: <error>`                                  | A follower couldn't ship a probe result to the master.          |
| `replicate: pull from ...: <error>`                              | A follower couldn't pull a higher-version config snapshot.      |

## Sample reload / restart drill

After editing the unit file:

```sh
sudo systemctl daemon-reload
sudo systemctl restart quptime
```

After editing `cluster.yaml` by hand:

```sh
sudoedit /etc/quptime/cluster.yaml
# No restart needed — the watcher picks it up within 2s and pushes to master.
```

After upgrading the binary:

```sh
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu  # if you use raw ICMP
sudo systemctl restart quptime
```

Doing rolling upgrades? See [operations.md](../operations.md).

diff --git a/docs/deployment/tailscale.md b/docs/deployment/tailscale.md
new file mode 100644
index 0000000..1b6ae43
--- /dev/null
+++ b/docs/deployment/tailscale.md
@@ -0,0 +1,181 @@
# Deployment: Tailscale / WireGuard overlay

When your nodes live in different networks — different VPS providers,
different physical sites, a mix of home and cloud — exposing TCP/9901
to the open internet is a poor idea. An overlay network gives every
node a stable private IP regardless of NAT, and `qu` only needs to
listen on that overlay address.

This page focuses on Tailscale because the repo ships an example
compose for it, but everything generalises to WireGuard, Nebula, or a
self-hosted Headscale.

## The big idea

```
+--- host A (VPS, no public ICMP) ----+
| tailscale ←→ overlay ip 100.64.1.1  |
| qu listening on 100.64.1.1:9901     |
+-------------------------------------+
          │ mTLS over overlay
          ▼
+--- host B (homelab behind NAT) -----+
| tailscale ←→ overlay ip 100.64.1.2  |
| qu listening on 100.64.1.2:9901     |
+-------------------------------------+
```

`bind_addr` is set to the tailscale IP, the host's public interface
has no port 9901 open, and the cluster secret + mTLS handshake gate
the link inside the tunnel.

## Compose recipe

The repo ships [`docker/docker-compose-tailscale.yml`](../../docker/docker-compose-tailscale.yml).
The relevant trick is `network_mode: "service:tailscale"` — the
`quptime` container shares the network namespace of the `tailscale`
sidecar so it sees the tailnet as its own interface.

```yaml
services:
  tailscale:
    image: tailscale/tailscale:latest
    container_name: tailscale
    cap_add: [NET_ADMIN]
    environment:
      - TS_AUTHKEY=${TAILSCALE_AUTHKEY}  # provision via .env
      - TS_HOSTNAME=quptime-${HOST}      # name visible in admin
    volumes:
      - /dev/net/tun:/dev/net/tun
      - tailscale:/var/lib/tailscale
    restart: unless-stopped

  quptime:
    image: git.cer.sh/axodouble/quptime:v0.1.0
    container_name: quptime
    volumes:
      - quptime:/etc/quptime
    network_mode: "service:tailscale"
    depends_on: [tailscale]
    cap_add: [NET_RAW]
    # No restart directive yet — needs `qu init` first.

volumes:
  tailscale:
  quptime:
```

### One-time bootstrap

Each host runs the same script with different `HOST` and `TAILSCALE_AUTHKEY`:

```sh
# .env
HOST=alpha
TAILSCALE_AUTHKEY=tskey-auth-xxxxxxxx
```

Start Tailscale alone first so it gets an IP:

```sh
docker compose up -d tailscale
sleep 5
TSIP=$(docker compose exec tailscale tailscale ip -4)
echo "this node's tailnet IP: $TSIP"
```

On the **first** host, init without `--secret`:

```sh
docker compose run --rm quptime init --advertise "$TSIP:9901"
# Grab the printed secret; pipe through your password manager.
```

On every **other** host, paste the secret:

```sh
docker compose run --rm quptime init \
  --advertise "$TSIP:9901" \
  --secret "$CLUSTER_SECRET"
```

Then bring up `qu` on every node and invite from the first:

```sh
# Each host
docker compose up -d quptime

# From alpha
docker compose exec quptime qu node add 100.64.1.2:9901
sleep 3
docker compose exec quptime qu node add 100.64.1.3:9901

docker compose exec quptime qu status
```

## Tailscale ACLs

Belt and braces — even though mTLS pins identities, lock down the
tailnet itself so only the `qu` nodes can reach each other's :9901.
In the Tailscale admin console:

```jsonc
{
  "tagOwners": { "tag:qu-node": ["group:ops"] },
  "acls": [
    {
      "action": "accept",
      "src": ["tag:qu-node"],
      "dst": ["tag:qu-node:9901"]
    }
    // ...your other rules
  ]
}
```

Then tag every `qu` node in its auth key:

```yaml
environment:
  - TS_AUTHKEY=${TAILSCALE_AUTHKEY}?ephemeral=false&tags=tag:qu-node
```

## WireGuard / Nebula / Headscale equivalents

The recipe generalises:

1. Provision the overlay interface on each host with a stable
   private IP (the tunnel's own address).
2. `qu init --advertise <overlay-ip>:9901`.
3. Set `bind_addr: <overlay-ip>` in `node.yaml` so the daemon does
   **not** also listen on the public interface.
4. Open `:9901` only on the overlay interface in your firewall — for
   nftables that's something like `iifname "wg0" tcp dport 9901
   accept`.

The cluster secret and mTLS fingerprints still apply; the overlay just
removes the open-internet attack surface.

## Why prefer overlay over public exposure

- Single failure domain at the network layer: an attacker who finds an
  exploit in your overlay client (rare; Tailscale and WireGuard are
  small surfaces) still hits the application-layer pinning before any
  cluster-level operation.
- The cluster secret can be lower-entropy when it's already
  unreachable from outside. (You should still treat it as a real
  secret; "defence in depth" only works if every layer is real.)
- ICMP probes from a homelab to a target on the public internet are
  trivial through NAT, but ICMP *into* a homelab usually isn't.
  Running `qu` on a tailnet means peers can heartbeat each other
  regardless of NAT direction.

## Trade-offs

- One more thing to monitor. If your tailnet is down, your monitor is
  down. Counter-measure: run *another* tiny `qu` cluster (or a single
  node) on the public internet that watches the overlay's coordinator
  health.
- Probe latency includes the overlay's hop. Tailscale's WireGuard path
  is fast (<1 ms LAN, single-digit ms WAN) so this rarely matters, but
  if you're alerting on tight latency thresholds, account for it.

diff --git a/docs/installation.md b/docs/installation.md
new file mode 100644
index 0000000..71ac850
--- /dev/null
+++ b/docs/installation.md
@@ -0,0 +1,104 @@
# Installation

`qu` ships as a single static Linux binary. Pick whichever method
matches how you manage software on the host.

> Choosing a deployment recipe instead? Jump to
> [systemd](deployment/systemd.md), [Docker](deployment/docker.md),
> [Tailscale](deployment/tailscale.md), or
> [public-internet](deployment/public-internet.md).

## Pre-built binary (recommended)

Releases are published to the [Gitea releases
page](https://git.cer.sh/axodouble/quptime/releases) with a
`SHA256SUMS` file. Two architectures are built: `linux-amd64` and
`linux-arm64`.

```sh
# Always pin to a tag — `latest` resolves on the server side.
TAG=v0.1.0
ARCH=amd64  # or arm64

# Keep the release filename so it matches its SHA256SUMS entry.
curl -fSLO \
  "https://git.cer.sh/axodouble/quptime/releases/download/${TAG}/qu-${TAG}-linux-${ARCH}"
curl -fSLO \
  "https://git.cer.sh/axodouble/quptime/releases/download/${TAG}/SHA256SUMS"

# Verify before installing.
sha256sum --check --ignore-missing SHA256SUMS

install -m 0755 "qu-${TAG}-linux-${ARCH}" /usr/local/bin/qu
```

## One-line install script

The repo ships an `install.sh` that handles the download, checksum,
shell-completion installation, and a default systemd unit file. Run it
under `sudo` so it can write to `/usr/local/bin` and
`/etc/systemd/system`.

```sh
curl -fsSL https://git.cer.sh/Axodouble/QUptime/raw/branch/master/install.sh | sudo bash
```

What it does:

1. Looks up the latest release via the Gitea API.
2. Downloads the binary to `/usr/local/bin/qu`.
3. Installs bash / zsh / fish completion if a target directory exists.
4. Writes `/etc/systemd/system/qu-serve.service` and enables it (but
   does **not** start it — you need to run `qu init` first).

The unit it writes is minimal. For a production unit with hardening,
see the [systemd deployment guide](deployment/systemd.md).

## Build from source

Requires Go 1.24.2 or newer.

```sh
git clone https://git.cer.sh/axodouble/quptime.git
cd quptime
go build -ldflags "-X main.version=$(git describe --tags --always)" -o qu ./cmd/qu

./qu --version
```

Static binary, no cgo. `CGO_ENABLED=0` is the default on a clean Go
install; if you've enabled cgo globally, set it explicitly:

```sh
CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" -o qu ./cmd/qu
```

## Docker image

A multi-arch (`amd64` + `arm64`) image is published to the Gitea
registry on every tag and every push to `master`:

```
git.cer.sh/axodouble/quptime:master  # tip of main
git.cer.sh/axodouble/quptime:v0.1.0  # tagged release
```

See the [Docker deployment guide](deployment/docker.md) for compose
files and volume layout.
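
Since the image's entrypoint is the `qu` binary itself, a quick pull
and version check confirms the tag is what you expect:

```sh
docker pull git.cer.sh/axodouble/quptime:v0.1.0
docker run --rm git.cer.sh/axodouble/quptime:v0.1.0 --version
```
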
## Verifying the install

```sh
qu --version
qu --help
```

If completions installed, `qu <TAB>` will list subcommands. After
`qu init` you can run `qu status` to confirm the daemon is reachable
over its control socket.

## Next steps

- [Configure the node and the cluster](configuration.md).
- Pick a deployment recipe under [docs/deployment/](deployment/).
- Walk through the [architecture](architecture.md) so the operational
  guarantees are clear before you commit to a topology.

diff --git a/docs/operations.md b/docs/operations.md
new file mode 100644
index 0000000..185c4db
--- /dev/null
+++ b/docs/operations.md
@@ -0,0 +1,225 @@
# Operations

Day-2 tasks: keeping `qu` healthy, upgrading without dropping checks,
backing up state, recovering from failures. Pair this with
[troubleshooting.md](troubleshooting.md) for "the cluster is on fire,
what now" specifics.

## Upgrades

### Rolling upgrade (zero alert loss)

`qu` is built to tolerate one node being absent at a time as long as
quorum still holds. The simple recipe for a 3-node cluster:

```sh
# On each node in turn:
sudo systemctl stop quptime
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu  # if you use raw ICMP
sudo systemctl start quptime

# Wait for the node to rejoin before moving on:
sudo -u quptime qu status  # should show quorum true, all peers live
```

The first node you upgrade may briefly be a follower with a *higher*
binary version than the master. That's fine as long as no on-disk
format changes; the wire protocol and `cluster.yaml` schema are
stable within a minor version, so minor / patch upgrades freely
interleave.

For major-version upgrades that change the on-disk format, the release
notes will spell out the migration. As of v0 there have been none.

### Downgrades

A node that downgrades to an older binary will refuse to start if
`cluster.yaml` contains fields the older version doesn't know. To
roll back across a schema change, either:

- Take the cluster offline and downgrade all nodes simultaneously.
- Restore a `cluster.yaml` from before the schema change on every node
  before starting the downgraded binary.

Within a single minor version, downgrade is symmetrical with upgrade.

### What can go wrong

- **Restarting two nodes at once in a 3-node cluster** loses quorum.
  No mutations succeed, no alerts fire. Quorum returns the moment
  the second node is back.
- **A node that has been offline for a long time** comes back with a
  stale `cluster.yaml`. It will pull the master's higher version
  within ~1 heartbeat. Don't pre-emptively delete its `cluster.yaml`
  — let the catch-up path handle it.

## Backups

Three files matter, in descending order of "pain if lost":

| File               | Why back it up                                                     |
| ------------------ | -------------------------------------------------------------------- |
| `node.yaml`        | Holds the cluster secret. Lose it and the node can't rejoin.          |
| `keys/private.pem` | Lose it and you must `qu init` a fresh identity and re-trust.         |
| `cluster.yaml`     | Resyncs from any other live peer, so per-node backup is optional.     |

### Per-host backup

```sh
#!/bin/sh
# /etc/cron.daily/quptime-backup
set -eu
dst=/var/backups/quptime/$(date +%Y%m%d)
mkdir -p "$dst"
cp -a /etc/quptime/node.yaml "$dst/"
cp -a /etc/quptime/keys "$dst/keys"
cp -a /etc/quptime/cluster.yaml "$dst/cluster.yaml"
chmod -R go-rwx "$dst"
```

### Cluster-wide backup

The cluster state (`peers`, `checks`, `alerts`) is identical across
every node. Back up one healthy node's `cluster.yaml` and you have
the canonical copy. To restore:

```sh
# Stop the daemon.
sudo systemctl stop quptime

# Drop in the backup. Reset the version to 0 so the running cluster's
# higher version supersedes whatever you're holding — otherwise this
# node will broadcast a stale snapshot and confuse everyone.
sudo cp backup-cluster.yaml /etc/quptime/cluster.yaml
sudo sed -i 's/^version:.*/version: 0/' /etc/quptime/cluster.yaml

sudo systemctl start quptime
# Within seconds the version-observer pulls the live version from a peer.
```

If you're restoring **the entire cluster** (every node lost), the
"reset version to 0" trick doesn't apply — there's no peer with a
higher version. Pick the highest-version backup, restore that file
across every node verbatim, and start the daemons. The cluster will
elect a master and continue.

## Replacing a dead node

A node has died permanently. You want to add a fresh box with the
same role.

1. On a surviving node, evict the dead one:

   ```sh
   sudo -u quptime qu node remove <node-id>
   ```

   This drops it from `cluster.yaml` and removes its trust entry. The
   live set's size shrinks by one — verify quorum still holds.

2. On the new host, install `qu` and `qu init` against the existing
   cluster secret:

   ```sh
   sudo -u quptime qu init \
     --advertise delta.example.com:9901 \
     --secret '<secret>'
   sudo systemctl start quptime
   ```

3. From a surviving node, invite the new one:

   ```sh
   sudo -u quptime qu node add delta.example.com:9901
   ```

The dead node's checks and alerts are unaffected — they live in the
replicated `cluster.yaml`, not the dead node's identity.

## Recovering from lost quorum

You've lost more than half the cluster simultaneously. The remaining
nodes refuse to mutate (correct behaviour: they have no way to know
whether the missing nodes are dead or partitioned).

Options:

- **Bring the missing nodes back.** Always the right first move if it's
  possible. The cluster recovers automatically once enough nodes are
  live.
- **Shrink the cluster.** If you've genuinely lost the missing nodes
  permanently and can't bring them back, you need to manually edit
  `cluster.yaml` on every surviving node to remove the dead peers,
  then restart. Be very deliberate:

  ```sh
  # On each surviving node:
  sudo systemctl stop quptime
  sudoedit /etc/quptime/cluster.yaml  # delete the dead peers[] entries,
                                      # bump version to something higher
  sudo systemctl start quptime
  ```

  Make sure every surviving node has identical `cluster.yaml` content
  before restarting any of them. If they don't, you'll get conflicting
  views of who's in the cluster and elections will flap.

- **Start over.** For small clusters this is often faster than the
  manual surgery above: `rm -rf /etc/quptime` everywhere, then
  bootstrap from scratch. You'll lose your checks and alerts unless
  you saved a copy of `cluster.yaml` elsewhere.

## Monitoring `qu` itself

`qu` watches your services. Who watches `qu`?

### From within the cluster

`qu status` is the single source of truth.
The fields to watch:

| Field        | Healthy                | Suspicious                                           |
| ------------ | ---------------------- | ---------------------------------------------------- |
| `quorum`     | `true`                 | `false` — no mutations, no alerts.                   |
| `master`     | a NodeID               | `(none — ...)` — quorum lost or election in flight.  |
| `term`       | slow growth            | rapid growth → master flapping, network unstable.    |
| `config ver` | identical across nodes | divergence → a node is stuck pulling.                |

A simple cron sentinel on each node:

```sh
# cron has no line continuations; keep the whole entry on one line.
*/5 * * * * /usr/local/bin/qu status >/dev/null 2>&1 || curl -fsSL -X POST -d "qu down on $(hostname)" https://alert.example.com/oncall
```

### From outside the cluster

`qu` does not currently expose a Prometheus / OpenMetrics endpoint.
The recommended pattern is to run a *separate*, tiny monitoring path
that doesn't depend on `qu`. Even verifying that a TLS handshake
completes on each node's `:9901` (with `curl -k` or
`openssl s_client`) catches process death. Note that the handshake
can still succeed while the daemon is wedged, so this detects a dead
process, not a hung one.

To produce structured metrics, write a sidecar that parses `qu status`
output and exports counters. The CLI emits stable, machine-greppable
output specifically so this is straightforward.

## Operational checklist before you go to bed

After standing up a new cluster, work through:

- [ ] All nodes show `quorum true` in `qu status`.
- [ ] All nodes show identical `config ver`.
- [ ] All nodes show the same `master`.
- [ ] `journalctl -u quptime --since "10 min ago"` has no
      `propose to master:` or `replicate: pull from:` errors.
- [ ] `qu alert test <name>` reaches your inbox / Discord channel for
      every configured alert.
- [ ] At least one check points at an intentionally bogus target that
      you flip back and forth, verifying the full state-transition →
      dispatch path end-to-end.
- [ ] Backups of `node.yaml` + `keys/` + `cluster.yaml` are landing in
      your backup destination.
- [ ] The firewall allow-list (if any) lists every peer's IP.
- [ ] You've stored the cluster secret somewhere that survives the
      first operator leaving.
diff --git a/docs/security.md b/docs/security.md
new file mode 100644
index 0000000..6399bd3
--- /dev/null
+++ b/docs/security.md
@@ -0,0 +1,153 @@
# Security

The trust model in one page. Read this before deciding where to put
`qu` and who can talk to it.

## What `qu` is trying to defend against

- **Eavesdropping on cluster traffic.** Defended: TLS 1.3 only,
  fingerprint-pinned per peer.
- **MITM on the cluster's inter-node link.** Defended: TLS 1.3 with
  out-of-band fingerprint verification at `qu node add`.
- **A random internet host enrolling itself as a peer.** Defended:
  pre-shared cluster secret on every `Join`.
- **A compromised peer issuing forged cluster-config mutations.** Not
  defended. A peer trusted enough to be in `cluster.yaml.peers` can
  propose mutations through the master. Treat membership as a
  privilege.
- **A compromised peer becoming master.** Election is deterministic on
  the smallest live `NodeID`, so a compromised peer can become master
  if its `NodeID` sorts first. The master can rewrite `cluster.yaml`
  arbitrarily. This is the worst-case blast radius from one compromised
  node.
- **DoS by handshake flood.** Not directly defended at the application
  layer. The TLS stack accepts anyone's handshake; rate-limiting
  belongs at the firewall — see
  [public-internet.md](deployment/public-internet.md) and the sketch
  below.
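A minimal nftables sketch of that rate limit (the `inet filter input`
table and chain names are assumptions; fold the rule into your real
ruleset):

```sh
# Drop new connections to the cluster port beyond a modest rate;
# established peer connections are unaffected.
nft add rule inet filter input tcp dport 9901 ct state new limit rate over 10/minute counter drop
```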
+ +## The three secrets on disk + +| Secret | What it is | Loss impact | +| -------------------------- | ----------------------------------------- | -------------------------------------------- | +| `keys/private.pem` | RSA private key, this node's identity. | Anyone with it can impersonate this node. | +| `node.yaml.cluster_secret` | Pre-shared base64 string. | Anyone with it can `Join` the cluster. | +| `trust.yaml.entries[].cert_pem` | Other peers' public certs (not secrets, but they enable mTLS). | Loss only forces re-trust. | + +The first two are real secrets and live under `0600` permissions in +the data directory. Back them up; never commit them; never paste them +in chat. + +## TLS handshake step by step + +For every inter-node call: + +1. Caller dials peer on its `advertise` address. +2. TLS 1.3 handshake. Both sides present their self-signed leaf cert. +3. The caller's `VerifyPeerCertificate` (set in + `internal/transport/tls.go`) computes the SPKI fingerprint of the + server's cert and compares it against `trust.yaml`. If the caller + knows which `NodeID` it expected, a strict verifier ensures the + fingerprint matches *that specific* entry — not just any trusted + peer. +4. The server's TLS layer accepts any client cert (`RequireAnyClientCert`, + `InsecureSkipVerify: true`) because trust is enforced one layer up. +5. The RPC dispatcher reads the client's cert, computes its + fingerprint, and looks it up in the server's `trust.yaml`. If no + entry exists, only the `Join` method is permitted. +6. `Join` performs a constant-time comparison of the inbound + `ClusterSecret` against `node.yaml.cluster_secret`. Mismatch → + refusal. + +So: + +- An adversary who gets your **public** cert can't impersonate you. +- An adversary who gets your **fingerprint** can't impersonate you. +- An adversary who gets your **private key** *can* impersonate you to + any peer that trusts your fingerprint. + +## The TOFU step + +`qu node add ` runs a one-shot insecure dial against the +target (the only place `InsecureBootstrapConfig` is used in the +codebase, see `internal/transport/tls.go:91`). It fetches the +remote's cert, prints the fingerprint, and asks for confirmation. + +This is **identical** to SSH's first-connection prompt. The operator +must verify the fingerprint out of band — by running `qu status` on +the remote side, or by reading `keys/cert.pem` directly, or via a +known-good distribution channel. + +If you skip verification, you trust the network at that moment. If +the network was MITM'd at exactly that moment, you trust the +attacker. After the prompt, the cert is pinned and the window closes. + +## Cluster secret rotation + +There is no built-in command to rotate the cluster secret. The hard +part isn't generating a new one — it's distributing it consistently +across every node. The pragmatic recipe: + +1. Generate a new secret on one node and copy it to every other node. +2. Update `node.yaml.cluster_secret` on every node (manual edit). +3. Restart each daemon one at a time, verifying quorum returns + between restarts. + +Rotation only protects future `Join` calls, not anything else. If you +suspect the old secret has been seen by an adversary, also assume any +peer that was added during the leaked window is compromised, and +re-init those peers from scratch. 
+ +## Identity rotation + +To roll a node's RSA keypair (e.g., the private key was on a laptop +that got stolen): + +```sh +# On the compromised node: +sudo systemctl stop quptime +sudo rm -rf /etc/quptime +sudo -u quptime qu init \ + --advertise this-host.example.com:9901 \ + --secret '' +sudo systemctl start quptime + +# On a surviving healthy node: +sudo -u quptime qu node remove # evict the old identity +sudo -u quptime qu node add this-host.example.com:9901 +``` + +The new `node_id` is a fresh UUID; the old one is gone for good. Any +historical references to it (e.g., the `updated_by` field on past +versions of `cluster.yaml`) are cosmetic. + +## What the local control socket protects + +`$XDG_RUNTIME_DIR/quptime/quptime.sock` (or `/var/run/quptime/...`) is +the channel the CLI uses to talk to the local daemon. It's `0600` +permissioned and authenticated solely by filesystem ACLs — no TLS, no +secrets in the protocol. + +Anyone who can `read+write` the socket can: + +- Propose cluster mutations (will be relayed to the master). +- Read full cluster state including `cluster.yaml`. +- Trigger test alerts. + +So: don't put the daemon's user in a group that other unprivileged +users share. The default systemd setup with a dedicated `quptime` +user gets this right. + +## Hardening checklist + +- [ ] Dedicated `quptime` system user. +- [ ] Data directory owned by that user, mode 0750. +- [ ] `keys/private.pem` mode 0600. +- [ ] `node.yaml` mode 0600. +- [ ] systemd unit uses `ProtectSystem=strict`, `NoNewPrivileges=true`, + and the rest of the hardening directives in + [systemd.md](deployment/systemd.md). +- [ ] If `:9901` is internet-reachable, firewall allow-list to peer + IPs or use an overlay — see [public-internet.md](deployment/public-internet.md) + and [tailscale.md](deployment/tailscale.md). +- [ ] Cluster secret generated by `qu init` (not chosen by a human), + stored in your secret manager. +- [ ] Backups of `keys/` and `node.yaml` are encrypted at rest. diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..c0e6e6d --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,199 @@ +# Troubleshooting + +The cluster is misbehaving. This page is organised by symptom. Each +entry pairs the user-visible signal with the log line(s) you'll see +in `journalctl -u quptime` and the fix. + +## `qu status` shows `quorum false` + +**What it means.** Fewer than ⌈N/2⌉+1 peers are live. + +**Diagnose.** Look at the PEERS table. The `LIVE` column tells you +which peers this node has stopped hearing from. + +- If only this node is "live" and everyone else is not → this node is + network-isolated. Test: `nc -zv `. Fix: network / + firewall. +- If multiple nodes show false → more than one peer is down. Look at + the other peers' status outputs to triangulate. +- If everyone is live but `quorum false` still → check + `cluster.yaml.peers` length vs. live count; you may have phantom + peer entries left over from a removed-but-not-evicted node. Fix: + `qu node remove ` from any live node. + +## `qu status` shows `master (none — ...)` + +**What it means.** Either no quorum (see above) or election is in +flight. The latter clears within ~1 heartbeat. + +If `term` is incrementing rapidly (`watch qu status`), the master is +flapping. Causes: + +- The currently-elected master is unreachable from some peers but + reachable from others, partial-partition style. Look for log lines + on the suspected master about peers it can't reach. 
+- Heartbeat timeouts (default 4s) are too tight for your inter-node + link. Rebuild with a higher `DefaultDeadAfter` if you need it. + +## A check is stuck in `unknown` + +**What it means.** The aggregator has no fresh reports for that check. + +Possible causes: + +- No node is actually running the probe yet. Probes start ~`interval/10` + after `qu serve` boots and reconcile every 5s. Wait 10s and + re-check. +- Nodes are submitting results but they're stale (older than 3× + interval). Probably means probes are timing out without reporting. +- This is a follower's view; the aggregator runs on the master only. + Check `qu status` on the master to see the canonical view. + +## Alerts not firing + +Walk this list in order; one of them will catch it: + +1. **Is there quorum?** Aggregator runs on master only. No master → + no transitions → no alerts. +2. **Is the alert attached to the check?** `qu status` shows the + effective alert list per check. Empty → no alert. Confirm with + `qu alert list` that the alert exists and (if relying on default + attachment) has `default: true`. +3. **Is the alert suppressed on this check?** Check + `suppress_alert_ids` in `cluster.yaml`. +4. **Test the alert path directly:** + + ```sh + sudo -u quptime qu alert test + ``` + + This bypasses the aggregator and renders a synthetic transition. + If `alert test` doesn't deliver, the problem is the notifier + config or the template — see below. If `alert test` works but real + transitions don't, the aggregator isn't observing the transition. +5. **Has the check actually transitioned?** Aggregator commits a flip + only after **two consecutive** evaluations agree. A bouncing + target may never satisfy the hysteresis. Lower the check interval + or increase reliability of the target. + +## Discord webhook returns 4xx + +The dispatcher logs the HTTP body. Common causes: + +- Webhook revoked / channel deleted → 404. Re-issue and update + `discord_webhook`. +- Body too large → 400. Long templates that pull `Snapshot.Detail` + with multi-line errors can blow past Discord's 2000-char limit. + Shorten the template or trim the variable. +- Rate-limited → 429. Reduce alert frequency or stop suppressing + hysteresis. + +## SMTP refuses the message + +Check the daemon log for `smtp:` lines. Most common: + +- `530 5.7.0 Must issue a STARTTLS command first` → set + `smtp_starttls: true` on the alert. +- `535 Authentication failed` → wrong `smtp_user` / `smtp_password`. +- Connection refused / timeout → firewall between `qu` and the SMTP + relay. Verify with `openssl s_client -starttls smtp -connect host:587`. + +## Manual edit to `cluster.yaml` was ignored + +Symptoms: you edited the file, saved, nothing happened. + +Look for one of these log lines: + +- `manual-edit: parse cluster.yaml: — ignoring` → YAML is + invalid. The daemon pins the bad hash and waits for the next valid + save. Run the file through `yq` or `python -c "import yaml,sys; + yaml.safe_load(open(sys.argv[1]))" cluster.yaml` to diagnose. +- `manual-edit: cluster.yaml changed externally — replicating via + master` followed by `manual-edit: forward to master: no quorum` → + cluster has no quorum, can't accept the edit. Restore quorum first. +- *No log line at all* → the on-disk content didn't change in a way + that matters. The watcher compares only `peers`, `checks`, and + `alerts`; whitespace and comment edits are accepted silently. + +## Two nodes disagree on `config ver` + +The follower with the lower version should pull within one heartbeat. 
If after ~5 seconds the gap persists:

- The follower might not have an `advertise` address for the
  higher-versioned peer. The version observer needs one to pull.
  Check `cluster.yaml.peers` for both sides' `advertise` fields.
- The follower's TLS handshake against the higher-versioned peer is
  failing — look for `replicate: pull from <peer>: <error>` lines.
- The peer with the higher version is announcing it correctly but the
  follower is rejecting the `ApplyClusterCfg` broadcasts because of
  its own decode error — look for transport-layer errors instead.

## "needs ≥2 live to mutate" rejection during bootstrap

You ran two `qu node add` commands back-to-back and the second one
failed. The first add doesn't take effect until the new peer sends
its first heartbeat (≤ 1 second); during that window the cluster has
size 2 and quorum size 2, so a *second* peer add from a 1-live
cluster looks like "mutate without quorum."

Fix: pause ~3 seconds between adds. The README and the systemd guide
both call this out.

## Daemon refuses to start

```
load node.yaml: open ...: no such file or directory
```

Run `qu init` before `qu serve`. The daemon does not auto-init —
silently generating identities and secrets would be a worse failure
mode than crashing.

```
node.yaml has empty node_id — run `qu init` first
```

Same fix.

```
listen tcp :9901: bind: address already in use
```

Another process owns the port. Run `ss -tlnp | grep :9901` to find it.

```
load private key: ...
```

Permissions on `keys/private.pem` are wrong — they should be 0600,
owned by the daemon user. Fix and restart.

## Probes look much slower than expected

ICMP first:

- The default ICMP mode is **unprivileged UDP-mode ping**, not raw
  ICMP. UDP ping is a bit slower and may hit different kernel paths.
  For reference latency, grant `CAP_NET_RAW`.

HTTP / TCP:

- `interval` and `timeout` are the only knobs in `cluster.yaml`. The
  check runs synchronously per worker; if your target takes 9 s to
  respond and your timeout is 10 s, the next probe doesn't start
  until those ~9 s have elapsed. Increase concurrency by adding more
  fast-interval checks against the same target, not by lowering the
  timeout (which will just produce false `down` results).

## I want to start over

```sh
sudo systemctl stop quptime
sudo rm -rf /etc/quptime
sudo -u quptime qu init --advertise <host:port>
sudo systemctl start quptime
```

The data directory is the only state. Wipe it and you're back to a
fresh node.
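After the re-init, a quick sanity pass. These are the same commands
used throughout this guide; what "healthy" output looks like depends
on your version:

```sh
systemctl status quptime --no-pager              # daemon is running
sudo -u quptime qu status                        # fresh node_id, single-node cluster
sudo journalctl -u quptime --since "2 min ago"   # no load/listen errors
```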