# Operations

Day-2 tasks: keeping `qu` healthy, upgrading without dropping checks,
backing up state, recovering from failures. Pair this with
[troubleshooting.md](troubleshooting.md) for "the cluster is on fire,
what now" specifics.

## Upgrades

### Rolling upgrade (zero alert loss)

`qu` is built to tolerate one node being absent at a time as long as
quorum still holds. The simple recipe for a 3-node cluster:

```sh
# On each node in turn:
sudo systemctl stop quptime
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu   # if you use raw ICMP
sudo systemctl start quptime

# Wait for the node to rejoin before moving on:
sudo -u quptime qu status   # should show quorum true, all peers live
```

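The "wait for the node to rejoin" step is easy to script instead of eyeballing. A minimal sketch; the exact `qu status` output format is an assumption here (it is treated as `key: value` lines with a `quorum` field), so check the grep against your version first:

```sh
# Poll `qu status` until it reports quorum, or give up after a timeout.
# Assumes status output contains a line like "quorum: true" (hypothetical
# format; adjust the grep to match the real output).
wait_for_quorum() {
    timeout=${1:-60}   # seconds to wait before giving up
    while [ "$timeout" -gt 0 ]; do
        if qu status 2>/dev/null | grep -q 'quorum:[[:space:]]*true'; then
            return 0
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    echo "timed out waiting for quorum" >&2
    return 1
}
```

Call `wait_for_quorum 120` between node restarts; a non-zero exit means stop the rollout and investigate.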
The first node you upgrade may briefly be a follower with a *higher*
binary version than the master. That's fine as long as no on-disk
format changes; the wire protocol and `cluster.yaml` schema are
stable within a minor version, so minor / patch upgrades freely
interleave.

For major-version upgrades that change the on-disk format, the release
notes will spell out the migration. As of v0 there have been none.

### Downgrades

A node that downgrades to an older binary will refuse to start if
`cluster.yaml` contains fields the older version doesn't know. To
roll back across a schema change, either:

- Take the cluster offline and downgrade all nodes simultaneously.
- Restore a `cluster.yaml` from before the schema change on every node
  before starting the downgraded binary.

Within a single minor version, downgrade is symmetrical with upgrade.

### What can go wrong

- **Restarting two nodes at once in a 3-node cluster** loses quorum.
  No mutations succeed, no alerts fire. Quorum returns the moment
  the second node is back.
- **A node that has been offline for a long time** comes back with a
  stale `cluster.yaml`. It will pull the master's higher version
  within ~1 heartbeat. Don't pre-emptively delete its `cluster.yaml`
  — let the catch-up path handle it.

## Backups

Three files matter, in descending order of "pain if lost":

| File               | Why back it up                                                    |
| ------------------ | ----------------------------------------------------------------- |
| `node.yaml`        | Holds the cluster secret. Lose it and the node can't rejoin.      |
| `keys/private.pem` | Lose it and you must `qu init` a fresh identity and re-trust.     |
| `cluster.yaml`     | Resyncs from any other live peer, so per-node backup is optional. |

### Per-host backup

```sh
#!/bin/sh
# /etc/cron.daily/quptime-backup
set -eu
dst=/var/backups/quptime/$(date +%Y%m%d)
mkdir -p "$dst"
cp -a /etc/quptime/node.yaml "$dst/"
cp -a /etc/quptime/keys "$dst/keys"
cp -a /etc/quptime/cluster.yaml "$dst/cluster.yaml"
chmod -R go-rwx "$dst"
```

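The daily dump above grows without bound, so pair it with a prune step. A sketch (the 30-day window, the backup root, and the cron.weekly path are assumptions; match them to your retention policy):

```sh
# Hypothetical companion to the daily backup, e.g. /etc/cron.weekly/
# quptime-backup-prune: delete dated backup dirs older than keep_days.
set -eu
prune_backups() {
    root=$1
    keep_days=$2
    # -mtime +N matches entries last modified more than N days ago
    find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +"$keep_days" \
        -exec rm -rf {} +
}
# prune_backups /var/backups/quptime 30
```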
### Cluster-wide backup

The cluster state (`peers`, `checks`, `alerts`) is identical across
every node. Back up one healthy node's `cluster.yaml` and you have
the canonical copy. To restore:

```sh
# Stop the daemon.
sudo systemctl stop quptime

# Drop in the backup. Reset the version to 0 so the running cluster's
# higher version supersedes whatever you're holding — otherwise this
# node will broadcast a stale snapshot and confuse everyone.
sudo cp backup-cluster.yaml /etc/quptime/cluster.yaml
sudo sed -i 's/^version:.*/version: 0/' /etc/quptime/cluster.yaml

sudo systemctl start quptime
# Within seconds the version-observer pulls the live version from a peer.
```

If you're restoring **the entire cluster** (every node lost), the
"reset version to 0" trick doesn't apply — there's no peer with a
higher version. Pick the highest-version backup, restore that file
across every node verbatim, and start the daemons. The cluster will
elect a master and continue.

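Picking "the highest-version backup" by hand is error-prone when several dumps exist. A small sketch that reads the top-level `version:` field (the one the restore recipe resets) out of each candidate and prints the newest file; the filenames are hypothetical:

```sh
# Print the backup file with the highest top-level `version:` field.
# Usage: newest_backup backup-*.yaml
newest_backup() {
    for f in "$@"; do
        # extract the number after "version:", defaulting to 0 if absent
        v=$(sed -n 's/^version:[[:space:]]*//p' "$f" | head -n 1)
        printf '%s %s\n' "${v:-0}" "$f"
    done | sort -n | tail -n 1 | cut -d' ' -f2-
}
```

Then restore whatever `newest_backup backup-*.yaml` names, rather than trusting file timestamps.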
## Replacing a dead node

A node has died permanently. You want to add a fresh box with the
same role.

1. On a surviving node, evict the dead one:

   ```sh
   sudo -u quptime qu node remove <dead-node-id>
   ```

   This drops it from `cluster.yaml` and removes its trust entry. The
   live set's size shrinks by one — verify quorum still holds.

2. On the new host, install `qu` and `qu init` against the existing
   cluster secret:

   ```sh
   sudo -u quptime qu init \
     --advertise delta.example.com:9901 \
     --secret '<existing cluster secret>'
   sudo systemctl start quptime
   ```

3. From a surviving node, invite the new one:

   ```sh
   sudo -u quptime qu node add delta.example.com:9901
   ```

The dead node's checks and alerts are unaffected — they live in the
replicated `cluster.yaml`, not the dead node's identity.

## Recovering from lost quorum

You've lost more than half the cluster simultaneously. The remaining
nodes refuse to mutate (correct behaviour: they have no way to know
whether the missing nodes are dead or partitioned).

Options:

- **Bring the missing nodes back.** Always the right first move if it's
  possible. The cluster recovers automatically once enough nodes are
  live.
- **Shrink the cluster.** If you've genuinely lost the missing nodes
  permanently and can't bring them back, you need to manually edit
  `cluster.yaml` on every surviving node to remove the dead peers,
  then restart. Be very deliberate:

  ```sh
  # On each surviving node:
  sudo systemctl stop quptime
  sudoedit /etc/quptime/cluster.yaml   # delete the dead peers[] entries
                                       # and bump version to something higher
  sudo systemctl start quptime
  ```

  Make sure every surviving node has identical `cluster.yaml` content
  before restarting any of them. If they don't, you'll get conflicting
  views of who's in the cluster and elections will flap.

- **Start over.** For small clusters this is often faster than the
  manual surgery above: `rm -rf /etc/quptime` everywhere, then
  bootstrap from scratch. You'll lose your checks and alerts unless
  you saved a copy of `cluster.yaml` elsewhere.

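Checking "identical `cluster.yaml` content" by eye is risky; comparing checksums is not. A sketch (fetching each node's copy locally via `scp` is left to you, and the hostnames are hypothetical):

```sh
# Return 0 iff all given files have byte-identical content.
all_identical() {
    first=$(sha256sum "$1" | cut -d' ' -f1)
    shift
    for f in "$@"; do
        [ "$(sha256sum "$f" | cut -d' ' -f1)" = "$first" ] || return 1
    done
}

# e.g. after copying each surviving node's file locally:
#   scp alpha:/etc/quptime/cluster.yaml alpha.yaml    # and so on per node
#   all_identical alpha.yaml beta.yaml || echo "divergent, do not restart"
```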
## Monitoring `qu` itself

`qu` watches your services. Who watches `qu`?

### From within the cluster

`qu status` is the single source of truth. The fields to watch:

| Field        | Healthy                | Suspicious                                          |
| ------------ | ---------------------- | --------------------------------------------------- |
| `quorum`     | `true`                 | `false` — no mutations, no alerts.                  |
| `master`     | a NodeID               | `(none — ...)` — quorum lost or election in flight. |
| `term`       | slow growth            | rapid growth → master flapping, network unstable.   |
| `config ver` | identical across nodes | divergence → a node is stuck pulling.               |

A simple cron sentinel on each node:

```sh
# crontab entries cannot span lines; keep the whole command on one line
*/5 * * * * /usr/local/bin/qu status >/dev/null 2>&1 || curl -fsSL -X POST -d "qu down on $(hostname)" https://alert.example.com/oncall
```

### From outside the cluster

`qu` does not currently expose a Prometheus / OpenMetrics endpoint.
The recommended pattern is to run a *separate* tiny monitoring path
that doesn't depend on `qu`. Even a single `curl` health check against
each node's :9901 catches process death — the port is TLS-only, so a
successful handshake proves the process is up, though not that it's
healthy.

To produce structured metrics, write a sidecar that parses `qu status`
output and exports counters. The CLI emits stable, machine-grep-able
output specifically so this is straightforward.

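As a starting point for such a sidecar, here is a sketch that turns `key: value` lines into Prometheus-style gauge lines. The field names and output shape of `qu status` are assumptions modelled on the table above; adjust the mapping to the real output:

```sh
# Convert hypothetical `qu status` lines like "quorum: true" / "term: 12"
# into Prometheus-style exposition lines ("qu_quorum 1", "qu_term 12").
status_to_metrics() {
    while IFS=: read -r key val; do
        key=$(printf '%s' "$key" | tr ' ' '_')   # "config ver" -> "config_ver"
        val=$(printf '%s' "$val" | tr -d ' ')
        case $val in
            true)  val=1 ;;
            false) val=0 ;;
        esac
        case $val in
            ''|*[!0-9]*) ;;                      # skip non-numeric (e.g. master id)
            *) printf 'qu_%s %s\n' "$key" "$val" ;;
        esac
    done
}
# qu status | status_to_metrics
```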
## Operational checklist before you go to bed

After standing up a new cluster, work through:

- [ ] All nodes show `quorum true` in `qu status`.
- [ ] All nodes show identical `config ver`.
- [ ] All nodes show the same `master`.
- [ ] `journalctl -u quptime --since "10 min ago"` has no
      `propose to master:` or `replicate: pull from:` errors.
- [ ] `qu alert test <name>` reaches your inbox / Discord channel for
      every configured alert.
- [ ] At least one check has an intentional failure (a bogus target)
      that you flip back and forth to verify the full state-transition
      → dispatch path end-to-end.
- [ ] Backups of `node.yaml` + `keys/` + `cluster.yaml` are landing in
      your backup destination.
- [ ] Firewall allow-list (if any) lists every peer's IP.
- [ ] You've stored the cluster secret somewhere that survives the
      first operator leaving.