# Operations
Day-2 tasks: keeping `qu` healthy, upgrading without dropping checks,
backing up state, recovering from failures. Pair this with
[troubleshooting.md](troubleshooting.md) for "the cluster is on fire,
what now" specifics.
## Upgrades
### Rolling upgrade (zero alert loss)
`qu` is built to tolerate one node being absent at a time as long as
quorum still holds. The simple recipe for a 3-node cluster:
```sh
# On each node in turn:
sudo systemctl stop quptime
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu # if you use raw ICMP
sudo systemctl start quptime
# Wait for the node to rejoin before moving on:
sudo -u quptime qu status # should show quorum true, all peers live
```
The first node you upgrade may briefly run as a follower with a *higher*
binary version than the master. That's fine as long as the on-disk
format hasn't changed; the wire protocol and `cluster.yaml` schema are
stable within a minor version, so minor and patch upgrades can be
interleaved freely.
For major-version upgrades that change the on-disk format, the release
notes will spell out the migration. As of v0 there have been none.
### Downgrades
A node that downgrades to an older binary will refuse to start if
`cluster.yaml` contains fields the older version doesn't know. To
roll back across a schema change, either:
- Take the cluster offline and downgrade all nodes simultaneously.
- Restore a `cluster.yaml` from before the schema change on every node
before starting the downgraded binary.
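A conservative sketch that does both at once, assuming the pre-change
copy was kept as `backup-cluster.yaml` and the old binary is at hand as
`qu-old` (both names are placeholders):
```sh
# On every node, with the whole cluster offline:
sudo systemctl stop quptime
sudo cp backup-cluster.yaml /etc/quptime/cluster.yaml
sudo install -m 0755 qu-old /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu  # if you use raw ICMP
# Start daemons only once every node holds the old binary and the
# pre-change cluster.yaml:
sudo systemctl start quptime
```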
Within a single minor version, downgrade is symmetrical with upgrade.
### What can go wrong
- **Restarting two nodes at once in a 3-node cluster** loses quorum.
No mutations succeed, no alerts fire. Quorum returns the moment
the second node is back.
- **A node that has been offline for a long time** comes back with a
stale `cluster.yaml`. It will pull the master's higher version
within ~1 heartbeat. Don't pre-emptively delete its `cluster.yaml`
— let the catch-up path handle it.
## Backups
Three files matter, in descending order of "pain if lost":
| File | Why back it up |
| ---------------------- | -------------------------------------------------------------------- |
| `node.yaml` | Holds the cluster secret. Lose it and the node can't rejoin. |
| `keys/private.pem` | Lose it and you must `qu init` a fresh identity and re-trust. |
| `cluster.yaml` | Resyncs from any other live peer, so per-node backup is optional. |
### Per-host backup
```sh
#!/bin/sh
# Installed as /etc/cron.daily/quptime-backup (the shebang must stay
# on the first line for run-parts to execute it).
set -eu
dst=/var/backups/quptime/$(date +%Y%m%d)
mkdir -p "$dst"
cp -a /etc/quptime/node.yaml "$dst/"
cp -a /etc/quptime/keys "$dst/keys"
cp -a /etc/quptime/cluster.yaml "$dst/cluster.yaml"
chmod -R go-rwx "$dst"
```
### Cluster-wide backup
The cluster state (`peers`, `checks`, `alerts`) is identical across
every node. Back up one healthy node's `cluster.yaml` and you have
the canonical copy. To restore:
```sh
# Stop the daemon.
sudo systemctl stop quptime
# Drop in the backup. Reset the version to 0 so the running cluster's
# higher version supersedes whatever you're holding — otherwise this
# node will broadcast a stale snapshot and confuse everyone.
sudo cp backup-cluster.yaml /etc/quptime/cluster.yaml
sudo sed -i 's/^version:.*/version: 0/' /etc/quptime/cluster.yaml
sudo systemctl start quptime
# Within seconds the version-observer pulls the live version from a peer.
```
If you're restoring **the entire cluster** (every node lost), the
"reset version to 0" trick doesn't apply — there's no peer with a
higher version. Pick the highest-version backup, restore that file
across every node verbatim, and start the daemons. The cluster will
elect a master and continue.
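A sketch of that whole-cluster restore, assuming backups land in the
dated directories created by the cron script above:
```sh
# Pick the backup holding the highest version; field 3 of grep's
# `path:version: N` output is the number.
grep -H '^version:' /var/backups/quptime/*/cluster.yaml \
  | sort -t: -k3 -n | tail -1
# Then, on every node:
sudo systemctl stop quptime
sudo cp <winning cluster.yaml> /etc/quptime/cluster.yaml
sudo systemctl start quptime
```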
## Replacing a dead node
A node has died permanently. You want to add a fresh box with the
same role.
1. On a surviving node, evict the dead one:
```sh
sudo -u quptime qu node remove <dead-node-id>
```
This drops it from `cluster.yaml` and removes its trust entry. The
live set's size shrinks by one — verify quorum still holds.
2. On the new host, install `qu` and `qu init` against the existing
cluster secret:
```sh
sudo -u quptime qu init \
  --advertise delta.example.com:9901 \
  --secret '<existing cluster secret>'
sudo systemctl start quptime
```
3. From a surviving node, invite the new one:
```sh
sudo -u quptime qu node add delta.example.com:9901
```
The dead node's checks and alerts are unaffected — they live in the
replicated `cluster.yaml`, not the dead node's identity.
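To confirm the swap, check any surviving node; the new peer should be
live and quorum intact:
```sh
sudo -u quptime qu status  # expect quorum true, new node among live peers
```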
## Recovering from lost quorum
You've lost more than half the cluster simultaneously. The remaining
nodes refuse to mutate (correct behaviour: they have no way to know
whether the missing nodes are dead or partitioned).
Options:
- **Bring the missing nodes back.** Always the right first move if it's
possible. The cluster recovers automatically once enough nodes are
live.
- **Shrink the cluster.** If you've genuinely lost the missing nodes
permanently and can't bring them back, you need to manually edit
`cluster.yaml` on every surviving node to remove the dead peers,
then restart. Be very deliberate:
```sh
# On each surviving node:
sudo systemctl stop quptime
# Delete the dead peers[] entries and bump `version` to something
# higher than the value currently in the file:
sudoedit /etc/quptime/cluster.yaml
sudo systemctl start quptime
```
  Make sure every surviving node has identical `cluster.yaml` content
  before restarting any of them (one way to verify is sketched after
  this list). If the files differ, the nodes hold conflicting views of
  who's in the cluster and elections will flap.
- **Start over.** For small clusters this is often faster than the
manual surgery above: `rm -rf /etc/quptime` everywhere, then
bootstrap from scratch. You'll lose your checks and alerts unless
you saved a copy of `cluster.yaml` elsewhere.
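For the shrink option, one way to confirm the surviving nodes' files
match before restarting anything, assuming SSH access (hostnames are
placeholders):
```sh
for h in alpha.example.com beta.example.com; do
  ssh "$h" sha256sum /etc/quptime/cluster.yaml
done
# Every hash must be identical before any daemon starts.
```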
## Monitoring `qu` itself
`qu` watches your services. Who watches `qu`?
### From within the cluster
`qu status` is the single source of truth. The fields to watch:
| Field | Healthy | Suspicious |
| -------------- | -------------- | --------------------------------------------------------- |
| `quorum` | `true` | `false` — no mutations, no alerts. |
| `master` | a NodeID | `(none — ...)` — quorum lost or election in flight. |
| `term` | slow growth | rapid growth → master flapping, network unstable. |
| `master` after a restart of the primary | unchanged for ~2 min, then bumps back | bumps back immediately → cooldown disabled or misconfigured. |
| `config ver` | identical across nodes | divergence → a node is stuck pulling. |
A simple cron sentinel on each node:
```sh
# crontab has no line continuation, so keep the entry on one line:
*/5 * * * * /usr/local/bin/qu status >/dev/null 2>&1 || curl -fsSL -X POST -d "qu down on $(hostname)" https://alert.example.com/oncall
```
### From outside the cluster
`qu` does not currently expose a Prometheus / OpenMetrics endpoint.
The recommended pattern is to run a *separate*, tiny monitoring path
that doesn't depend on `qu`. Even a bare TLS probe of each node's
:9901 catches process death, though not much more: the port is
TLS-only, so the handshake still succeeds while the daemon is wedged.
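One sketch of such a probe, using `openssl s_client` rather than
`curl` so it asserts only the handshake and doesn't assume the peer
speaks HTTP (the hostname and alert endpoint are placeholders):
```sh
# Exits non-zero when nothing is listening or no TLS handshake completes.
if ! echo | openssl s_client -connect delta.example.com:9901 >/dev/null 2>&1; then
  curl -fsSL -X POST -d "qu dead on delta" https://alert.example.com/oncall
fi
```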
To produce structured metrics, write a sidecar that parses `qu status`
output and exports counters. The CLI emits stable, machine-grep-able
output specifically so this is straightforward.
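A minimal sketch of such a sidecar, emitting node_exporter
textfile-collector metrics; the `quorum` and `term` labels parsed
below are assumptions about the exact `qu status` layout, so adjust
the patterns to the real output:
```sh
#!/bin/sh
# Hypothetical sidecar: translate `qu status` into Prometheus metrics.
set -eu
out=/var/lib/node_exporter/textfile/quptime.prom

if ! status=$(sudo -u quptime qu status 2>/dev/null); then
  printf 'quptime_up 0\n' > "$out.tmp" && mv "$out.tmp" "$out"
  exit 0
fi

quorum=$(printf '%s\n' "$status" | awk '$1 == "quorum" { print ($2 == "true" ? 1 : 0) }')
term=$(printf '%s\n' "$status" | awk '$1 == "term" { print $2 }')

{
  printf 'quptime_up 1\n'
  printf 'quptime_quorum %s\n' "${quorum:-0}"
  printf 'quptime_term %s\n' "${term:-0}"
} > "$out.tmp" && mv "$out.tmp" "$out"  # atomic swap for the collector
```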
## Operational checklist before you go to bed
After standing up a new cluster, work through:
- [ ] All nodes show `quorum true` in `qu status`.
- [ ] All nodes show identical `config ver`.
- [ ] All nodes show the same `master`.
- [ ] `journalctl -u quptime --since "10 min ago"` has no
`propose to master:` or `replicate: pull from:` errors.
- [ ] `qu alert test <name>` reaches your inbox / Discord channel for
every configured alert.
- [ ] You've flipped at least one check between a bogus target and a
  real one, verifying the full state-transition → dispatch path
  end-to-end.
- [ ] Backups of `node.yaml` + `keys/` + `cluster.yaml` are landing in
your backup destination.
- [ ] Firewall allow-list (if any) lists every peer's IP.
- [ ] You've stored the cluster secret somewhere that survives the
first operator leaving.
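A quick way to sweep the first three items across the fleet, assuming
SSH access (hostnames are placeholders) and that the `quorum`,
`master`, and `config ver` labels appear verbatim in `qu status`
output:
```sh
for h in alpha.example.com beta.example.com gamma.example.com; do
  printf '== %s ==\n' "$h"
  ssh "$h" "sudo -u quptime qu status" | grep -E 'quorum|master|config ver'
done
```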