# Operations

Day-2 tasks: keeping `qu` healthy, upgrading without dropping checks, backing up state, recovering from failures. Pair this with `troubleshooting.md` for "the cluster is on fire, what now" specifics.

## Upgrades

### Rolling upgrade (zero alert loss)

`qu` is built to tolerate one node being absent at a time, as long as quorum still holds. The simple recipe for a 3-node cluster:

```sh
# On each node in turn:
sudo systemctl stop quptime
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu   # if you use raw ICMP
sudo systemctl start quptime

# Wait for the node to rejoin before moving on:
sudo -u quptime qu status   # should show quorum true, all peers live
```

The first node you upgrade may briefly run as a follower with a newer binary than the master. That's fine as long as there are no on-disk format changes; the wire protocol and `cluster.yaml` schema are stable within a minor version, so minor and patch upgrades can interleave freely.
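
If you drive upgrades from a workstation, the same recipe scripts cleanly. A sketch assuming SSH access to each node and that `qu status` prints a grep-able `quorum true` line (the hostnames are placeholders):

```sh
# Upgrade one node at a time, waiting for quorum before moving on.
for host in alpha.example.com beta.example.com gamma.example.com; do
  scp qu-new "$host":/tmp/qu-new
  ssh "$host" 'sudo systemctl stop quptime &&
               sudo install -m 0755 /tmp/qu-new /usr/local/bin/qu &&
               sudo setcap cap_net_raw=+ep /usr/local/bin/qu &&
               sudo systemctl start quptime'
  until ssh "$host" 'sudo -u quptime qu status | grep -q "quorum *true"'; do
    sleep 5
  done
done
```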

For major-version upgrades that change the on-disk format, the release notes will spell out the migration. As of v0 there have been none.

### Downgrades

A node that downgrades to an older binary will refuse to start if `cluster.yaml` contains fields the older version doesn't recognise. To roll back across a schema change, either:

- Take the cluster offline and downgrade all nodes simultaneously.
- Restore a `cluster.yaml` from before the schema change on every node before starting the downgraded binary.

Within a single minor version, downgrade is symmetrical with upgrade.

### What can go wrong

- Restarting two nodes at once in a 3-node cluster loses quorum: no mutations succeed and no alerts fire until the second node is back, at which point quorum returns immediately. A guard like the sketch below keeps you from stopping a second node too early.
- A node that has been offline for a long time comes back with a stale `cluster.yaml`. It will pull the master's higher version within ~1 heartbeat. Don't pre-emptively delete its `cluster.yaml`; let the catch-up path handle it.
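
A minimal guard sketch against the first failure mode. It assumes `qu status` prints a `quorum true` line when quorum holds; the grep pattern is an assumption to adjust:

```sh
#!/bin/sh
# safe-restart: refuse to bounce this node unless the cluster currently
# reports quorum. On a 3-node cluster, also confirm all peers are live
# first, or stopping this node leaves two absent at once.
set -eu
sudo -u quptime qu status | grep -q 'quorum *true' || {
  echo 'no quorum; refusing to restart quptime' >&2
  exit 1
}
sudo systemctl restart quptime
```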

## Backups

Three files matter, in descending order of "pain if lost":

| File | Why back it up |
| --- | --- |
| `node.yaml` | Holds the cluster secret. Lose it and the node can't rejoin. |
| `keys/private.pem` | Lose it and you must `qu init` a fresh identity and re-trust the node. |
| `cluster.yaml` | Resyncs from any other live peer, so per-node backup is optional. |

### Per-host backup

```sh
#!/bin/sh
# /etc/cron.daily/quptime-backup
set -eu
dst=/var/backups/quptime/$(date +%Y%m%d)
mkdir -p "$dst"
cp -a /etc/quptime/node.yaml         "$dst/"
cp -a /etc/quptime/keys              "$dst/keys"
cp -a /etc/quptime/cluster.yaml      "$dst/cluster.yaml"
chmod -R go-rwx "$dst"
```
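
Dated directories pile up, so pair the backup with a retention sweep. The 30-day window below is an arbitrary choice for this sketch, not a qu recommendation:

```sh
# Remove per-day backup directories older than 30 days.
find /var/backups/quptime -mindepth 1 -maxdepth 1 -type d -mtime +30 \
  -exec rm -rf {} +
```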

### Cluster-wide backup

The cluster state (peers, checks, alerts) is identical across every node. Back up one healthy node's `cluster.yaml` and you have the canonical copy. To restore:

```sh
# Stop the daemon.
sudo systemctl stop quptime

# Drop in the backup. Reset the version to 0 so the running cluster's
# higher version supersedes whatever you're holding — otherwise this
# node will broadcast a stale snapshot and confuse everyone.
sudo cp backup-cluster.yaml /etc/quptime/cluster.yaml
sudo sed -i 's/^version:.*/version: 0/' /etc/quptime/cluster.yaml

sudo systemctl start quptime
# Within seconds the version-observer pulls the live version from a peer.
```

If you're restoring the entire cluster (every node lost), the "reset version to 0" trick doesn't apply: there is no peer with a higher version. Pick the highest-version backup, restore that file verbatim on every node, and start the daemons. The cluster will elect a master and continue.
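
A small sketch for picking that highest-version backup, assuming your backups follow the dated layout of the cron script above and each file carries a top-level `version:` line:

```sh
# Print "version path" pairs, highest version first.
for f in /var/backups/quptime/*/cluster.yaml; do
  awk -v f="$f" '/^version:/ { print $2, f }' "$f"
done | sort -rn | head -n 1
```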

## Replacing a dead node

A node has died permanently. You want to add a fresh box with the same role.

1. On a surviving node, evict the dead one:

   ```sh
   sudo -u quptime qu node remove <dead-node-id>
   ```

   This drops it from `cluster.yaml` and removes its trust entry. The live set shrinks by one, so verify quorum still holds (a quick check follows these steps).

2. On the new host, install `qu` and run `qu init` against the existing cluster secret:

   ```sh
   sudo -u quptime qu init \
     --advertise delta.example.com:9901 \
     --secret '<existing cluster secret>'
   sudo systemctl start quptime
   ```

3. From a surviving node, invite the new one:

   ```sh
   sudo -u quptime qu node add delta.example.com:9901
   ```

The dead node's checks and alerts are unaffected — they live in the replicated `cluster.yaml`, not the dead node's identity.
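
To confirm the swap took, check from any surviving node. The grep pattern is an assumption against `qu status`'s grep-able output:

```sh
# Expect quorum true and the new node (delta here) listed as live,
# with the dead node gone from the peer list.
sudo -u quptime qu status | grep -E 'quorum|delta'
```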

## Recovering from lost quorum

You've lost more than half the cluster simultaneously. The remaining nodes refuse to mutate (correct behaviour: they have no way to know whether the missing nodes are dead or partitioned).

Options:

- Bring the missing nodes back. Always the right first move if it's possible; the cluster recovers automatically once enough nodes are live.

- Shrink the cluster. If you've genuinely lost the missing nodes permanently and can't bring them back, you need to manually edit `cluster.yaml` on every surviving node to remove the dead peers, then restart. Be very deliberate:

  ```sh
  # On each surviving node:
  sudo systemctl stop quptime
  sudoedit /etc/quptime/cluster.yaml   # delete the dead peers[] entries;
                                       # bump version to something higher
  sudo systemctl start quptime
  ```

  Make sure every surviving node has identical `cluster.yaml` content before restarting any of them (the checksum sketch after this list is one way to verify). If they don't, you'll get conflicting views of who's in the cluster and elections will flap.

- Start over. For small clusters this is often faster than the manual surgery above: `rm -rf /etc/quptime` everywhere, then bootstrap from scratch. You'll lose your checks and alerts unless you saved a copy of `cluster.yaml` elsewhere.
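
A minimal consistency check before restarting anything, assuming SSH access to the survivors (hostnames are placeholders):

```sh
# All surviving nodes must produce the same hash before any daemon starts.
for host in alpha.example.com beta.example.com; do
  ssh "$host" 'sha256sum /etc/quptime/cluster.yaml'
done | awk '{print $1}' | sort -u | wc -l   # must print 1
```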

## Monitoring qu itself

`qu` watches your services. Who watches `qu`?

### From within the cluster

`qu status` is the single source of truth. The fields to watch:

| Field | Healthy | Suspicious |
| --- | --- | --- |
| `quorum` | `true` | `false` — no mutations, no alerts. |
| `master` | a NodeID | `(none — ...)` — quorum lost or election in flight. |
| `term` | slow growth | rapid growth → master flapping, network unstable. |
| `master` after a restart of the primary | unchanged for ~2 min, then bumps back | bumps back immediately → cooldown disabled or misconfigured. |
| `config ver` | identical across nodes | divergence → a node is stuck pulling. |

A simple cron sentinel on each node:

```sh
*/5 * * * * /usr/local/bin/qu status >/dev/null 2>&1 || curl -fsSL -X POST -d "qu down on $(hostname)" https://alert.example.com/oncall
```

(Crontab entries must be a single line; classic cron does not support backslash continuations.)

### From outside the cluster

`qu` does not currently expose a Prometheus / OpenMetrics endpoint. The recommended pattern is to run a separate, tiny monitoring path that doesn't depend on `qu`: even a single `curl` health check against each node's `:9901` catches process death. Note that the port is TLS-only, so a handshake can succeed even while the daemon is wedged; the probe proves the process is alive, not that it's healthy.
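
Since the peer port speaks TLS but not necessarily HTTP, `openssl` makes a cleaner probe than `curl`. A sketch, with host and port as examples; under mutual TLS the handshake may be rejected for certificate-less clients, in which case "connection accepted" is still the useful signal:

```sh
# Succeeds if a process is accepting TLS connections on the peer port;
# detects process death, not a wedged daemon.
echo | openssl s_client -connect alpha.example.com:9901 >/dev/null 2>&1 \
  && echo "port up" || echo "port down"
```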

To produce structured metrics, write a sidecar that parses `qu status` output and exports counters. The CLI emits stable, machine-grep-able output specifically so this is straightforward.
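
A minimal sidecar sketch in the node_exporter textfile-collector style; the output path, metric names, and the grep/awk patterns against `qu status` output are all assumptions to adapt:

```sh
#!/bin/sh
# Export a few qu status fields as Prometheus textfile metrics.
# Run from cron or a systemd timer.
set -eu
out=/var/lib/node_exporter/textfile/quptime.prom
if s=$(sudo -u quptime qu status 2>/dev/null); then
  quorum=0
  if echo "$s" | grep -q 'quorum *true'; then quorum=1; fi
  term=$(echo "$s" | awk '/term/ { print $2; exit }')
  printf 'qu_up 1\nqu_quorum %s\nqu_term %s\n' "$quorum" "${term:-0}" > "$out.tmp"
else
  printf 'qu_up 0\n' > "$out.tmp"
fi
mv "$out.tmp" "$out"
```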

## Operational checklist before you go to bed

After standing up a new cluster, work through the following (a scripted spot-check of the first three items comes after the list):

- All nodes show `quorum true` in `qu status`.
- All nodes show identical `config ver`.
- All nodes show the same `master`.
- `journalctl -u quptime --since "10 min ago"` has no `propose to master:` or `replicate: pull from:` errors.
- `qu alert test <name>` reaches your inbox / Discord channel for every configured alert.
- At least one check has an intentional failure (a bogus target) that you flip back and forth to verify the full state-transition → dispatch path end-to-end.
- Backups of `node.yaml` + `keys/` + `cluster.yaml` are landing in your backup destination.
- The firewall allow-list (if any) lists every peer's IP.
- You've stored the cluster secret somewhere that survives the first operator leaving.
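
A minimal spot-check sketch for the first three items, assuming SSH access to every node (hostnames are placeholders) and grep patterns that match `qu status`'s stable output:

```sh
# Collect the quorum / config ver / master lines from every node; after
# sort | uniq -c, each distinct line should appear once per node. A line
# with a lower count means some node disagrees.
for host in alpha.example.com beta.example.com gamma.example.com; do
  ssh "$host" 'sudo -u quptime qu status' | grep -E 'quorum|config ver|master'
done | sort | uniq -c
```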