# Operations
Day-2 tasks: keeping `qu` healthy, upgrading without dropping checks,
backing up state, recovering from failures. Pair this with
[troubleshooting.md](troubleshooting.md) for "the cluster is on fire,
what now" specifics.
## Upgrades
### Rolling upgrade (zero alert loss)
`qu` is built to tolerate one node being absent at a time as long as
quorum still holds. The simple recipe for a 3-node cluster:
```sh
# On each node in turn:
sudo systemctl stop quptime
sudo install -m 0755 qu-new /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu # if you use raw ICMP
sudo systemctl start quptime
# Wait for the node to rejoin before moving on:
sudo -u quptime qu status # should show quorum true, all peers live
```
The first node you upgrade may briefly run as a follower with a *higher*
binary version than the master. That's fine as long as the on-disk
format hasn't changed; the wire protocol and `cluster.yaml` schema are
stable within a minor version, so minor and patch upgrades can be
interleaved freely.
For major-version upgrades that change the on-disk format, the release
notes will spell out the migration. As of v0 there have been none.
### Downgrades
A node that downgrades to an older binary will refuse to start if
`cluster.yaml` contains fields the older version doesn't know. To
roll back across a schema change, either:
- Take the cluster offline and downgrade all nodes simultaneously.
- Restore a `cluster.yaml` from before the schema change on every node
before starting the downgraded binary.
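A conservative sketch that does both at once, assuming the pre-change
copy was kept as `backup-cluster.yaml` and the old binary is at hand as
`qu-old` (both names are placeholders):
```sh
# On every node, with the whole cluster offline:
sudo systemctl stop quptime
sudo cp backup-cluster.yaml /etc/quptime/cluster.yaml
sudo install -m 0755 qu-old /usr/local/bin/qu
sudo setcap cap_net_raw=+ep /usr/local/bin/qu  # if you use raw ICMP
# Start daemons only once every node holds the old binary and the
# pre-change cluster.yaml:
sudo systemctl start quptime
```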
Within a single minor version, downgrade is symmetrical with upgrade.
### What can go wrong
- **Restarting two nodes at once in a 3-node cluster** loses quorum.
No mutations succeed, no alerts fire. Quorum returns the moment
the second node is back.
- **A node that has been offline for a long time** comes back with a
stale `cluster.yaml`. It will pull the master's higher version
within ~1 heartbeat. Don't pre-emptively delete its `cluster.yaml`
— let the catch-up path handle it.
## Backups
Three files matter, in descending order of "pain if lost":
| File | Why back it up |
| ---------------------- | -------------------------------------------------------------------- |
| `node.yaml` | Holds the cluster secret. Lose it and the node can't rejoin. |
| `keys/private.pem` | Lose it and you must `qu init` a fresh identity and re-trust. |
| `cluster.yaml` | Resyncs from any other live peer, so per-node backup is optional. |
### Per-host backup
```sh
#!/bin/sh
# Installed as /etc/cron.daily/quptime-backup (the shebang must stay
# on the first line for run-parts to execute it).
set -eu
dst=/var/backups/quptime/$(date +%Y%m%d)
mkdir -p "$dst"
cp -a /etc/quptime/node.yaml "$dst/"
cp -a /etc/quptime/keys "$dst/keys"
cp -a /etc/quptime/cluster.yaml "$dst/cluster.yaml"
chmod -R go-rwx "$dst"
```
### Cluster-wide backup
The cluster state (`peers`, `checks`, `alerts`) is identical across
every node. Back up one healthy node's `cluster.yaml` and you have
the canonical copy. To restore:
```sh
# Stop the daemon.
sudo systemctl stop quptime
# Drop in the backup. Reset the version to 0 so the running cluster's
# higher version supersedes whatever you're holding — otherwise this
# node will broadcast a stale snapshot and confuse everyone.
sudo cp backup-cluster.yaml /etc/quptime/cluster.yaml
sudo sed -i 's/^version:.*/version: 0/' /etc/quptime/cluster.yaml
sudo systemctl start quptime
# Within seconds the version-observer pulls the live version from a peer.
```
If you're restoring **the entire cluster** (every node lost), the
"reset version to 0" trick doesn't apply — there's no peer with a
higher version. Pick the highest-version backup, restore that file
across every node verbatim, and start the daemons. The cluster will
elect a master and continue.
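A sketch of that whole-cluster restore, assuming backups land in the
dated directories created by the cron script above:
```sh
# Pick the backup holding the highest version; field 3 of grep's
# `path:version: N` output is the number.
grep -H '^version:' /var/backups/quptime/*/cluster.yaml \
  | sort -t: -k3 -n | tail -1
# Then, on every node:
sudo systemctl stop quptime
sudo cp <winning cluster.yaml> /etc/quptime/cluster.yaml
sudo systemctl start quptime
```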
## Replacing a dead node
A node has died permanently. You want to add a fresh box with the
same role.
1. On a surviving node, evict the dead one:
```sh
sudo -u quptime qu node remove <dead-node-id>
```
This drops it from `cluster.yaml` and removes its trust entry. The
live set's size shrinks by one — verify quorum still holds.
2. On the new host, install `qu` and `qu init` against the existing
cluster secret:
```sh
sudo -u quptime qu init \
  --advertise delta.example.com:9901 \
  --secret '<existing cluster secret>'
sudo systemctl start quptime
```
3. From a surviving node, invite the new one:
```sh
sudo -u quptime qu node add delta.example.com:9901
```
The dead node's checks and alerts are unaffected — they live in the
replicated `cluster.yaml`, not the dead node's identity.
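To confirm the swap, check any surviving node; the new peer should be
live and quorum intact:
```sh
sudo -u quptime qu status  # expect quorum true, new node among live peers
```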
## Recovering from lost quorum
You've lost more than half the cluster simultaneously. The remaining
nodes refuse to mutate (correct behaviour: they have no way to know
whether the missing nodes are dead or partitioned).
Options:
- **Bring the missing nodes back.** Always the right first move if it's
possible. The cluster recovers automatically once enough nodes are
live.
- **Shrink the cluster.** If you've genuinely lost the missing nodes
permanently and can't bring them back, you need to manually edit
`cluster.yaml` on every surviving node to remove the dead peers,
then restart. Be very deliberate:
```sh
# On each surviving node:
sudo systemctl stop quptime
# Delete the dead peers[] entries and bump `version` to something
# higher than the value currently in the file:
sudoedit /etc/quptime/cluster.yaml
sudo systemctl start quptime
```
  Make sure every surviving node has identical `cluster.yaml` content
  before restarting any of them (one way to verify is sketched after
  this list). If the files differ, the nodes hold conflicting views of
  who's in the cluster and elections will flap.
- **Start over.** For small clusters this is often faster than the
manual surgery above: `rm -rf /etc/quptime` everywhere, then
bootstrap from scratch. You'll lose your checks and alerts unless
you saved a copy of `cluster.yaml` elsewhere.
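For the shrink option, one way to confirm the surviving nodes' files
match before restarting anything, assuming SSH access (hostnames are
placeholders):
```sh
for h in alpha.example.com beta.example.com; do
  ssh "$h" sha256sum /etc/quptime/cluster.yaml
done
# Every hash must be identical before any daemon starts.
```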
## Monitoring `qu` itself
`qu` watches your services. Who watches `qu`?
### From within the cluster
`qu status` is the single source of truth. The fields to watch:
| Field | Healthy | Suspicious |
| -------------- | -------------- | --------------------------------------------------------- |
| `quorum` | `true` | `false` — no mutations, no alerts. |
| `master` | a NodeID | `(none — ...)` — quorum lost or election in flight. |
| `term` | slow growth | rapid growth → master flapping, network unstable. |
| `master` after a restart of the primary | unchanged for ~2 min, then bumps back | bumps back immediately → cooldown disabled or misconfigured. |
| `config ver` | identical across nodes | divergence → a node is stuck pulling. |
A simple cron sentinel on each node:
```sh
# crontab has no line continuation, so keep the entry on one line:
*/5 * * * * /usr/local/bin/qu status >/dev/null 2>&1 || curl -fsSL -X POST -d "qu down on $(hostname)" https://alert.example.com/oncall
```
### From outside the cluster
`qu` does not currently expose a Prometheus / OpenMetrics endpoint.
The recommended pattern is to run a *separate*, tiny monitoring path
that doesn't depend on `qu`. Even a bare TLS probe of each node's
:9901 catches process death, though not much more: the port is
TLS-only, so the handshake still succeeds while the daemon is wedged.
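One sketch of such a probe, using `openssl s_client` rather than
`curl` so it asserts only the handshake and doesn't assume the peer
speaks HTTP (the hostname and alert endpoint are placeholders):
```sh
# Exits non-zero when nothing is listening or no TLS handshake completes.
if ! echo | openssl s_client -connect delta.example.com:9901 >/dev/null 2>&1; then
  curl -fsSL -X POST -d "qu dead on delta" https://alert.example.com/oncall
fi
```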
To produce structured metrics, write a sidecar that parses `qu status`
output and exports counters. The CLI emits stable, machine-grep-able
output specifically so this is straightforward.
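A minimal sketch of such a sidecar, emitting node_exporter
textfile-collector metrics; the `quorum` and `term` labels parsed
below are assumptions about the exact `qu status` layout, so adjust
the patterns to the real output:
```sh
#!/bin/sh
# Hypothetical sidecar: translate `qu status` into Prometheus metrics.
set -eu
out=/var/lib/node_exporter/textfile/quptime.prom

if ! status=$(sudo -u quptime qu status 2>/dev/null); then
  printf 'quptime_up 0\n' > "$out.tmp" && mv "$out.tmp" "$out"
  exit 0
fi

quorum=$(printf '%s\n' "$status" | awk '$1 == "quorum" { print ($2 == "true" ? 1 : 0) }')
term=$(printf '%s\n' "$status" | awk '$1 == "term" { print $2 }')

{
  printf 'quptime_up 1\n'
  printf 'quptime_quorum %s\n' "${quorum:-0}"
  printf 'quptime_term %s\n' "${term:-0}"
} > "$out.tmp" && mv "$out.tmp" "$out"  # atomic swap for the collector
```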
## Operational checklist before you go to bed
After standing up a new cluster, work through:
- [ ] All nodes show `quorum true` in `qu status`.
- [ ] All nodes show identical `config ver`.
- [ ] All nodes show the same `master`.
- [ ] `journalctl -u quptime --since "10 min ago"` has no
`propose to master:` or `replicate: pull from:` errors.
- [ ] `qu alert test <name>` reaches your inbox / Discord channel for
every configured alert.
- [ ] You've flipped at least one check between a bogus target and a
  real one, verifying the full state-transition → dispatch path
  end-to-end.
- [ ] Backups of `node.yaml` + `keys/` + `cluster.yaml` are landing in
your backup destination.
- [ ] Firewall allow-list (if any) lists every peer's IP.
- [ ] You've stored the cluster secret somewhere that survives the
first operator leaving.
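A quick way to sweep the first three items across the fleet, assuming
SSH access (hostnames are placeholders) and that the `quorum`,
`master`, and `config ver` labels appear verbatim in `qu status`
output:
```sh
for h in alpha.example.com beta.example.com gamma.example.com; do
  printf '== %s ==\n' "$h"
  ssh "$h" "sudo -u quptime qu status" | grep -E 'quorum|master|config ver'
done
```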