# Troubleshooting
The cluster is misbehaving. This page is organised by symptom. Each
entry pairs the user-visible signal with the log line(s) you'll see
in `journalctl -u quptime` and the fix.
## `qu status` shows `quorum false`
**What it means.** Fewer than ⌈N/2⌉+1 peers are live.
**Diagnose.** Look at the PEERS table. The `LIVE` column tells you
which peers this node has stopped hearing from.
- If only this node is "live" and everyone else is not → this node is
network-isolated. Test: `nc -zv <peer-advertise>`. Fix: network /
firewall.
- If multiple nodes show false → more than one peer is down. Look at
the other peers' status outputs to triangulate.
- If every peer shows live but `quorum false` persists → check
  `cluster.yaml.peers` length vs. live count; you may have phantom
  peer entries left over from a removed-but-not-evicted node. Fix:
  `qu node remove <ghost-node-id>` from any live node.
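If the first bullet looks like your situation, sweep every peer's
advertise address from the suspect node to confirm the isolation. A
minimal sketch; the addresses are placeholders for your own
`cluster.yaml.peers` entries:
```sh
# Probe each peer's advertise address; a refused or timed-out connection
# on every peer points at this node's network, not at quptime.
for peer in 10.0.0.2:9901 10.0.0.3:9901; do
  nc -zv -w 2 "${peer%:*}" "${peer#*:}"
done
```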
## `qu status` shows `master (none — ...)`
**What it means.** Either no quorum (see above) or election is in
flight. The latter clears within ~1 heartbeat.
If `term` is incrementing rapidly (`watch qu status`), the master is
flapping. Causes:
- The currently elected master is unreachable from some peers but
reachable from others (a partial partition). Look for log lines
on the suspected master about peers it can't reach.
- Heartbeat timeouts (default 4s) are too tight for your inter-node
link. Rebuild with a higher `DefaultDeadAfter` if you need it.
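To confirm the flapping, watch the `term` field on a couple of nodes at
once. A minimal sketch, assuming `qu status` prints `term` on its own
line as described above:
```sh
# Re-run status every second; a term that climbs every few seconds
# means elections are churning.
watch -n 1 'qu status | grep -i term'
```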
## Primary master came back but the cluster hasn't switched to it
**What it means.** Working as designed. After a peer with a lower
NodeID rejoins, the quorum manager waits `DefaultMasterCooldown`
(2 minutes) before letting it displace the incumbent. The window
prevents a self-monitoring master from flapping the role in lock-step
with its own restart.
How to confirm:
- `qu status` on every node shows the same (current) master and a
steady `term` — not flapping. The lower-NodeID peer is in the live
set but not yet master.
- After ~2 minutes of continuous liveness, `term` bumps once and the
master switches to the lower-NodeID peer.
If you need a different window, change `DefaultMasterCooldown` in
`internal/quorum/manager.go` and rebuild.
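If you do change it, the loop is short. A sketch, assuming a source
checkout and that the constant is declared as
`DefaultMasterCooldown = 2 * time.Minute`:
```sh
# Drop the cooldown to 30s (example value) and rebuild.
sed -i 's/DefaultMasterCooldown = .*/DefaultMasterCooldown = 30 * time.Second/' \
  internal/quorum/manager.go
go build ./...
```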
## A check is stuck in `unknown`
**What it means.** The aggregator has no fresh reports for that check.
Possible causes:
- No node is actually running the probe yet. Probes start ~`interval/10`
after `qu serve` boots and reconcile every 5s. Wait 10s and
re-check.
- Nodes are submitting results but they're stale (older than 3×
interval). Probably means probes are timing out without reporting.
- This is a follower's view; the aggregator runs on the master only.
Check `qu status` on the master to see the canonical view.
## Alerts not firing
Walk this list in order; one of them will catch it:
1. **Is there quorum?** Aggregator runs on master only. No master →
no transitions → no alerts.
2. **Is the alert attached to the check?** `qu status` shows the
effective alert list per check. Empty → no alert. Confirm with
`qu alert list` that the alert exists and (if relying on default
attachment) has `default: true`.
3. **Is the alert suppressed on this check?** Check
`suppress_alert_ids` in `cluster.yaml`.
4. **Test the alert path directly:**
```sh
sudo -u quptime qu alert test <name>
```
This bypasses the aggregator and renders a synthetic transition.
If `alert test` doesn't deliver, the problem is the notifier
config or the template — see below. If `alert test` works but real
transitions don't, the aggregator isn't observing the transition.
5. **Has the check actually transitioned?** Aggregator commits a flip
only after **two consecutive** evaluations agree. A bouncing
target may never satisfy the hysteresis. Lower the check interval
or increase reliability of the target.
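Steps 2 and 3 can be audited from any node in a few seconds. A minimal
sketch; the `grep` just surfaces any suppression blocks for manual
inspection:
```sh
# Step 2: does the alert exist, and is it marked default?
qu alert list
# Step 3: is it suppressed on the check in question?
grep -n -A 3 'suppress_alert_ids' cluster.yaml
```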
## Discord webhook returns 4xx
The dispatcher logs the HTTP body. Common causes:
- Webhook revoked / channel deleted → 404. Re-issue and update
`discord_webhook`.
- Body too large → 400. Long templates that pull `Snapshot.Detail`
with multi-line errors can blow past Discord's 2000-char limit.
Shorten the template or trim the variable.
- Rate-limited → 429. Reduce alert frequency, and don't defeat the
  two-consecutive-evaluation hysteresis described above.
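To separate a dead webhook from a dispatcher problem, post to it by
hand and read the status code. A sketch; the URL is a placeholder for
your `discord_webhook` value:
```sh
# 204 means the webhook itself is fine; 404, 400, and 429 reproduce the
# failures listed above.
curl -sS -o /dev/null -w '%{http_code}\n' \
  -H 'Content-Type: application/json' \
  -d '{"content": "quptime webhook test"}' \
  'https://discord.com/api/webhooks/<id>/<token>'
```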
## SMTP refuses the message
Check the daemon log for `smtp:` lines. Most common:
- `530 5.7.0 Must issue a STARTTLS command first` → set
`smtp_starttls: true` on the alert.
- `535 Authentication failed` → wrong `smtp_user` / `smtp_password`.
- Connection refused / timeout → firewall between `qu` and the SMTP
relay. Verify with `openssl s_client -starttls smtp -connect host:587`.
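If `openssl s_client` connects but delivery still fails, a scripted
end-to-end test isolates auth from relaying. A sketch using `swaks`
(a general-purpose SMTP test tool, not part of QUptime); host,
credentials, and addresses are placeholders:
```sh
# Speak STARTTLS and authenticate the same way the alert would.
swaks --server smtp.example.com --port 587 --tls \
  --auth LOGIN --auth-user alerts@example.com --auth-password 'app-password' \
  --from alerts@example.com --to oncall@example.com
```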
## Manual edit to `cluster.yaml` was ignored
Symptoms: you edited the file, saved, nothing happened.
Look for one of these log lines:
- `manual-edit: parse cluster.yaml: <err> — ignoring` → YAML is
invalid. The daemon pins the bad hash and waits for the next valid
save. Run the file through `yq` or `python -c "import yaml,sys;
yaml.safe_load(open(sys.argv[1]))" cluster.yaml` to diagnose.
- `manual-edit: cluster.yaml changed externally — replicating via
master` followed by `manual-edit: forward to master: no quorum` →
cluster has no quorum, can't accept the edit. Restore quorum first.
- *No log line at all* → the on-disk content didn't change in a way
that matters. The watcher compares only `peers`, `checks`, and
`alerts`; whitespace and comment edits are accepted silently.
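The quickest way to see which of the three cases applies is to ask the
journal directly:
```sh
# Surface the watcher's verdict on recent external edits.
journalctl -u quptime --since "10 minutes ago" | grep manual-edit
```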
## Two nodes disagree on `config ver`
The follower with the lower version should pull within one heartbeat.
If after ~5 seconds the gap persists:
- The follower might not have an `advertise` address for the
higher-versioned peer. The version observer needs one to pull. Check
`cluster.yaml.peers` for both sides' `advertise` fields.
- The follower's TLS handshake against the higher-versioned peer is
failing — look for `replicate: pull from <id>: <err>` lines.
- The peer with the higher version is announcing it correctly but the
follower is rejecting the `ApplyClusterCfg` broadcasts because of
its own decode error — look for transport-layer errors instead.
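A sketch of the first two checks from the follower's side; the peer
address is a placeholder, and the `yq` invocation only pretty-prints
the `peers` block for inspection:
```sh
# Do we know where to pull from?
yq '.peers' cluster.yaml
# Does a TLS handshake against the higher-versioned peer succeed at all?
openssl s_client -connect <peer-advertise>
```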
## "needs ≥2 live to mutate" rejection during bootstrap
You ran two `qu node add` commands back-to-back and the second one
failed. The first add doesn't take effect until the new peer sends
its first heartbeat (≤ 1 second); during that window the cluster has
size 2 and quorum size 2, so a *second* peer add from a 1-live
cluster looks like "mutate without quorum."
Fix: pause ~3 seconds between adds. The README and the systemd guide
both call this out.
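A bootstrap-friendly version of the same two adds; the argument shape
mirrors the other `qu node` examples on this page and may differ from
your exact invocation:
```sh
# Let the first peer heartbeat in before adding the second.
qu node add <first-peer>
sleep 3
qu node add <second-peer>
```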
## Daemon refuses to start
```
load node.yaml: open ...: no such file or directory
```
`qu serve` normally auto-bootstraps a missing `node.yaml` using the
`QUPTIME_*` env vars (see
[configuration.md](configuration.md#auto-init-on-qu-serve)). If you
still see this error, the most likely causes are:
- The data directory is read-only or owned by a different user — the
bootstrap can't write `node.yaml`. Fix permissions on
`$QUPTIME_DIR`.
- Something else removed `node.yaml` mid-run (a config-management
tool, a misconfigured volume). Re-run `qu serve` and it will
rebuild from env, or run `qu init` manually with the flags you
want.
```
node.yaml has empty node_id — run `qu init` first
```
`node.yaml` exists but lacks a `node_id`. Either delete the file and
let auto-init regenerate it, or run `qu init` against a wiped data
dir.
```
listen tcp :9901: bind: address already in use
```
Another process owns the port. `ss -tlnp | grep :9901` to find it.
```
load private key: ...
```
Permissions on `keys/private.pem` are wrong — should be 0600 and owned
by the daemon user. Fix and restart.
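A sketch of that fix, assuming the stock layout (daemon user `quptime`,
data dir `/etc/quptime`):
```sh
# The key must be readable only by the daemon user.
sudo chown quptime:quptime /etc/quptime/keys/private.pem
sudo chmod 0600 /etc/quptime/keys/private.pem
sudo systemctl restart quptime
```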
## Probes look much slower than expected
ICMP first:
- Default ICMP is **unprivileged UDP-mode pings**, not raw ICMP. UDP
ping is a bit slower and may hit different kernel paths. For
reference latency, grant `CAP_NET_RAW`.
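Granting the capability is a one-liner on the binary; the path below is
a placeholder for wherever `qu` is installed:
```sh
# Allow raw ICMP sockets without running the daemon as root.
sudo setcap cap_net_raw+ep /usr/local/bin/qu
sudo systemctl restart quptime
```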
HTTP / TCP:
- `interval` and `timeout` are the only knobs in `cluster.yaml`. The
check runs synchronously per worker; if your target takes 9 s to
respond and your timeout is 10 s, the next probe doesn't start
until those ~9 s have elapsed. Increase concurrency by adding more
fast-interval checks against the same target, not by lowering the
timeout (which will just produce false `down` results).
## I want to start over
```sh
sudo systemctl stop quptime
sudo rm -rf /etc/quptime
sudo -u quptime qu init --advertise <addr>
sudo systemctl start quptime
```
The data directory is the only state. Wipe it and you're back to a
fresh node.
Under Docker (or any env-driven deploy), the explicit `qu init` step
isn't needed — wiping the data volume and restarting the container is
enough; `qu serve` will re-bootstrap from the `QUPTIME_*` env vars.
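One possible shape of that container reset; the container name, volume
name, mount point, image, and `QUPTIME_ADVERTISE` variable are all
assumptions about your deployment:
```sh
docker rm -f quptime
docker volume rm quptime-data
docker run -d --name quptime \
  -v quptime-data:/etc/quptime \
  -e QUPTIME_ADVERTISE=<addr> \
  <your-quptime-image>
```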