# Troubleshooting
The cluster is misbehaving. This page is organised by symptom. Each
entry pairs the user-visible signal with the log line(s) you'll see
in `journalctl -u quptime` and the fix.
## `qu status` shows `quorum false`
**What it means.** Fewer than ⌈N/2⌉+1 peers are live.
**Diagnose.** Look at the PEERS table. The `LIVE` column tells you
which peers this node has stopped hearing from.
- If only this node is "live" and everyone else is not → this node is
network-isolated. Test: `nc -zv <peer-advertise>`. Fix: network /
firewall.
- If multiple nodes show false → more than one peer is down. Look at
the other peers' status outputs to triangulate.
- If every peer shows live but `quorum false` persists → check
  `cluster.yaml.peers` length vs. live count; you may have phantom
  peer entries left over from a removed-but-not-evicted node. Fix:
  `qu node remove <ghost-node-id>` from any live node.
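If the first bullet looks like your situation, sweep every peer's
advertise address from the suspect node to confirm the isolation. A
minimal sketch; the addresses are placeholders for your own
`cluster.yaml.peers` entries:
```sh
# Probe each peer's advertise address; a refused or timed-out connection
# on every peer points at this node's network, not at quptime.
for peer in 10.0.0.2:9901 10.0.0.3:9901; do
  nc -zv -w 2 "${peer%:*}" "${peer#*:}"
done
```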
## `qu status` shows `master (none — ...)`
**What it means.** Either no quorum (see above) or election is in
flight. The latter clears within ~1 heartbeat.
If `term` is incrementing rapidly (`watch qu status`), the master is
flapping. Causes:
- The currently elected master is unreachable from some peers but
reachable from others (a partial partition). Look for log lines
on the suspected master about peers it can't reach.
- Heartbeat timeouts (default 4s) are too tight for your inter-node
link. Rebuild with a higher `DefaultDeadAfter` if you need it.
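To confirm the flapping, watch the `term` field on a couple of nodes at
once. A minimal sketch, assuming `qu status` prints `term` on its own
line as described above:
```sh
# Re-run status every second; a term that climbs every few seconds
# means elections are churning.
watch -n 1 'qu status | grep -i term'
```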
## Primary master came back but the cluster hasn't switched to it
**What it means.** Working as designed. After a peer with a lower
NodeID rejoins, the quorum manager waits `DefaultMasterCooldown`
(2 minutes) before letting it displace the incumbent. The window
prevents a self-monitoring master from flapping the role in lock-step
with its own restart.
How to confirm:
- `qu status` on every node shows the same (current) master and a
steady `term` — not flapping. The lower-NodeID peer is in the live
set but not yet master.
- After ~2 minutes of continuous liveness, `term` bumps once and the
master switches to the lower-NodeID peer.
If you need a different window, change `DefaultMasterCooldown` in
`internal/quorum/manager.go` and rebuild.
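If you do change it, the loop is short. A sketch, assuming a source
checkout and that the constant is declared as
`DefaultMasterCooldown = 2 * time.Minute`:
```sh
# Drop the cooldown to 30s (example value) and rebuild.
sed -i 's/DefaultMasterCooldown = .*/DefaultMasterCooldown = 30 * time.Second/' \
  internal/quorum/manager.go
go build ./...
```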
## A check is stuck in `unknown`
**What it means.** The aggregator has no fresh reports for that check.
Possible causes:
- No node is actually running the probe yet. Probes start ~`interval/10`
after `qu serve` boots and reconcile every 5s. Wait 10s and
re-check.
- Nodes are submitting results but they're stale (older than 3×
interval). Probably means probes are timing out without reporting.
- This is a follower's view; the aggregator runs on the master only.
Check `qu status` on the master to see the canonical view.
## Alerts not firing
Walk this list in order; one of them will catch it:
1. **Is there quorum?** Aggregator runs on master only. No master →
no transitions → no alerts.
2. **Is the alert attached to the check?** `qu status` shows the
effective alert list per check. Empty → no alert. Confirm with
`qu alert list` that the alert exists and (if relying on default
attachment) has `default: true`.
3. **Is the alert suppressed on this check?** Check
`suppress_alert_ids` in `cluster.yaml`.
4. **Test the alert path directly:**
```sh
sudo -u quptime qu alert test <name>
```
This bypasses the aggregator and renders a synthetic transition.
If `alert test` doesn't deliver, the problem is the notifier
config or the template — see below. If `alert test` works but real
transitions don't, the aggregator isn't observing the transition.
5. **Has the check actually transitioned?** Aggregator commits a flip
only after **two consecutive** evaluations agree. A bouncing
target may never satisfy the hysteresis. Lower the check interval
or increase reliability of the target.
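Steps 2 and 3 can be audited from any node in a few seconds. A minimal
sketch; the `grep` just surfaces any suppression blocks for manual
inspection:
```sh
# Step 2: does the alert exist, and is it marked default?
qu alert list
# Step 3: is it suppressed on the check in question?
grep -n -A 3 'suppress_alert_ids' cluster.yaml
```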
## Discord webhook returns 4xx
The dispatcher logs the HTTP body. Common causes:
- Webhook revoked / channel deleted → 404. Re-issue and update
`discord_webhook`.
- Body too large → 400. Long templates that pull `Snapshot.Detail`
with multi-line errors can blow past Discord's 2000-char limit.
Shorten the template or trim the variable.
- Rate-limited → 429. Reduce alert frequency, and don't defeat the
  two-consecutive-evaluation hysteresis described above.
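To separate a dead webhook from a dispatcher problem, post to it by
hand and read the status code. A sketch; the URL is a placeholder for
your `discord_webhook` value:
```sh
# 204 means the webhook itself is fine; 404, 400, and 429 reproduce the
# failures listed above.
curl -sS -o /dev/null -w '%{http_code}\n' \
  -H 'Content-Type: application/json' \
  -d '{"content": "quptime webhook test"}' \
  'https://discord.com/api/webhooks/<id>/<token>'
```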
## SMTP refuses the message
Check the daemon log for `smtp:` lines. Most common:
- `530 5.7.0 Must issue a STARTTLS command first` → set
`smtp_starttls: true` on the alert.
- `535 Authentication failed` → wrong `smtp_user` / `smtp_password`.
- Connection refused / timeout → firewall between `qu` and the SMTP
relay. Verify with `openssl s_client -starttls smtp -connect host:587`.
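If `openssl s_client` connects but delivery still fails, a scripted
end-to-end test isolates auth from relaying. A sketch using `swaks`
(a general-purpose SMTP test tool, not part of QUptime); host,
credentials, and addresses are placeholders:
```sh
# Speak STARTTLS and authenticate the same way the alert would.
swaks --server smtp.example.com --port 587 --tls \
  --auth LOGIN --auth-user alerts@example.com --auth-password 'app-password' \
  --from alerts@example.com --to oncall@example.com
```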
## Manual edit to `cluster.yaml` was ignored
Symptoms: you edited the file, saved, nothing happened.
Look for one of these log lines:
- `manual-edit: parse cluster.yaml: <err> — ignoring` → YAML is
invalid. The daemon pins the bad hash and waits for the next valid
save. Run the file through `yq` or `python -c "import yaml,sys;
yaml.safe_load(open(sys.argv[1]))" cluster.yaml` to diagnose.
- `manual-edit: cluster.yaml changed externally — replicating via
master` followed by `manual-edit: forward to master: no quorum` →
cluster has no quorum, can't accept the edit. Restore quorum first.
- *No log line at all* → the on-disk content didn't change in a way
that matters. The watcher compares only `peers`, `checks`, and
`alerts`; whitespace and comment edits are accepted silently.
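The quickest way to see which of the three cases applies is to ask the
journal directly:
```sh
# Surface the watcher's verdict on recent external edits.
journalctl -u quptime --since "10 minutes ago" | grep manual-edit
```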
## Two nodes disagree on `config ver`
The follower with the lower version should pull within one heartbeat.
If after ~5 seconds the gap persists:
- The follower might not have an `advertise` address for the
higher-versioned peer. The version observer needs one to pull. Check
`cluster.yaml.peers` for both sides' `advertise` fields.
- The follower's TLS handshake against the higher-versioned peer is
failing — look for `replicate: pull from <id>: <err>` lines.
- The peer with the higher version is announcing it correctly but the
follower is rejecting the `ApplyClusterCfg` broadcasts because of
its own decode error — look for transport-layer errors instead.
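A sketch of the first two checks from the follower's side; the peer
address is a placeholder, and the `yq` invocation only pretty-prints
the `peers` block for inspection:
```sh
# Do we know where to pull from?
yq '.peers' cluster.yaml
# Does a TLS handshake against the higher-versioned peer succeed at all?
openssl s_client -connect <peer-advertise>
```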
## "needs ≥2 live to mutate" rejection during bootstrap
You ran two `qu node add` commands back-to-back and the second one
failed. The first add doesn't take effect until the new peer sends
its first heartbeat (≤ 1 second); during that window the cluster has
size 2 and quorum size 2, so a *second* peer add from a 1-live
cluster looks like "mutate without quorum."
Fix: pause ~3 seconds between adds. The README and the systemd guide
both call this out.
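A bootstrap-friendly version of the same two adds; the argument shape
mirrors the other `qu node` examples on this page and may differ from
your exact invocation:
```sh
# Let the first peer heartbeat in before adding the second.
qu node add <first-peer>
sleep 3
qu node add <second-peer>
```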
## Daemon refuses to start
```
load node.yaml: open ...: no such file or directory
```
`qu serve` normally auto-bootstraps a missing `node.yaml` using the
`QUPTIME_*` env vars (see
[configuration.md](configuration.md#auto-init-on-qu-serve)). If you
still see this error, the most likely causes are:
- The data directory is read-only or owned by a different user — the
bootstrap can't write `node.yaml`. Fix permissions on
`$QUPTIME_DIR`.
- Something else removed `node.yaml` mid-run (a config-management
tool, a misconfigured volume). Re-run `qu serve` and it will
rebuild from env, or run `qu init` manually with the flags you
want.
```
node.yaml has empty node_id — run `qu init` first
```
`node.yaml` exists but lacks a `node_id`. Either delete the file and
let auto-init regenerate it, or run `qu init` against a wiped data
dir.
```
listen tcp :9901: bind: address already in use
```
Another process owns the port. `ss -tlnp | grep :9901` to find it.
```
load private key: ...
```
Permissions on `keys/private.pem` are wrong — should be 0600 and owned
by the daemon user. Fix and restart.
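A sketch of that fix, assuming the stock layout (daemon user `quptime`,
data dir `/etc/quptime`):
```sh
# The key must be readable only by the daemon user.
sudo chown quptime:quptime /etc/quptime/keys/private.pem
sudo chmod 0600 /etc/quptime/keys/private.pem
sudo systemctl restart quptime
```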
## Probes look much slower than expected
ICMP first:
- Default ICMP is **unprivileged UDP-mode pings**, not raw ICMP. UDP
ping is a bit slower and may hit different kernel paths. For
reference latency, grant `CAP_NET_RAW`.
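Granting the capability is a one-liner on the binary; the path below is
a placeholder for wherever `qu` is installed:
```sh
# Allow raw ICMP sockets without running the daemon as root.
sudo setcap cap_net_raw+ep /usr/local/bin/qu
sudo systemctl restart quptime
```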
HTTP / TCP:
- `interval` and `timeout` are the only knobs in `cluster.yaml`. The
check runs synchronously per worker; if your target takes 9 s to
respond and your timeout is 10 s, the next probe doesn't start
until those ~9 s have elapsed. Increase concurrency by adding more
fast-interval checks against the same target, not by lowering the
timeout (which will just produce false `down` results).
## I want to start over
```sh
sudo systemctl stop quptime
sudo rm -rf /etc/quptime
sudo -u quptime qu init --advertise <addr>
sudo systemctl start quptime
```
The data directory is the only state. Wipe it and you're back to a
fresh node.
Under Docker (or any env-driven deploy), the explicit `qu init` step
isn't needed — wiping the data volume and restarting the container is
enough; `qu serve` will re-bootstrap from the `QUPTIME_*` env vars.
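One possible shape of that container reset; the container name, volume
name, mount point, image, and `QUPTIME_ADVERTISE` variable are all
assumptions about your deployment:
```sh
docker rm -f quptime
docker volume rm quptime-data
docker run -d --name quptime \
  -v quptime-data:/etc/quptime \
  -e QUPTIME_ADVERTISE=<addr> \
  <your-quptime-image>
```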