# Troubleshooting

The cluster is misbehaving. This page is organised by symptom. Each entry pairs the user-visible signal with the log line(s) you'll see in `journalctl -u quptime` and the fix.

## `qu status` shows `quorum false`

**What it means.** Fewer than ⌊N/2⌋+1 peers are live (a strict majority is required).

**Diagnose.** Look at the PEERS table. The LIVE column tells you which peers this node has stopped hearing from.

  • If only this node is "live" and everyone else is not → this node is network-isolated. Test: `nc -zv <peer-advertise>`. Fix: network / firewall.
  • If multiple nodes show false → more than one peer is down. Look at the other peers' status outputs to triangulate.
  • If everyone is live but quorum is still false → check cluster.yaml.peers length vs. live count; you may have phantom peer entries left over from a removed-but-not-evicted node. Fix: `qu node remove <ghost-node-id>` from any live node.
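
For reference, assuming the usual strict-majority rule, the threshold works out like this (plain shell arithmetic, illustrative only):

```shell
# Strict-majority quorum: floor(N/2) + 1 peers must be live.
for n in 2 3 4 5; do
  echo "N=$n quorum=$(( n / 2 + 1 ))"
done
# N=2 quorum=2, N=3 quorum=2, N=4 quorum=3, N=5 quorum=3
```

So a 3-node cluster survives one dead peer, while a 2-node cluster survives none.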

## `qu status` shows `master (none — ...)`

**What it means.** Either there is no quorum (see above) or an election is in flight. The latter clears within ~1 heartbeat.

If `term` is incrementing rapidly (`watch qu status`), the master is flapping. Causes:

  • The currently-elected master is unreachable from some peers but reachable from others, partial-partition style. Look for log lines on the suspected master about peers it can't reach.
  • Heartbeat timeouts (default 4s) are too tight for your inter-node link. Rebuild with a higher DefaultDeadAfter if you need it.

## A check is stuck in `unknown`

**What it means.** The aggregator has no fresh reports for that check.

Possible causes:

  • No node is actually running the probe yet. Probes start ~interval/10 after qu serve boots and reconcile every 5s. Wait 10s and re-check.
  • Nodes are submitting results but they're stale (older than 3× interval). Probably means probes are timing out without reporting.
  • This is a follower's view; the aggregator runs on the master only. Check qu status on the master to see the canonical view.
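
The staleness cutoff in the second bullet is easy to check by hand; a minimal sketch with made-up numbers:

```shell
# A report older than 3x the check interval counts as stale (rule above).
interval=30        # seconds; illustrative check interval
report_age=95      # seconds since the node's last submitted result
if [ "$report_age" -gt $(( 3 * interval )) ]; then
  echo "stale -> unknown"
else
  echo "fresh"
fi
# prints: stale -> unknown   (95 s > 90 s cutoff)
```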

## Alerts not firing

Walk this list in order; one of them will catch it:

  1. Is there quorum? Aggregator runs on master only. No master → no transitions → no alerts.

  2. Is the alert attached to the check? qu status shows the effective alert list per check. Empty → no alert. Confirm with qu alert list that the alert exists and (if relying on default attachment) has default: true.

  3. Is the alert suppressed on this check? Check suppress_alert_ids in cluster.yaml.

  4. Test the alert path directly:

    sudo -u quptime qu alert test <name>
    

    This bypasses the aggregator and renders a synthetic transition. If alert test doesn't deliver, the problem is the notifier config or the template — see below. If alert test works but real transitions don't, the aggregator isn't observing the transition.

  5. Has the check actually transitioned? Aggregator commits a flip only after two consecutive evaluations agree. A bouncing target may never satisfy the hysteresis. Lower the check interval or increase reliability of the target.
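
Point 5's two-in-a-row rule can be simulated in a few lines of shell. This is a toy model, not the aggregator's actual code; it only mirrors the "two consecutive evaluations agree" behaviour:

```shell
# Commit a state flip only when two consecutive results agree.
state="up"
prev=""
for result in down up down down up; do   # illustrative evaluation stream
  if [ "$result" = "$prev" ] && [ "$result" != "$state" ]; then
    state=$result
    echo "transition -> $state"
  fi
  prev=$result
done
echo "final state: $state"
# prints: transition -> down
#         final state: down
```

Note how the isolated single results never commit; a target that bounces every evaluation can sit in its old state indefinitely, which is exactly the "never satisfies the hysteresis" case above.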

## Discord webhook returns 4xx

The dispatcher logs the HTTP body. Common causes:

  • Webhook revoked / channel deleted → 404. Re-issue and update `discord_webhook`.
  • Body too large → 400. Long templates that pull `Snapshot.Detail` with multi-line errors can blow past Discord's 2000-char limit. Shorten the template or trim the variable.
  • Rate-limited → 429. Reduce alert frequency, or re-enable hysteresis if you have been suppressing it.
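
For the 400 case, one workaround is trimming the long variable before it reaches the notifier. Discord caps message content at 2000 characters, so truncating with some headroom is enough (shell sketch; names are illustrative):

```shell
# Trim a long detail string to stay under Discord's 2000-char content cap.
detail=$(printf 'x%.0s' $(seq 1 3000))          # stand-in for a long multi-line error
trimmed=$(printf '%s' "$detail" | cut -c1-1900)  # leave room for the rest of the template
echo "${#trimmed}"
# prints: 1900
```

`cut -c` is character-oriented per POSIX but byte-oriented in some implementations; for multi-byte payloads use a smarter trim.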

## SMTP refuses the message

Check the daemon log for `smtp:` lines. Most common:

  • `530 5.7.0 Must issue a STARTTLS command first` → set `smtp_starttls: true` on the alert.
  • `535 Authentication failed` → wrong `smtp_user` / `smtp_password`.
  • Connection refused / timeout → firewall between qu and the SMTP relay. Verify with `openssl s_client -starttls smtp -connect host:587`.

## Manual edit to cluster.yaml was ignored

Symptoms: you edited the file, saved, nothing happened.

Look for one of these log lines:

  • `manual-edit: parse cluster.yaml: <err> — ignoring` → YAML is invalid. The daemon pins the bad hash and waits for the next valid save. Run the file through `yq` or `python -c "import yaml,sys; yaml.safe_load(open(sys.argv[1]))" cluster.yaml` to diagnose.
  • `manual-edit: cluster.yaml changed externally — replicating via master` followed by `manual-edit: forward to master: no quorum` → the cluster has no quorum and can't accept the edit. Restore quorum first.
  • No log line at all → the on-disk content didn't change in a way that matters. The watcher compares only peers, checks, and alerts; whitespace and comment edits are accepted silently.
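
Here is the python one-liner from the first bullet in action against a deliberately broken file (requires PyYAML; the path is illustrative):

```shell
# Feed an unterminated YAML document to the validator.
printf 'peers: [\n' > /tmp/bad-cluster.yaml
python3 -c "import yaml,sys; yaml.safe_load(open(sys.argv[1]))" /tmp/bad-cluster.yaml \
  || echo "invalid YAML"
# stdout: invalid YAML   (the parse traceback appears on stderr)
```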

## Two nodes disagree on config `ver`

The follower with the lower version should pull within one heartbeat. If after ~5 seconds the gap persists:

  • The follower might not have an advertise address for the higher-versioned peer. The version observer needs one to pull. Check `cluster.yaml.peers` for both sides' `advertise` fields.
  • The follower's TLS handshake against the higher-versioned peer is failing — look for `replicate: pull from <id>: <err>` lines.
  • The peer with the higher version is announcing it correctly but the follower is rejecting the `ApplyClusterCfg` broadcasts because of its own decode error — look for transport-layer errors instead.

"needs ≥2 live to mutate" rejection during bootstrap

You ran two `qu node add` commands back-to-back and the second one failed. The first add doesn't take effect until the new peer sends its first heartbeat (≤ 1 second); during that window the cluster has size 2 and quorum size 2, so a second peer add from a 1-live cluster looks like "mutate without quorum."

Fix: pause ~3 seconds between adds. The README and the systemd guide both call this out.
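
The arithmetic behind that window, in shell (mirrors the size-2/quorum-2 numbers above; values are illustrative):

```shell
# After the first add: cluster size 2, but the new peer hasn't heartbeated yet.
size=2
live=1
quorum=$(( size / 2 + 1 ))   # strict majority of 2 is 2
if [ "$live" -lt "$quorum" ]; then
  echo "needs >=$quorum live to mutate"
fi
# prints: needs >=2 live to mutate
```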

## Daemon refuses to start

### `load node.yaml: open ...: no such file or directory`

Run `qu init` before `qu serve`. The daemon does not auto-init — silently generating identities and secrets would be a worse failure mode than crashing.

### node.yaml has empty node_id — run `qu init` first

Same fix.

### `listen tcp :9901: bind: address already in use`

Another process owns the port. `ss -tlnp | grep :9901` to find it.

### `load private key: ...`

Permissions on `keys/private.pem` are wrong — should be 0600 and owned by the daemon user. Fix and restart.
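
A quick way to verify and fix the mode, sketched against a scratch file (`stat -c` is GNU coreutils; on a real install substitute keys/private.pem and chown to the daemon user):

```shell
# Tighten a key file to 0600, the mode the daemon expects.
key=$(mktemp)        # stand-in for keys/private.pem
chmod 644 "$key"     # the broken, world-readable state
chmod 600 "$key"     # the fix
stat -c '%a' "$key"  # prints: 600
rm -f "$key"
```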

## Probes look much slower than expected

ICMP first:

  • Default ICMP is unprivileged UDP-mode pings, not raw ICMP. UDP ping is a bit slower and may hit different kernel paths. For reference latency, grant `CAP_NET_RAW`.

HTTP / TCP:

  • `interval` and `timeout` are the only knobs in cluster.yaml. The check runs synchronously per worker; if your target takes 9 s to respond and your timeout is 10 s, the next probe can't start until the current one returns at ~9 s. Increase concurrency by adding more fast-interval checks against the same target, not by lowering `timeout` (which will just produce false down results).
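
Why lowering `timeout` backfires, in numbers (illustrative values):

```shell
# A 9 s responder probed with a 5 s timeout is reported down even though it is up.
response_time=9
timeout=5
if [ "$response_time" -gt "$timeout" ]; then
  echo "probe result: down (timed out)"
else
  echo "probe result: up"
fi
# prints: probe result: down (timed out)
```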

## I want to start over

    sudo systemctl stop quptime
    sudo rm -rf /etc/quptime
    sudo -u quptime qu init --advertise <addr>
    sudo systemctl start quptime

The data directory is the only state. Wipe it and you're back to a fresh node.