# Troubleshooting
The cluster is misbehaving. This page is organised by symptom. Each entry pairs the user-visible signal with the log line(s) you'll see in `journalctl -u quptime` and the fix.
## `qu status` shows `quorum false`
**What it means.** Fewer than ⌊N/2⌋+1 peers are live — the cluster has lost its majority.
**Diagnose.** Look at the `PEERS` table. The `LIVE` column tells you which peers this node has stopped hearing from.
- If only this node is live and everyone else is not → this node is network-isolated. Test: `nc -zv <peer-advertise>`. Fix: network / firewall.
- If multiple nodes show `false` → more than one peer is down. Look at the other peers' `status` outputs to triangulate.
- If everyone is live but `quorum false` still → check `cluster.yaml.peers` length vs. live count; you may have phantom peer entries left over from a removed-but-not-evicted node. Fix: `qu node remove <ghost-node-id>` from any live node.
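The majority rule above can be sanity-checked in a few lines of Python. This is an illustrative sketch assuming the standard floor(N/2)+1 majority quorum; `quorum_size` and `has_quorum` are hypothetical helpers, not part of `qu`:

```python
def quorum_size(n_peers: int) -> int:
    """Minimum live peers for a majority quorum: floor(N/2) + 1."""
    return n_peers // 2 + 1

def has_quorum(n_peers: int, n_live: int) -> bool:
    # `qu status` reports quorum false once the live count drops below this.
    return n_live >= quorum_size(n_peers)

for n in (2, 3, 5):
    print(n, quorum_size(n))
```

Note the N=2 case: quorum size is 2, which is why the bootstrap section below rejects mutations from a 1-live, 2-node cluster.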
## `qu status` shows `master (none — ...)`
**What it means.** Either no quorum (see above) or an election is in flight. The latter clears within ~1 heartbeat.
If `term` is incrementing rapidly (watch `qu status`), the master is flapping. Causes:

- The currently-elected master is unreachable from some peers but reachable from others, partial-partition style. Look for log lines on the suspected master about peers it can't reach.
- Heartbeat timeouts (default 4 s) are too tight for your inter-node link. Rebuild with a higher `DefaultDeadAfter` if you need it.
## Primary master came back but the cluster hasn't switched to it
**What it means.** Working as designed. After a returning peer with a lower `NodeID` rejoins, the quorum manager waits `DefaultMasterCooldown` (2 minutes) before letting it displace the incumbent. The window prevents a self-monitoring master from flapping the role in lock-step with its own restart.
How to confirm:
- `qu status` on every node shows the same (current) master and a steady `term` — not flapping. The lower-`NodeID` peer is in the live set but not yet master.
- After ~2 minutes of continuous liveness, `term` bumps once and the master switches to the lower-`NodeID` peer.
If you need a different window, change `DefaultMasterCooldown` in `internal/quorum/manager.go` and rebuild.
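The cooldown logic can be pictured as a pure function. This is an illustrative sketch of the 2-minute window described above; the function and variable names are hypothetical, not the actual `internal/quorum/manager.go` code:

```python
from datetime import datetime, timedelta

DEFAULT_MASTER_COOLDOWN = timedelta(minutes=2)  # mirrors DefaultMasterCooldown

def may_displace(candidate_live_since: datetime, now: datetime,
                 cooldown: timedelta = DEFAULT_MASTER_COOLDOWN) -> bool:
    """A returning lower-NodeID peer may take the master role only after
    it has been continuously live for the full cooldown window."""
    return now - candidate_live_since >= cooldown

t0 = datetime(2024, 1, 1, 12, 0, 0)
print(may_displace(t0, t0 + timedelta(seconds=30)))  # still inside the window
print(may_displace(t0, t0 + timedelta(minutes=3)))   # window elapsed, may switch
```

If the candidate's liveness lapses, its `candidate_live_since` clock resets, which is why the doc stresses *continuous* liveness.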
## A check is stuck in `unknown`
**What it means.** The aggregator has no fresh reports for that check.
Possible causes:
- No node is actually running the probe yet. Probes start ~`interval/10` after `qu serve` boots and reconcile every 5 s. Wait 10 s and re-check.
- Nodes are submitting results but they're stale (older than 3× `interval`). Probably means probes are timing out without reporting.
- This is a follower's view; the aggregator runs on the master only. Check `qu status` on the master to see the canonical view.
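The freshness rule above (a report older than 3× the check interval counts as stale) can be sketched as follows; `is_stale` is an illustrative helper, not a `qu` internal:

```python
def is_stale(report_age_s: float, interval_s: float) -> bool:
    """Treat a result as stale once it is older than 3x the check interval."""
    return report_age_s > 3 * interval_s

# A 30 s check whose last report landed 95 s ago is stale -> check shows unknown.
print(is_stale(95, 30))
# A 60 s-old report for the same check is still within the 90 s window.
print(is_stale(60, 30))
```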
## Alerts not firing
Walk this list in order; one of them will catch it:
- **Is there quorum?** The aggregator runs on the master only. No master → no transitions → no alerts.
- **Is the alert attached to the check?** `qu status` shows the effective alert list per check. Empty → no alert. Confirm with `qu alert list` that the alert exists and (if relying on default attachment) has `default: true`.
- **Is the alert suppressed on this check?** Check `suppress_alert_ids` in `cluster.yaml`.
- **Test the alert path directly:** `sudo -u quptime qu alert test <name>`. This bypasses the aggregator and renders a synthetic transition. If `alert test` doesn't deliver, the problem is the notifier config or the template — see below. If `alert test` works but real transitions don't, the aggregator isn't observing the transition.
- **Has the check actually transitioned?** The aggregator commits a flip only after two consecutive evaluations agree. A bouncing target may never satisfy the hysteresis. Lower the check interval or increase the reliability of the target.
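Putting the attachment and suppression checks together, the relevant `cluster.yaml` shape looks roughly like this. The `default` and `suppress_alert_ids` fields come from the checklist above; the surrounding layout (`alerts:`, `checks:`, `name`, `url`) is assumed for illustration:

```yaml
alerts:
  - name: ops-discord
    default: true            # attached to every check unless suppressed
checks:
  - name: api
    url: https://api.example.com/health
    suppress_alert_ids:      # this check opts out of ops-discord
      - ops-discord
```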
## Discord webhook returns 4xx
The dispatcher logs the HTTP body. Common causes:
- Webhook revoked / channel deleted → 404. Re-issue and update `discord_webhook`.
- Body too large → 400. Long templates that pull `Snapshot.Detail` with multi-line errors can blow past Discord's 2000-char limit. Shorten the template or trim the variable.
- Rate-limited → 429. Reduce alert frequency or stop suppressing hysteresis.
## SMTP refuses the message
Check the daemon log for `smtp:` lines. Most common:
- `530 5.7.0 Must issue a STARTTLS command first` → set `smtp_starttls: true` on the alert.
- `535 Authentication failed` → wrong `smtp_user` / `smtp_password`.
- Connection refused / timeout → firewall between `qu` and the SMTP relay. Verify with `openssl s_client -starttls smtp -connect host:587`.
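For reference, an alert entry that satisfies the STARTTLS and auth requirements might look like this. Only `smtp_starttls`, `smtp_user`, and `smtp_password` appear in the errors above; the other field names are assumptions for illustration:

```yaml
alerts:
  - name: mail-ops
    smtp_host: relay.example.com   # assumed field name
    smtp_port: 587
    smtp_starttls: true            # clears the "530 ... STARTTLS" rejection
    smtp_user: quptime@example.com
    smtp_password: "app-password-here"
```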
## Manual edit to `cluster.yaml` was ignored
Symptoms: you edited the file, saved, nothing happened.
Look for one of these log lines:
- `manual-edit: parse cluster.yaml: <err> — ignoring` → YAML is invalid. The daemon pins the bad hash and waits for the next valid save. Run the file through `yq` or `python -c "import yaml,sys; yaml.safe_load(open(sys.argv[1]))" cluster.yaml` to diagnose.
- `manual-edit: cluster.yaml changed externally — replicating via master` followed by `manual-edit: forward to master: no quorum` → cluster has no quorum, can't accept the edit. Restore quorum first.
- No log line at all → the on-disk content didn't change in a way that matters. The watcher compares only `peers`, `checks`, and `alerts`; whitespace and comment edits are accepted silently.
## Two nodes disagree on config `ver`
The follower with the lower version should pull within one heartbeat. If after ~5 seconds the gap persists:
- The follower might not have an `advertise` address for the higher-versioned peer. The version observer needs one to pull. Check `cluster.yaml.peers` for both sides' `advertise` fields.
- The follower's TLS handshake against the higher-versioned peer is failing — look for `replicate: pull from <id>: <err>` lines.
- The peer with the higher version is announcing it correctly but the follower is rejecting the `ApplyClusterCfg` broadcasts because of its own decode error — look for transport-layer errors instead.
## "needs ≥2 live to mutate" rejection during bootstrap
You ran two `qu node add` commands back-to-back and the second one failed. The first add doesn't take effect until the new peer sends its first heartbeat (≤ 1 second); during that window the cluster has size 2 and quorum size 2, so a second peer add from a 1-live cluster looks like "mutate without quorum."
Fix: pause ~3 seconds between adds. The README and the systemd guide both call this out.
## Daemon refuses to start
### `load node.yaml: open ...: no such file or directory`
`qu serve` normally auto-bootstraps a missing `node.yaml` using the `QUPTIME_*` env vars (see `configuration.md`). If you still see this error, the most likely causes are:
- The data directory is read-only or owned by a different user — the bootstrap can't write `node.yaml`. Fix permissions on `$QUPTIME_DIR`. The fastest fix on a standard install is just to re-run `install.sh` — it reasserts the canonical ownership and modes on the whole tree without touching your config.
- Something else removed `node.yaml` mid-run (a config-management tool, a misconfigured volume). Re-run `qu serve` and it will rebuild from env, or run `qu init` manually with the flags you want.
### ``node.yaml has empty node_id — run `qu init` first``
`node.yaml` exists but lacks a `node_id`. Either delete the file and let auto-init regenerate it, or run `qu init` against a wiped data dir.
### `listen tcp :9901: bind: address already in use`
Another process owns the port. Run `ss -tlnp | grep :9901` to find it.
### `load private key: ...`
Permissions on `keys/private.pem` are wrong — should be `0600` and owned by the daemon user. Fix and restart. Re-running `install.sh` on a standard install is the easiest path: it repairs ownership and modes on the entire data dir.
## Probes look much slower than expected
ICMP first:
- Default ICMP is unprivileged UDP-mode pings, not raw ICMP. UDP ping is a bit slower and may hit different kernel paths. For reference latency, grant `CAP_NET_RAW`.
HTTP / TCP:
- `interval` and `timeout` are the only knobs in `cluster.yaml`. The check is run synchronously per worker; if your target takes 9 s to respond and your timeout is 10 s, the next probe doesn't start until ~9 s have elapsed. Increase concurrency by adding more fast-interval checks against the same target, not by lowering the timeout (which will just produce false `down` results).
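The synchronous-worker behaviour means the effective probe period is bounded below by the target's response time. A small illustrative calculation (not `qu` code; the model assumes the worker waits for the response, capped at the timeout, before scheduling the next probe):

```python
def effective_period_s(interval_s: float, response_time_s: float,
                       timeout_s: float) -> float:
    """The worker is busy until the response arrives (or the timeout fires),
    so the real probe period is whichever is longer: interval or busy time."""
    busy = min(response_time_s, timeout_s)
    return max(interval_s, busy)

# 5 s interval, 9 s responses, 10 s timeout -> probes really fire every ~9 s.
print(effective_period_s(5, 9, 10))
```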
## I want to start over
```
sudo systemctl stop quptime
sudo rm -rf /etc/quptime
sudo -u quptime qu init --advertise <addr>
sudo systemctl start quptime
```
The data directory is the only state. Wipe it and you're back to a fresh node.
Under Docker (or any env-driven deploy), the explicit `qu init` step isn't needed — wiping the data volume and restarting the container is enough; `qu serve` will re-bootstrap from the `QUPTIME_*` env vars.