# Troubleshooting
The cluster is misbehaving. This page is organised by symptom. Each entry pairs the user-visible signal with the log line(s) you'll see in `journalctl -u quptime` and the fix.
## `qu status` shows `quorum false`
**What it means.** Fewer than ⌊N/2⌋+1 peers are live — the cluster has lost its majority.
**Diagnose.** Look at the `PEERS` table. The `LIVE` column tells you which peers this node has stopped hearing from.
- If only this node is live and everyone else is not → this node is network-isolated. Test: `nc -zv <peer-advertise>`. Fix: network / firewall.
- If multiple nodes show `false` → more than one peer is down. Look at the other peers' `status` outputs to triangulate.
- If everyone is live but `quorum false` still → check `cluster.yaml.peers` length vs. live count; you may have phantom peer entries left over from a removed-but-not-evicted node. Fix: `qu node remove <ghost-node-id>` from any live node.
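The majority rule above can be sanity-checked in a few lines of Python. This is an illustrative sketch assuming the standard floor(N/2)+1 majority quorum; `quorum_size` and `has_quorum` are hypothetical helpers, not part of `qu`:

```python
def quorum_size(n_peers: int) -> int:
    """Minimum live peers for a majority quorum: floor(N/2) + 1."""
    return n_peers // 2 + 1

def has_quorum(n_peers: int, n_live: int) -> bool:
    # `qu status` reports quorum false once the live count drops below this.
    return n_live >= quorum_size(n_peers)

for n in (2, 3, 5):
    print(n, quorum_size(n))
```

Note the N=2 case: quorum size is 2, which is why the bootstrap section below rejects mutations from a 1-live, 2-node cluster.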
## `qu status` shows `master (none — ...)`
**What it means.** Either no quorum (see above) or an election is in flight. The latter clears within ~1 heartbeat.
If `term` is incrementing rapidly (watch `qu status`), the master is flapping. Causes:

- The currently-elected master is unreachable from some peers but reachable from others, partial-partition style. Look for log lines on the suspected master about peers it can't reach.
- Heartbeat timeouts (default 4 s) are too tight for your inter-node link. Rebuild with a higher `DefaultDeadAfter` if you need it.
## Primary master came back but the cluster hasn't switched to it
**What it means.** Working as designed. After a returning peer with a lower `NodeID` rejoins, the quorum manager waits `DefaultMasterCooldown` (2 minutes) before letting it displace the incumbent. The window prevents a self-monitoring master from flapping the role in lock-step with its own restart.
How to confirm:
- `qu status` on every node shows the same (current) master and a steady `term` — not flapping. The lower-`NodeID` peer is in the live set but not yet master.
- After ~2 minutes of continuous liveness, `term` bumps once and the master switches to the lower-`NodeID` peer.
If you need a different window, change `DefaultMasterCooldown` in `internal/quorum/manager.go` and rebuild.
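The cooldown logic can be pictured as a pure function. This is an illustrative sketch of the 2-minute window described above; the function and variable names are hypothetical, not the actual `internal/quorum/manager.go` code:

```python
from datetime import datetime, timedelta

DEFAULT_MASTER_COOLDOWN = timedelta(minutes=2)  # mirrors DefaultMasterCooldown

def may_displace(candidate_live_since: datetime, now: datetime,
                 cooldown: timedelta = DEFAULT_MASTER_COOLDOWN) -> bool:
    """A returning lower-NodeID peer may take the master role only after
    it has been continuously live for the full cooldown window."""
    return now - candidate_live_since >= cooldown

t0 = datetime(2024, 1, 1, 12, 0, 0)
print(may_displace(t0, t0 + timedelta(seconds=30)))  # still inside the window
print(may_displace(t0, t0 + timedelta(minutes=3)))   # window elapsed, may switch
```

If the candidate's liveness lapses, its `candidate_live_since` clock resets, which is why the doc stresses *continuous* liveness.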
## A check is stuck in `unknown`
**What it means.** The aggregator has no fresh reports for that check.
Possible causes:
- No node is actually running the probe yet. Probes start ~`interval/10` after `qu serve` boots and reconcile every 5 s. Wait 10 s and re-check.
- Nodes are submitting results but they're stale (older than 3× `interval`). Probably means probes are timing out without reporting.
- This is a follower's view; the aggregator runs on the master only. Check `qu status` on the master to see the canonical view.
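The freshness rule above (a report older than 3× the check interval counts as stale) can be sketched as follows; `is_stale` is an illustrative helper, not a `qu` internal:

```python
def is_stale(report_age_s: float, interval_s: float) -> bool:
    """Treat a result as stale once it is older than 3x the check interval."""
    return report_age_s > 3 * interval_s

# A 30 s check whose last report landed 95 s ago is stale -> check shows unknown.
print(is_stale(95, 30))
# A 60 s-old report for the same check is still within the 90 s window.
print(is_stale(60, 30))
```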
## Alerts not firing
Walk this list in order; one of them will catch it:
- **Is there quorum?** The aggregator runs on the master only. No master → no transitions → no alerts.
- **Is the alert attached to the check?** `qu status` shows the effective alert list per check. Empty → no alert. Confirm with `qu alert list` that the alert exists and (if relying on default attachment) has `default: true`.
- **Is the alert suppressed on this check?** Check `suppress_alert_ids` in `cluster.yaml`.
- **Test the alert path directly:** `sudo -u quptime qu alert test <name>`. This bypasses the aggregator and renders a synthetic transition. If `alert test` doesn't deliver, the problem is the notifier config or the template — see below. If `alert test` works but real transitions don't, the aggregator isn't observing the transition.
- **Has the check actually transitioned?** The aggregator commits a flip only after two consecutive evaluations agree. A bouncing target may never satisfy the hysteresis. Lower the check interval or increase the reliability of the target.
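Putting the attachment and suppression checks together, the relevant `cluster.yaml` shape looks roughly like this. The `default` and `suppress_alert_ids` fields come from the checklist above; the surrounding layout (`alerts:`, `checks:`, `name`, `url`) is assumed for illustration:

```yaml
alerts:
  - name: ops-discord
    default: true            # attached to every check unless suppressed
checks:
  - name: api
    url: https://api.example.com/health
    suppress_alert_ids:      # this check opts out of ops-discord
      - ops-discord
```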
## Discord webhook returns 4xx
The dispatcher logs the HTTP body. Common causes:
- Webhook revoked / channel deleted → 404. Re-issue and update `discord_webhook`.
- Body too large → 400. Long templates that pull `Snapshot.Detail` with multi-line errors can blow past Discord's 2000-char limit. Shorten the template or trim the variable.
- Rate-limited → 429. Reduce alert frequency or stop suppressing hysteresis.
## SMTP refuses the message
Check the daemon log for `smtp:` lines. Most common:
- `530 5.7.0 Must issue a STARTTLS command first` → set `smtp_starttls: true` on the alert.
- `535 Authentication failed` → wrong `smtp_user` / `smtp_password`.
- Connection refused / timeout → firewall between `qu` and the SMTP relay. Verify with `openssl s_client -starttls smtp -connect host:587`.
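For reference, an alert entry that satisfies the STARTTLS and auth requirements might look like this. Only `smtp_starttls`, `smtp_user`, and `smtp_password` appear in the errors above; the other field names are assumptions for illustration:

```yaml
alerts:
  - name: mail-ops
    smtp_host: relay.example.com   # assumed field name
    smtp_port: 587
    smtp_starttls: true            # clears the "530 ... STARTTLS" rejection
    smtp_user: quptime@example.com
    smtp_password: "app-password-here"
```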
## Manual edit to `cluster.yaml` was ignored
Symptoms: you edited the file, saved, nothing happened.
Look for one of these log lines:
- `manual-edit: parse cluster.yaml: <err> — ignoring` → YAML is invalid. The daemon pins the bad hash and waits for the next valid save. Run the file through `yq` or `python -c "import yaml,sys; yaml.safe_load(open(sys.argv[1]))" cluster.yaml` to diagnose.
- `manual-edit: cluster.yaml changed externally — replicating via master` followed by `manual-edit: forward to master: no quorum` → cluster has no quorum, can't accept the edit. Restore quorum first.
- No log line at all → the on-disk content didn't change in a way that matters. The watcher compares only `peers`, `checks`, and `alerts`; whitespace and comment edits are accepted silently.
## Two nodes disagree on config `ver`
The follower with the lower version should pull within one heartbeat. If after ~5 seconds the gap persists:
- The follower might not have an `advertise` address for the higher-versioned peer. The version observer needs one to pull. Check `cluster.yaml.peers` for both sides' `advertise` fields.
- The follower's TLS handshake against the higher-versioned peer is failing — look for `replicate: pull from <id>: <err>` lines.
- The peer with the higher version is announcing it correctly but the follower is rejecting the `ApplyClusterCfg` broadcasts because of its own decode error — look for transport-layer errors instead.
## "needs ≥2 live to mutate" rejection during bootstrap
You ran two `qu node add` commands back-to-back and the second one failed. The first add doesn't take effect until the new peer sends its first heartbeat (≤ 1 second); during that window the cluster has size 2 and quorum size 2, so a second peer add from a 1-live cluster looks like "mutate without quorum."
Fix: pause ~3 seconds between adds. The README and the systemd guide both call this out.
## Daemon refuses to start
### `load node.yaml: open ...: no such file or directory`
`qu serve` normally auto-bootstraps a missing `node.yaml` using the `QUPTIME_*` env vars (see `configuration.md`). If you still see this error, the most likely causes are:
- The data directory is read-only or owned by a different user — the bootstrap can't write `node.yaml`. Fix permissions on `$QUPTIME_DIR`. The fastest fix on a standard install is just to re-run `install.sh` — it reasserts the canonical ownership and modes on the whole tree without touching your config.
- Something else removed `node.yaml` mid-run (a config-management tool, a misconfigured volume). Re-run `qu serve` and it will rebuild from env, or run `qu init` manually with the flags you want.
### ``node.yaml has empty node_id — run `qu init` first``
`node.yaml` exists but lacks a `node_id`. Either delete the file and let auto-init regenerate it, or run `qu init` against a wiped data dir.
### `listen tcp :9901: bind: address already in use`
Another process owns the port. Run `ss -tlnp | grep :9901` to find it.
### `load private key: ...`
Permissions on `keys/private.pem` are wrong — should be `0600` and owned by the daemon user. Fix and restart. Re-running `install.sh` on a standard install is the easiest path: it repairs ownership and modes on the entire data dir.
## Probes look much slower than expected
ICMP first:
- Default ICMP is unprivileged UDP-mode pings, not raw ICMP. UDP ping is a bit slower and may hit different kernel paths. For reference latency, grant `CAP_NET_RAW`.
HTTP / TCP:
- `interval` and `timeout` are the only knobs in `cluster.yaml`. The check is run synchronously per worker; if your target takes 9 s to respond and your timeout is 10 s, the next probe doesn't start until ~9 s have elapsed. Increase concurrency by adding more fast-interval checks against the same target, not by lowering the timeout (which will just produce false `down` results).
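The synchronous-worker behaviour means the effective probe period is bounded below by the target's response time. A small illustrative calculation (not `qu` code; the model assumes the worker waits for the response, capped at the timeout, before scheduling the next probe):

```python
def effective_period_s(interval_s: float, response_time_s: float,
                       timeout_s: float) -> float:
    """The worker is busy until the response arrives (or the timeout fires),
    so the real probe period is whichever is longer: interval or busy time."""
    busy = min(response_time_s, timeout_s)
    return max(interval_s, busy)

# 5 s interval, 9 s responses, 10 s timeout -> probes really fire every ~9 s.
print(effective_period_s(5, 9, 10))
```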
## I want to start over
```
sudo systemctl stop quptime
sudo rm -rf /etc/quptime
sudo -u quptime qu init --advertise <addr>
sudo systemctl start quptime
```
The data directory is the only state. Wipe it and you're back to a fresh node.
Under Docker (or any env-driven deploy), the explicit `qu init` step isn't needed — wiping the data volume and restarting the container is enough; `qu serve` will re-bootstrap from the `QUPTIME_*` env vars.