qu — quorum-based uptime monitor
qu is a small Linux daemon that watches HTTP, TCP, and ICMP endpoints
from several cooperating nodes. The nodes form a quorum cluster; one is
elected master and owns alert dispatch. A check is only reported as
DOWN when the majority of nodes agree, which keeps a single node's
flaky uplink from paging anyone at 3am.
A single static binary contains the daemon, the CLI, and everything in between. Inter-node traffic is mutual TLS with SSH-style fingerprint trust — no central CA, no shared secret.
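What "SSH-style fingerprint trust" means in practice: instead of validating a CA chain, each side hashes the peer's public key (SPKI) and compares it against the fingerprint it already has on record, much like an SSH known_hosts check. The Go below is only a sketch of that technique; pinnedConfig, wantFP, and the package name are made up for illustration and are not qu's actual transport code.

package pinning

import (
    "crypto/sha256"
    "crypto/tls"
    "crypto/x509"
    "encoding/hex"
    "errors"
)

// pinnedConfig presents our own cert (mTLS) and, instead of walking a CA
// chain, accepts the peer only if the SHA-256 of its SubjectPublicKeyInfo
// matches the fingerprint we have on record.
func pinnedConfig(self tls.Certificate, wantFP string) *tls.Config {
    return &tls.Config{
        Certificates:       []tls.Certificate{self},
        InsecureSkipVerify: true, // chain verification off; we pin instead
        VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
            if len(rawCerts) == 0 {
                return errors.New("peer presented no certificate")
            }
            leaf, err := x509.ParseCertificate(rawCerts[0])
            if err != nil {
                return err
            }
            sum := sha256.Sum256(leaf.RawSubjectPublicKeyInfo)
            if hex.EncodeToString(sum[:]) != wantFP {
                return errors.New("peer fingerprint not in trust store")
            }
            return nil
        },
    }
}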
Why
Most uptime monitors are either a SaaS or a single box that, by
definition, can't tell you when it's the one that's down. qu solves
both: run it on a few cheap hosts in different networks and they vote
on truth. If one of them loses its uplink, the rest keep alerting.
Architecture
+-------------- node A ---------------+
| qu serve                            |
| ├─ transport server (mTLS :9001)    |
| ├─ quorum manager (heartbeats)      |
| ├─ replicator (cluster.yaml)        |
| ├─ scheduler (HTTP/TCP/ICMP)        | <─── probes targets
| ├─ aggregator (master-only)         |
| ├─ alerts (master-only)             |
| └─ control socket (unix, for CLI)   |
+-------------------------------------+
                  │   ▲     mTLS, pinned by fingerprint
                  ▼   │
            node B          node C          …
Every node runs every probe. Results are shipped to the elected master, which folds them into a per-check sliding window. A state flips (UP↔DOWN) only after two consecutive aggregate evaluations agree — that's the hysteresis that absorbs network blips.
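The flip rule is small enough to sketch. The snippet below is illustrative only, with hypothetical types rather than qu's real aggregator: the aggregate verdict becomes the reported state only once it has been seen on two consecutive evaluations.

package aggregator

// State is a check's reported health.
type State string

const (
    UP   State = "UP"
    DOWN State = "DOWN"
)

// checkState holds the reported state plus the latest aggregate verdict.
// The zero value starts "unknown"; the first verdict seen twice in a row wins.
type checkState struct {
    reported State // what status output and alerts currently show
    pending  State // most recent aggregate (majority) verdict
    repeats  int   // consecutive evaluations that produced pending
}

// observe feeds one aggregate evaluation and reports whether the state flipped.
func (c *checkState) observe(verdict State) bool {
    if verdict == c.pending {
        c.repeats++
    } else {
        c.pending, c.repeats = verdict, 1
    }
    if c.repeats >= 2 && c.pending != c.reported {
        c.reported = c.pending // flip only after two agreeing evaluations
        return true
    }
    return false
}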
Master election is deterministic: among the live members of the quorum, the node with the lexicographically smallest NodeID wins. No negotiation, no split-brain window.
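In code, that rule is just a filter and a sort; a rough sketch with hypothetical types:

package quorum

import "sort"

// electMaster applies the rule above: among live members, the node with the
// lexicographically smallest NodeID is master. Every node computes the same
// answer from its own liveness view, so no extra messages are exchanged.
func electMaster(live map[string]bool) string {
    ids := make([]string, 0, len(live))
    for id, alive := range live {
        if alive {
            ids = append(ids, id)
        }
    }
    if len(ids) == 0 {
        return "" // no live members: no master
    }
    sort.Strings(ids)
    return ids[0]
}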
Build
Requires Go 1.23 or newer.
go build -o qu ./cmd/qu
Set up a 3-node cluster
On each host:
# 1. Generate identity + RSA-3072 keypair + self-signed cert.
qu init --advertise <this-host's reachable address>:9001
# 2. Start the daemon (foreground; wire it into systemd for prod).
qu serve
Pick one node and tell it about the other two. The CLI prints the remote fingerprint and asks for confirmation, SSH-style:
qu node add bravo.example.com:9001
qu node add charlie.example.com:9001
That's it — the master broadcasts the new cluster config to every
trusting peer. qu status from any node should now show all three:
node        a7f3...
term        2
master      a7f3...
quorum      true (need 2)
config ver  4
PEERS
NODE_ID   ADVERTISE                 LIVE   LAST_SEEN
a7f3...   alpha.example.com:9001      true   2026-05-12T15:01:32Z
b21c...   bravo.example.com:9001      true   2026-05-12T15:01:32Z
c0d4...   charlie.example.com:9001    true   2026-05-12T15:01:32Z
Adding checks and alerts
# alerts first so checks can reference them
qu alert add discord oncall --webhook https://discord.com/api/webhooks/...
qu alert add smtp ops --host smtp.example.com --port 587 \
--from monitor@example.com --to ops@example.com \
--user mailbot --password '****' --starttls=true
# checks
qu check add http homepage https://example.com --expect 200 --alerts oncall,ops
qu check add tcp db db.internal:5432 --interval 15s
qu check add icmp gateway 10.0.0.1 --interval 5s
Mutations always route to the master, which bumps a monotonic version
and pushes the new cluster.yaml to every peer. If quorum is lost,
mutating commands fail loudly.
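The version gate is what makes a delayed or duplicate broadcast harmless; in spirit it is just a compare before apply. A sketch with hypothetical types, not the real replicator:

package replicate

import (
    "fmt"
    "sync"
)

// ClusterConfig stands in for the replicated cluster.yaml contents.
type ClusterConfig struct {
    Version int
    // peers, checks, alerts ...
}

type store struct {
    mu      sync.Mutex
    current ClusterConfig
}

// apply installs a config pushed by the master, but only if it is strictly
// newer than what this node already holds, so a stale or replayed broadcast
// can never roll the node backwards.
func (s *store) apply(incoming ClusterConfig) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if incoming.Version <= s.current.Version {
        return fmt.Errorf("stale config: have v%d, got v%d",
            s.current.Version, incoming.Version)
    }
    s.current = incoming // the real daemon would also persist cluster.yaml here
    return nil
}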
Test an alert without waiting for a real outage
qu alert test oncall
File layout
A node's state lives under $QUPTIME_DIR (defaults to /etc/quptime
when root, ~/.config/quptime otherwise):
node.yaml      identity (NodeID, bind addr, port). Never replicated.
cluster.yaml   replicated state: peers, checks, alerts, version.
trust.yaml     local fingerprint trust store.
keys/          RSA private + public + self-signed cert.
The CLI talks to the local daemon over a unix socket at
$QUPTIME_SOCKET (defaults to /var/run/quptime/quptime.sock when
root, $XDG_RUNTIME_DIR/quptime/quptime.sock otherwise) — filesystem
permissions guard it; no TLS on the local socket.
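For a scratch node (say, a local experiment that should not touch /etc/quptime), both variables can be pointed somewhere disposable before running the usual commands. This assumes the daemon and CLI honor the variables exactly as described above; the paths are arbitrary examples:
# throwaway state dir + socket for a local test node
export QUPTIME_DIR=$HOME/qu-lab
export QUPTIME_SOCKET=$HOME/qu-lab/qu.sock
qu init --advertise 127.0.0.1:9001
qu serve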
ICMP and capabilities
ICMP checks default to unprivileged UDP-mode pings so the daemon does
not need root or CAP_NET_RAW. If you want classic raw ICMP, either
run the daemon as root or grant the capability:
sudo setcap cap_net_raw=+ep ./qu
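getcap, from the same libcap tools as setcap, will confirm the capability stuck:
getcap ./qu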
CLI reference
qu init                     generate identity + keys
qu serve                    run the daemon
qu status                   quorum, master, check states
qu node add <host:port>     TOFU-add a peer
qu node list                show peers + liveness
qu node remove <node-id>    remove from cluster + trust
qu check add http <name> <url> [--expect 200] [--interval 30s] [--body-match str] [--alerts a,b]
qu check add tcp <name> <host:port>
qu check add icmp <name> <host>
qu check list
qu check remove <id-or-name>
qu alert add smtp <name>    --host … --port … --from … --to … [--user --password --starttls]
qu alert add discord <name> --webhook …
qu alert list / remove / test <id-or-name>
qu trust list / remove <node-id>
All --interval and --timeout flags accept Go duration syntax: 5s,
1m30s, 2h, etc.
Tests
go test ./...
go test -race ./...
Each internal package has unit tests; coverage hovers around 60–90 % on the meaningful packages. The transport tests bring up real mTLS listeners over loopback, which exercises the cert pinning end-to-end.
What's intentionally not here (v1)
- No web UI. The CLI is the only operator surface.
- No historical metrics or SLA reports — only the current aggregate state is kept in memory. Add SQLite later if you need graphs.
- No automatic key rotation. Re-init a node and re-trust if you need to roll its identity.
- No multi-tenant isolation. One cluster = one set of checks.
Layout
cmd/qu/               entry point
internal/config/      on-disk file layout, ClusterConfig, NodeConfig
internal/crypto/      RSA keypair + self-signed cert + SPKI fingerprints
internal/trust/       fingerprint trust store
internal/transport/   mTLS listener/dialer, framed JSON-RPC
internal/quorum/      heartbeats + deterministic master election
internal/replicate/   master-routed mutations, version-gated replication
internal/checks/      HTTP/TCP/ICMP probers, scheduler, aggregator
internal/alerts/      SMTP + Discord dispatchers, message rendering
internal/daemon/      glue: wires every component + control socket
internal/cli/         cobra commands, the user-facing surface