# qu — quorum-based uptime monitor

`qu` is a small Linux daemon that watches HTTP, TCP, and ICMP endpoints
from several cooperating nodes. The nodes form a quorum cluster; one is
elected master and owns alert dispatch. A check is only reported as
**DOWN** when the majority of nodes agree, which keeps a single node's
flaky uplink from paging anyone at 3am.

A single static binary contains the daemon, the CLI, and everything in
between. Inter-node traffic is mutual TLS with SSH-style fingerprint
trust — no central CA, no shared secret.

## Why

Most uptime monitors are either a SaaS or a single box that, by
definition, can't tell you when it's the one that's down. `qu` solves
both: run it on a few cheap hosts in different networks and they vote
on truth. If one of them loses its uplink, the rest keep alerting.

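The vote needs a strict majority of the cluster. A minimal sketch of that arithmetic (the function name is illustrative, not from the codebase):

```go
package main

import "fmt"

// quorumSize returns the minimum number of agreeing nodes needed for a
// strict majority in a cluster of n nodes: floor(n/2) + 1.
func quorumSize(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("cluster of %d needs %d\n", n, quorumSize(n))
	}
}
```

This is why three nodes is the practical minimum: with two, losing either one loses quorum.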
## Architecture

```
+-------------- node A ---------------+
| qu serve                            |
| ├─ transport server (mTLS :9001)    |
| ├─ quorum manager (heartbeats)      |
| ├─ replicator (cluster.yaml)        |
| ├─ scheduler (HTTP/TCP/ICMP)        | <─── probes targets
| ├─ aggregator (master-only)         |
| ├─ alerts (master-only)             |
| └─ control socket (unix, for CLI)   |
+-------------------------------------+
          │  ▲    mTLS, pinned by fingerprint
          ▼  │
      node B        node C …
```

Every node runs every probe. Results are shipped to the elected master,
which folds them into a per-check sliding window. A state flips (UP↔DOWN)
only after **two consecutive aggregate evaluations** agree — that's
the hysteresis that absorbs network blips.

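That hysteresis fits in a tiny state machine. A sketch under assumed names (`checkState` and `observe` are illustrative, not the actual aggregator types):

```go
package main

import "fmt"

// checkState tracks the published UP/DOWN state for one check, plus the
// candidate state seen by the previous aggregate evaluation.
type checkState struct {
	published string // what we last reported: "UP" or "DOWN"
	pending   string // candidate from the previous evaluation, "" if none
}

// observe feeds in one aggregate evaluation result; the published state
// flips only after two consecutive evaluations agree on the new value.
func (c *checkState) observe(result string) (flipped bool) {
	if result == c.published {
		c.pending = "" // blip absorbed, drop the candidate
		return false
	}
	if c.pending == result {
		c.published = result // second consecutive agreement: flip
		c.pending = ""
		return true
	}
	c.pending = result // first disagreement: wait for confirmation
	return false
}

func main() {
	c := &checkState{published: "UP"}
	fmt.Println(c.observe("DOWN")) // false — first DOWN, hold
	fmt.Println(c.observe("UP"))   // false — blip absorbed
	fmt.Println(c.observe("DOWN")) // false — first DOWN again
	fmt.Println(c.observe("DOWN")) // true  — two in a row, flip to DOWN
}
```

A single bad evaluation therefore never pages anyone; it takes two in a row.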
Master election is deterministic: among the live members of the quorum,
the node with the lexicographically smallest NodeID wins. No
negotiation, no split-brain window.

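Because every node computes the same answer from the same membership view, no messages are exchanged to elect. A sketch of the rule (the function shape is illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// electMaster picks the lexicographically smallest NodeID among the live
// members; every node that shares the view derives the same master.
func electMaster(live []string) (string, bool) {
	if len(live) == 0 {
		return "", false
	}
	ids := append([]string(nil), live...) // don't mutate the caller's slice
	sort.Strings(ids)
	return ids[0], true
}

func main() {
	master, ok := electMaster([]string{"c0d4", "a7f3", "b21c"})
	fmt.Println(master, ok) // a7f3 true
}
```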
## Build

Requires Go 1.23 or newer.

```sh
go build -o qu ./cmd/qu
```

## Set up a 3-node cluster

On each host:

```sh
# 1. Generate identity + RSA-3072 keypair + self-signed cert.
qu init --advertise <this-host's reachable address>:9001

# 2. Start the daemon (foreground; wire it into systemd for prod).
qu serve
```

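For the systemd wiring, a minimal unit file might look like this — the install path, user, and unit name are assumptions, not something `qu` ships:

```ini
# /etc/systemd/system/qu.service — illustrative, adjust paths to taste
[Unit]
Description=qu quorum uptime monitor
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/qu serve
Restart=on-failure
Environment=QUPTIME_DIR=/etc/quptime

[Install]
WantedBy=multi-user.target
```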

Pick one node and tell it about the other two. The CLI prints the
remote fingerprint and asks for confirmation, SSH-style:

```sh
qu node add bravo.example.com:9001
qu node add charlie.example.com:9001
```


That's it — the master broadcasts the new cluster config to every
trusting peer. `qu status` from any node should now show all three:

```
node        a7f3...
term        2
master      a7f3...
quorum      true (need 2)
config ver  4

PEERS
NODE_ID  ADVERTISE                 LIVE  LAST_SEEN
a7f3...  alpha.example.com:9001    true  2026-05-12T15:01:32Z
b21c...  bravo.example.com:9001    true  2026-05-12T15:01:32Z
c0d4...  charlie.example.com:9001  true  2026-05-12T15:01:32Z
```

## Adding checks and alerts

```sh
# alerts first so checks can reference them
qu alert add discord oncall --webhook https://discord.com/api/webhooks/...
qu alert add smtp ops --host smtp.example.com --port 587 \
  --from monitor@example.com --to ops@example.com \
  --user mailbot --password '****' --starttls=true

# checks
qu check add http homepage https://example.com --expect 200 --alerts oncall,ops
qu check add tcp db db.internal:5432 --interval 15s
qu check add icmp gateway 10.0.0.1 --interval 5s
```

Mutations always route to the master, which bumps a monotonic version
and pushes the new `cluster.yaml` to every peer. If quorum is lost,
mutating commands fail loudly.

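The version gate keeps a stale push from clobbering newer state: a peer applies an incoming config only if its version is strictly newer than the one it holds. A sketch with illustrative types:

```go
package main

import "fmt"

// clusterConfig carries the monotonically increasing config version.
type clusterConfig struct {
	Version int
	// peers, checks, alerts …
}

type node struct{ current clusterConfig }

// accept applies an incoming config iff its version is strictly newer.
func (n *node) accept(incoming clusterConfig) bool {
	if incoming.Version <= n.current.Version {
		return false // stale or duplicate push: ignore it
	}
	n.current = incoming
	return true
}

func main() {
	n := &node{current: clusterConfig{Version: 4}}
	fmt.Println(n.accept(clusterConfig{Version: 4})) // false — duplicate
	fmt.Println(n.accept(clusterConfig{Version: 5})) // true  — applied
	fmt.Println(n.accept(clusterConfig{Version: 3})) // false — stale
}
```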
## Test an alert without waiting for a real outage

```sh
qu alert test oncall
```

## File layout

A node's state lives under `$QUPTIME_DIR` (defaults to `/etc/quptime`
when root, `~/.config/quptime` otherwise):

```
node.yaml     identity (NodeID, bind addr, port). Never replicated.
cluster.yaml  replicated state: peers, checks, alerts, version.
trust.yaml    local fingerprint trust store.
keys/         RSA private + public + self-signed cert.
```

The CLI talks to the local daemon over a unix socket at
`$QUPTIME_SOCKET` (defaults to `/var/run/quptime/quptime.sock` when
root, `$XDG_RUNTIME_DIR/quptime/quptime.sock` otherwise) — filesystem
permissions guard it; no TLS on the local socket.

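The request/reply pattern over that socket can be sketched end-to-end with the standard library. The echo handler and message shape below are illustrative, not `qu`'s actual control protocol:

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// roundTrip spins up a toy "daemon" on a unix socket, sends one command
// the way a CLI would, and returns the single-line reply.
func roundTrip(cmd string) (string, error) {
	sock := filepath.Join(os.TempDir(), "qu-demo.sock")
	os.Remove(sock)

	// Daemon side: filesystem permissions on sock are the only ACL,
	// which is fine precisely because traffic never leaves the host.
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		line, _ := bufio.NewReader(conn).ReadString('\n')
		fmt.Fprintf(conn, `{"ok":true,"cmd":%q}`+"\n", line[:len(line)-1])
	}()

	// CLI side: one request, one reply.
	conn, err := net.Dial("unix", sock)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	fmt.Fprintln(conn, cmd)
	return bufio.NewReader(conn).ReadString('\n')
}

func main() {
	reply, err := roundTrip("status")
	if err != nil {
		panic(err)
	}
	fmt.Print(reply) // {"ok":true,"cmd":"status"}
}
```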
## ICMP and capabilities

ICMP checks default to unprivileged UDP-mode pings so the daemon does
not need root or `CAP_NET_RAW`. If you want classic raw ICMP, either
run the daemon as root or grant the capability:

```sh
sudo setcap cap_net_raw=+ep ./qu
```

## CLI reference

```
qu init                    generate identity + keys
qu serve                   run the daemon
qu status                  quorum, master, check states
qu node add <host:port>    TOFU-add a peer
qu node list               show peers + liveness
qu node remove <node-id>   remove from cluster + trust
qu check add http <name> <url> [--expect 200] [--interval 30s] [--body-match str] [--alerts a,b]
qu check add tcp <name> <host:port>
qu check add icmp <name> <host>
qu check list
qu check remove <id-or-name>
qu alert add smtp <name> --host … --port … --from … --to … [--user --password --starttls]
qu alert add discord <name> --webhook …
qu alert list / remove / test <id-or-name>
qu trust list / remove <node-id>
```

All `--interval` and `--timeout` flags accept Go duration syntax: `5s`,
`1m30s`, `2h`, etc.

## Tests

```sh
go test ./...
go test -race ./...
```

Each internal package has unit tests; coverage hovers around 60–90%
on the meaningful packages. The transport tests bring up real mTLS
listeners over loopback, which exercises the cert pinning end-to-end.

## What's intentionally not here (v1)

- No web UI. The CLI is the only operator surface.
- No historical metrics or SLA reports — only the current aggregate
  state is kept in memory. Add SQLite later if you need graphs.
- No automatic key rotation. Re-init a node and re-trust it if you need
  to roll its identity.
- No multi-tenant isolation. One cluster = one set of checks.

## Repository layout

```
cmd/qu/              entry point
internal/config/     on-disk file layout, ClusterConfig, NodeConfig
internal/crypto/     RSA keypair + self-signed cert + SPKI fingerprints
internal/trust/      fingerprint trust store
internal/transport/  mTLS listener/dialer, framed JSON-RPC
internal/quorum/     heartbeats + deterministic master election
internal/replicate/  master-routed mutations, version-gated replication
internal/checks/     HTTP/TCP/ICMP probers, scheduler, aggregator
internal/alerts/     SMTP + Discord dispatchers, message rendering
internal/daemon/     glue: wires every component + control socket
internal/cli/        cobra commands, the user-facing surface
```