# qu — quorum-based uptime monitor
`qu` is a small Linux daemon that watches HTTP, TCP, and ICMP endpoints
from several cooperating nodes. The nodes form a quorum cluster; one is
elected master and owns alert dispatch. A check is only reported as
**DOWN** when the majority of nodes agree, which keeps a single node's
flaky uplink from paging anyone at 3am.
A single static binary contains the daemon, the CLI, and everything in
between. Inter-node traffic is mutual TLS with SSH-style fingerprint
trust — no central CA, no shared secret.
## Why
Most uptime monitors are either a SaaS or a single box that, by
definition, can't tell you when it's the one that's down. `qu` solves
both: run it on a few cheap hosts in different networks and they vote
on truth. If one of them loses its uplink, the rest keep alerting.
## Architecture
```
+-------------- node A ---------------+
| qu serve                            |
| ├─ transport server (mTLS :9001)    |
| ├─ quorum manager (heartbeats)      |
| ├─ replicator (cluster.yaml)        |
| ├─ scheduler (HTTP/TCP/ICMP)        | <─── probes targets
| ├─ aggregator (master-only)         |
| ├─ alerts (master-only)             |
| └─ control socket (unix, for CLI)   |
+-------------------------------------+
           │  ▲   mTLS, pinned by fingerprint
           ▼  │
      node B      node C …
```
Every node runs every probe. Results are shipped to the elected master,
which folds them into a per-check sliding window. A check's state flips
(UP↔DOWN) only after **two consecutive aggregate evaluations** agree —
that's the hysteresis that absorbs network blips.
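The flip rule can be sketched as follows; the type and field names are illustrative, not qu's actual internals:

```go
package main

import "fmt"

// checkState applies two-evaluation hysteresis: the reported state flips
// only after two consecutive aggregate results agree on the new value.
type checkState struct {
	reported string // "UP" or "DOWN": what alerting sees
	pending  string // last aggregate result awaiting confirmation
}

func (s *checkState) observe(aggregate string) (flipped bool) {
	if aggregate == s.reported {
		s.pending = "" // agreement with current state resets the pending flip
		return false
	}
	if aggregate == s.pending {
		s.reported = aggregate // second consecutive disagreement: flip
		s.pending = ""
		return true
	}
	s.pending = aggregate // first disagreement: a blip, wait for confirmation
	return false
}

func main() {
	s := &checkState{reported: "UP"}
	fmt.Println(s.observe("DOWN")) // false: single blip absorbed
	fmt.Println(s.observe("DOWN")) // true: two in a row, flip to DOWN
}
```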
Master election is deterministic: among the live members of the quorum,
the node with the lexicographically smallest NodeID wins. No
negotiation, no split-brain window.
## Build
Requires Go 1.23 or newer.
```sh
go build -o qu ./cmd/qu
```
To stamp the version into the binary:
```sh
go build -ldflags "-X main.version=v0.1.0" -o qu ./cmd/qu
qu --version
```
## Releases
Pushing a tag matching `v*` triggers `.gitea/workflows/release.yaml`,
which runs the test suite, cross-compiles static Linux binaries for
amd64 and arm64, and publishes them as a Gitea release with a
`SHA256SUMS` file alongside.
```sh
git tag v0.1.0
git push --tags
```
## Set up a 3-node cluster
On each host:
```sh
# 1. Generate identity + RSA-3072 keypair + self-signed cert.
qu init --advertise <this host's reachable address>:9001
# 2. Start the daemon (foreground; wire it into systemd for prod).
qu serve
```
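For production, a minimal systemd unit might look like this; the paths and unit name are assumptions, adjust to taste:

```ini
# /etc/systemd/system/quptime.service — hypothetical unit, not shipped with qu
[Unit]
Description=qu quorum uptime monitor
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/qu serve
Restart=on-failure
# Uncomment for raw ICMP without root (see the ICMP section):
# AmbientCapabilities=CAP_NET_RAW

[Install]
WantedBy=multi-user.target
```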
Pick one node and tell it about the other two. The CLI prints the
remote fingerprint and asks for confirmation, SSH-style:
```sh
qu node add bravo.example.com:9001
qu node add charlie.example.com:9001
```
That's it — the master broadcasts the new cluster config to every
trusting peer. `qu status` from any node should now show all three:
```
node        a7f3...
term        2
master      a7f3...
quorum      true (need 2)
config ver  4

PEERS
NODE_ID   ADVERTISE                  LIVE   LAST_SEEN
a7f3...   alpha.example.com:9001     true   2026-05-12T15:01:32Z
b21c...   bravo.example.com:9001     true   2026-05-12T15:01:32Z
c0d4...   charlie.example.com:9001   true   2026-05-12T15:01:32Z
```
## Adding checks and alerts
```sh
# alerts first so checks can reference them
qu alert add discord oncall --webhook https://discord.com/api/webhooks/...
qu alert add smtp ops --host smtp.example.com --port 587 \
    --from monitor@example.com --to ops@example.com \
    --user mailbot --password '****' --starttls=true

# checks
qu check add http homepage https://example.com --expect 200 --alerts oncall,ops
qu check add tcp db db.internal:5432 --interval 15s
qu check add icmp gateway 10.0.0.1 --interval 5s
```
Mutations always route to the master, which bumps a monotonic version
and pushes the new `cluster.yaml` to every peer. If quorum is lost,
mutating commands fail loudly.
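The version gate on replication can be sketched like this; the struct is illustrative, not qu's actual `ClusterConfig`:

```go
package main

import "fmt"

// clusterConfig stands in for the replicated cluster.yaml state.
type clusterConfig struct {
	Version int
	// peers, checks, alerts …
}

// applyConfig accepts an incoming config only when its version is strictly
// newer than the local copy, so stale or duplicate pushes are dropped.
func applyConfig(local *clusterConfig, incoming clusterConfig) bool {
	if incoming.Version <= local.Version {
		return false
	}
	*local = incoming
	return true
}

func main() {
	local := &clusterConfig{Version: 4}
	fmt.Println(applyConfig(local, clusterConfig{Version: 3})) // false: stale
	fmt.Println(applyConfig(local, clusterConfig{Version: 5})) // true: applied
}
```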
## Test an alert without waiting for a real outage
```sh
qu alert test oncall
```
## File layout
A node's state lives under `$QUPTIME_DIR` (defaults to `/etc/quptime`
when root, `~/.config/quptime` otherwise):
```
node.yaml      identity (NodeID, bind addr, port). Never replicated.
cluster.yaml   replicated state: peers, checks, alerts, version.
trust.yaml     local fingerprint trust store.
keys/          RSA private + public + self-signed cert.
```
The CLI talks to the local daemon over a unix socket at
`$QUPTIME_SOCKET` (defaults to `/var/run/quptime/quptime.sock` when
root, `$XDG_RUNTIME_DIR/quptime/quptime.sock` otherwise) — filesystem
permissions guard it; no TLS on the local socket.
## ICMP and capabilities
ICMP checks default to unprivileged UDP-mode pings so the daemon does
not need root or `CAP_NET_RAW`. If you want classic raw ICMP, either
run the daemon as root or grant the capability:
```sh
sudo setcap cap_net_raw=+ep ./qu
```
## CLI reference
```
qu init                     generate identity + keys
qu serve                    run the daemon
qu status                   quorum, master, check states
qu node add <host:port>     TOFU-add a peer
qu node list                show peers + liveness
qu node remove <node-id>    remove from cluster + trust
qu check add http <name> <url> [--expect 200] [--interval 30s] [--body-match str] [--alerts a,b]
qu check add tcp <name> <host:port>
qu check add icmp <name> <host>
qu check list
qu check remove <id-or-name>
qu alert add smtp <name> --host … --port … --from … --to … [--user --password --starttls]
qu alert add discord <name> --webhook …
qu alert list / remove / test <id-or-name>
qu trust list / remove <node-id>
```
All `--interval` and `--timeout` flags accept Go duration syntax: `5s`,
`1m30s`, `2h`, etc.
## Tests
```sh
go test ./...
go test -race ./...
```
Each internal package has unit tests; coverage hovers between 60 and 90 %
on the meaningful packages. The transport tests bring up real mTLS
listeners over loopback, which exercises the cert pinning end-to-end.
## What's intentionally not here (v1)
- No web UI. The CLI is the only operator surface.
- No historical metrics or SLA reports — only the current aggregate
state is kept in memory. Add SQLite later if you need graphs.
- No automatic key rotation. Re-init a node and re-trust if you need
to roll its identity.
- No multi-tenant isolation. One cluster = one set of checks.
## Repository layout
```
cmd/qu/              entry point
internal/config/     on-disk file layout, ClusterConfig, NodeConfig
internal/crypto/     RSA keypair + self-signed cert + SPKI fingerprints
internal/trust/      fingerprint trust store
internal/transport/  mTLS listener/dialer, framed JSON-RPC
internal/quorum/     heartbeats + deterministic master election
internal/replicate/  master-routed mutations, version-gated replication
internal/checks/     HTTP/TCP/ICMP probers, scheduler, aggregator
internal/alerts/     SMTP + Discord dispatchers, message rendering
internal/daemon/     glue: wires every component + control socket
internal/cli/        cobra commands, the user-facing surface
```