# qu — quorum-based uptime monitor

`qu` is a small Linux daemon that watches HTTP, TCP, and ICMP endpoints from several cooperating nodes. The nodes form a quorum cluster; one is elected master and owns alert dispatch. A check is only reported as **DOWN** when the majority of nodes agree, which keeps a single node's flaky uplink from paging anyone at 3am.

A single static binary contains the daemon, the CLI, and everything in between. Inter-node traffic is mutual TLS with SSH-style fingerprint trust — no central CA, no shared secret.

## Installation

### From pre-built binary

The canonical home is Gitea; the repo is push-mirrored to GitHub on every tag. Releases and multi-arch container images are published to both.

| Source          | Releases | Container image                |
| --------------- | -------- | ------------------------------ |
| Gitea (primary) |          | `git.cer.sh/axodouble/quptime` |
| GitHub (mirror) |          | `ghcr.io/axodouble/quptime`    |

One-step install — tries Gitea first, falls back to GitHub automatically:

```sh
curl -fsSL https://git.cer.sh/Axodouble/QUptime/raw/branch/master/install.sh | sudo bash
# or, via the GitHub mirror:
# curl -fsSL https://raw.githubusercontent.com/Axodouble/QUptime/master/install.sh | sudo bash
```

The script verifies the binary against the published `SHA256SUMS` before installing and refuses to proceed on a mismatch.

### From Docker

```sh
docker pull git.cer.sh/axodouble/quptime:latest
# or, via the GitHub mirror:
# docker pull ghcr.io/axodouble/quptime:latest
```

See [docs/deployment/docker.md](docs/deployment/docker.md) for compose recipes.

## Why

Most uptime monitors are either a SaaS or a single box that, by definition, can't tell you when it's the one that's down. `qu` avoids both failure modes: run it on a few cheap hosts in different networks and they vote on truth. If one of them loses its uplink, the rest keep alerting.

## Documentation

This README is the quick-start. For production use, the longer guides live under [`docs/`](docs/README.md):

| If you want to…                                       | Read                                                                      |
| ----------------------------------------------------- | ------------------------------------------------------------------------- |
| understand the consensus / replication model          | [docs/architecture.md](docs/architecture.md)                               |
| reference every field in `node.yaml` / `cluster.yaml` | [docs/configuration.md](docs/configuration.md)                              |
| deploy on Linux with systemd hardening                | [docs/deployment/systemd.md](docs/deployment/systemd.md)                    |
| deploy with Docker / docker-compose                   | [docs/deployment/docker.md](docs/deployment/docker.md)                      |
| deploy over Tailscale or WireGuard                    | [docs/deployment/tailscale.md](docs/deployment/tailscale.md)                |
| expose `qu` on the open internet safely               | [docs/deployment/public-internet.md](docs/deployment/public-internet.md)    |
| upgrade, back up, or recover from failures            | [docs/operations.md](docs/operations.md)                                    |
| understand the trust model and rotate identities      | [docs/security.md](docs/security.md)                                        |
| diagnose a misbehaving cluster                        | [docs/troubleshooting.md](docs/troubleshooting.md)                          |

## Architecture

```
+-------------- node A ---------------+
| qu serve                            |
| ├─ transport server (mTLS :9901)    |
| ├─ quorum manager (heartbeats)      |
| ├─ replicator (cluster.yaml)        |
| ├─ scheduler (HTTP/TCP/ICMP)        | <─── probes targets
| ├─ aggregator (master-only)         |
| ├─ alerts (master-only)             |
| └─ control socket (unix, for CLI)   |
+-------------------------------------+
         │  ▲   mTLS, pinned by fingerprint
         ▼  │
      node B     node C   …
```
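The `pinned by fingerprint` edge works like SSH host keys: each node trusts a peer by the fingerprint of its public key rather than by a CA. If you want to compare fingerprints out-of-band, a pipeline along these lines computes a SHA-256 digest over a certificate's SubjectPublicKeyInfo. The cert path under `keys/` and the exact digest encoding are assumptions here; the fingerprint printed by `qu node add` is the authoritative value.

```sh
# SHA-256 over the SPKI of a node's self-signed cert
# (the filename under $QUPTIME_DIR/keys/ is illustrative)
openssl x509 -in /etc/quptime/keys/cert.pem -pubkey -noout \
  | openssl pkey -pubin -outform DER \
  | openssl dgst -sha256
```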
Every node runs every probe. Results are shipped to the elected master, which folds them into a per-check sliding window. A state flips (UP↔DOWN) only after **two consecutive aggregate evaluations** agree — that's the hysteresis that absorbs network blips.

Master election is deterministic: among the live members of the quorum, the node with the lexicographically smallest NodeID wins. No negotiation, no split-brain window. A 2-minute **master cooldown** keeps the current master in place until a returning lower-NodeID peer has been continuously live for the full window, so a self-monitoring master that briefly drops doesn't flap the role back the instant it reappears.

`cluster.yaml` is the single replicated source of truth (peers, checks, alerts). Mutations from the CLI route through the master, which bumps a monotonic version and broadcasts the result. The same file is also watched on disk, so an operator can `sudoedit cluster.yaml` on any node and the daemon will replicate the edit cluster-wide.

## Build

Requires Go 1.24.2 or newer.

```sh
go build -o qu ./cmd/qu
```

To stamp the version into the binary:

```sh
go build -ldflags "-X main.version=v0.0.1" -o qu ./cmd/qu
qu --version
```

## Releases

Pushing a tag matching `v*` triggers `.gitea/workflows/release.yaml`, which runs the test suite, cross-compiles static Linux binaries for amd64 and arm64, and publishes them as a Gitea release with a `SHA256SUMS` file alongside.

```sh
git tag v0.0.1
git push --tags
```

## Set up a 3-node cluster

On the **first host**:

```sh
qu init --advertise alpha.example.com:9901
```

That prints a random cluster secret. Copy it.

On every **other host**, pass that secret via `--secret`:

```sh
qu init --advertise bravo.example.com:9901 --secret <secret>
qu init --advertise charlie.example.com:9901 --secret <secret>
```

Without the matching secret a node cannot join, so random hosts that can reach :9901 are safely ignored.

Start the daemon on every host (foreground; wire into systemd for prod):

```sh
qu serve
```

Then on one node — usually `alpha` — invite the others. The CLI prints each remote's fingerprint and asks for confirmation SSH-style:

```sh
qu node add bravo.example.com:9901
qu node add charlie.example.com:9901
```

After the first invite, give it a few seconds for heartbeats to bring the new peer into the live set before inviting the next one — otherwise the local node's "needs ≥2 live to mutate" check will reject the second add.

You only need to invite from one node. Peer certs ride along with the replicated `cluster.yaml`, so every peer auto-trusts every other peer without `N×(N-1)` invites.

That's it — the master broadcasts the new cluster config to every trusting peer. `qu status` from any node should now show all three:

```
node     a7f3...
term     2
master   a7f3...
quorum   true (need 2)
config   ver 4

PEERS
NODE_ID   ADVERTISE                 LIVE   LAST_SEEN
a7f3...   alpha.example.com:9901    true   2026-05-12T15:01:32Z
b21c...   bravo.example.com:9901    true   2026-05-12T15:01:32Z
c0d4...   charlie.example.com:9901  true   2026-05-12T15:01:32Z
```

## Adding checks and alerts

> ⚠️ **Alert credentials are replicated cluster-wide.** SMTP passwords
> and Discord webhook URLs live in `cluster.yaml`, which is mirrored to
> every node. Any node that can read its own data directory can read
> every alert secret. Treat compromising one node as compromising every
> alert credential, and restrict who can reach `$QUPTIME_DIR` on each
> host (the hardened systemd unit and the Docker image both default to
> `0700`/`0750`). See [docs/security.md](docs/security.md) for the full
> threat model.
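On a hand-rolled install that uses neither the hardened unit nor the container image, you can match those defaults yourself. A minimal sketch, assuming a root-owned daemon with state under `/etc/quptime`:

```sh
# lock the state directory and the replicated config down to root
sudo chmod 0700 /etc/quptime
sudo chmod 0600 /etc/quptime/cluster.yaml   # carries SMTP passwords and webhook URLs
```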
```sh
# alerts first so checks can reference them
qu alert add discord oncall --webhook https://discord.com/api/webhooks/...
qu alert add smtp ops --host smtp.example.com --port 587 \
  --from monitor@example.com --to ops@example.com \
  --user mailbot --password '****' --starttls=true

# checks
qu check add http homepage https://example.com --expect 200 --alerts oncall,ops
qu check add tcp db db.internal:5432 --interval 15s
qu check add icmp gateway 10.0.0.1 --interval 5s
```

Mutations always route to the master, which bumps a monotonic version and pushes the new `cluster.yaml` to every peer. If quorum is lost, mutating commands fail loudly.

`qu status` shows the effective alert list for each check. Default alerts are suffixed with `*` so you can tell at a glance which alerts were attached automatically vs explicitly listed on the check:

```
CHECKS
ID        NAME       STATE   OK/TOTAL   ALERTS        DETAIL
ddbd...   homepage   up      3/3        oncall,ops*
0006...   db         down    1/3        ops*          dial timeout
24f4...   gateway    up      3/3        -
(alerts marked * are attached as defaults)
```

## Default alerts (attach to every check)

Rather than listing the same `--alerts` on every `check add`, mark an alert as default and it fires for every check automatically:

```sh
# at creation
qu alert add discord oncall --webhook https://... --default

# or toggle later
qu alert default oncall on
qu alert default oncall off
```

`qu alert list` shows a DEFAULT column. A check can opt out of a specific default by adding the alert's ID or name to its `suppress_alert_ids` list in `cluster.yaml` (see "Edit cluster.yaml directly" below).

## Interactive TUI

Prefer a dashboard over typing commands? `qu tui` opens a full-screen [bubbletea](https://github.com/charmbracelet/bubbletea) UI over the local daemon socket. The header shows quorum, master, term, and config version; three tabs hold peers, checks, and alerts with auto-refresh every two seconds.

```
┌─ QUptime ── node: 88a00af9  master: 3438fd6f (follower)  ● quorum 3/2  term 4  ver 10 ──┐
│  Peers (3)  [2] Checks (3)  [3] Alerts (1)                                              │
├─────────────────────────────────────────────────────────────────────────────────────────┤
│  ID        NAME       STATE      OK/TOTAL   ALERTS    DETAIL                            │
│  ddbd...   homepage   ● up       3/3        oncall*                                     │
│  0006...   db         ● down     1/3        oncall*   dial timeout                      │
│  24f4...   gateway    ○ unknown  0/0        -                                           │
└─────────────────────────────────────────────────────────────────────────────────────────┘
  ↑↓ navigate  ⇥ next tab  1/2/3 jump  r refresh  a add check  d remove check  q quit
```

Keybindings:

| Key                 | Action                                                     |
| ------------------- | ---------------------------------------------------------- |
| `↑` / `↓`           | move cursor within a tab                                   |
| `Tab` / `Shift+Tab` | next / previous tab                                        |
| `1` / `2` / `3`     | jump to Peers / Checks / Alerts                            |
| `r`                 | force-refresh                                              |
| `a`                 | add (opens a picker on Checks/Alerts; node form on Peers)  |
| `d`                 | remove the selected row (confirmation prompt)              |
| `t`                 | send a test message to the selected alert                  |
| `D`                 | toggle the selected alert's `default` flag                 |
| `q` / `Ctrl+C`      | quit                                                       |

Forms run the same control-plane methods the CLI does, so any side effect (a mutation, a node add, an alert test) ends up routed through the master exactly like `qu …` from the shell.

## Custom alert messages

Each alert can carry its own `subject_template` and `body_template` (Go `text/template` syntax). When set, they override the built-in formatting for that one alert; the default renderer is used otherwise. Discord ignores the subject template (it has no subject line).

```sh
qu alert add discord oncall --webhook https://... \
  --body ':rotating_light: **{{.Check.Name}}** is now {{.Verb}}
target: `{{.Check.Target}}`
detail: {{.Snapshot.Detail}}'

# multi-line templates are easier from a file
qu alert add smtp ops --host ... --from ... --to ... \
  --subject-file /etc/quptime/templates/ops.subject \
  --body-file /etc/quptime/templates/ops.body
```
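On disk, these templates are just fields on the alert's entry in `cluster.yaml`. A rough sketch of the shape: `subject_template` and `body_template` are the field names used throughout this README, while the surrounding keys are assumptions for illustration.

```yaml
# illustrative entry; only the *_template field names are taken from this README
alerts:
  - name: oncall
    type: discord
    webhook: https://discord.com/api/webhooks/...
    body_template: |
      :rotating_light: **{{.Check.Name}}** is now {{.Verb}}
      detail: {{.Snapshot.Detail}}
```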
Available template variables:

| Variable                | Meaning                                      |
| ----------------------- | -------------------------------------------- |
| `{{.Check.Name}}`       | check name                                   |
| `{{.Check.Type}}`       | `http` / `tcp` / `icmp`                      |
| `{{.Check.Target}}`     | URL or host:port being probed                |
| `{{.Check.ID}}`         | UUID                                         |
| `{{.From}}`             | previous state (`up` / `down` / `unknown`)   |
| `{{.To}}`               | new state                                    |
| `{{.Verb}}`             | `UP` / `DOWN` / `RECOVERED`                  |
| `{{.VerbLower}}`        | lowercase form (`up` / `down` / `recovered`) |
| `{{.Snapshot.Reports}}` | total per-node reports counted               |
| `{{.Snapshot.OKCount}}` | how many reported OK                         |
| `{{.Snapshot.NotOK}}`   | how many reported failure                    |
| `{{.Snapshot.Detail}}`  | first failure detail string                  |
| `{{.NodeID}}`           | master that dispatched                       |
| `{{.When}}`             | RFC3339 timestamp                            |

The same variable list is surfaced in-app: `qu alert add smtp --help`, `qu alert add discord --help`, and `qu alert edit --help` each print it under their flag table, and `qu tui` shows a compact reminder of the supported variables as a hint when the cursor lands on a Subject or Body template field in the add/edit alert forms.

`qu alert test <name>` exercises the template against a synthetic "homepage going DOWN" transition, so you can verify rendering before production traffic depends on it. A template parse or execution error falls back to the built-in format and is logged.

### Conditionals, pipelines, and worked examples

Templates use Go's `text/template` syntax, so you have `if` / `else if` / `else` / `end`, comparison helpers (`eq`, `ne`, `lt`, `gt`), `printf` pipelines, and `with` blocks. The default rendering — the one used when no custom template is set — lives in `internal/alerts/message.go` inside the `Render` function; tweak it there if you want to change what every alert without an override produces.

A few progressively richer examples:

**1. State-specific Discord copy** — different tone for `DOWN`, `RECOVERED`, and first-time `UP`:

```yaml
body_template: |
  {{if eq .Verb "DOWN"}}:rotating_light: **{{.Check.Name}}** is DOWN
  We're investigating. Last detail: `{{.Snapshot.Detail}}`
  {{else if eq .Verb "RECOVERED"}}:white_check_mark: **{{.Check.Name}}** is back UP after a {{.From}} blip.
  {{else}}:information_source: **{{.Check.Name}}** is online ({{.VerbLower}}).{{end}}
```

**2. SMTP subject with severity prefix and run-length detail** — pipes `Verb` through `printf` for padding and only mentions the report count when it actually matters:

```yaml
subject_template: '[{{printf "%-9s" .Verb}}] {{.Check.Name}} — {{.Check.Target}}'
body_template: |
  Check:    {{.Check.Name}} ({{.Check.Type}})
  Target:   {{.Check.Target}}
  Status:   {{.Verb}} (was {{.From}})
  Reporter: {{.NodeID}}
  At:       {{.When}}
  {{if gt .Snapshot.Reports 1}}
  Quorum: {{.Snapshot.OKCount}} ok / {{.Snapshot.NotOK}} failing across {{.Snapshot.Reports}} reports.
  {{end}}{{with .Snapshot.Detail}}
  Detail: {{.}}
  {{end}}
```
**3. PagerDuty-style severity routing** — nest `if`/`else if` so a single template can produce three different first lines without duplicating the rest of the body:

```yaml
subject_template: >-
  {{if eq .Verb "DOWN"}}P1: {{.Check.Name}} hard down
  {{else if eq .Verb "RECOVERED"}}P3: {{.Check.Name}} recovered
  {{else}}P4: {{.Check.Name}} {{.VerbLower}}{{end}}
body_template: |
  {{/* Header line — uses .VerbLower so the prose reads naturally */}}
  {{.Check.Name}} ({{.Check.Target}}) is now {{.VerbLower}}.

  {{if eq .Verb "DOWN" -}}
  This is a real outage.
  Quorum: {{.Snapshot.NotOK}}/{{.Snapshot.Reports}} reporters see it failing.
  Detail from the first failing probe: {{.Snapshot.Detail}}
  Acknowledge in the runbook before paging on-call.
  {{- else if eq .Verb "RECOVERED" -}}
  Recovered after a {{.From}} period. No action needed; this is informational.
  {{- else -}}
  First successful probe after {{.From}}. Marking healthy.
  {{- end}}

  — {{.NodeID}} at {{.When}}
```

The `{{-` / `-}}` markers trim adjacent whitespace, which keeps the rendered output tidy even when the template itself is indented for readability.

If a template fails to parse or panics at execute time, the dispatcher falls back to the default `Render` output for that field and logs the error — your alert still ships, you just lose the custom formatting until you fix the template.

## Edit cluster.yaml directly

Anything you can do through the CLI you can also do by editing `$QUPTIME_DIR/cluster.yaml` on any node. The daemon polls the file every few seconds; when it sees a hash that differs from what it last wrote, it parses the YAML and forwards the change through the master, which bumps the version and broadcasts the result everywhere — so a hand-edit on `bravo` propagates to `alpha` and `charlie` automatically.

```sh
sudoedit /etc/quptime/cluster.yaml
# add `default: true` to an alert, or `suppress_alert_ids: [oncall]`
# on a check, then save and quit
```

You'll see a `manual-edit: cluster.yaml changed externally — replicating via master` line in the daemon log when it picks the change up. Invalid YAML is logged and ignored until you save a valid file.

The replicated fields are `peers`, `checks`, and `alerts`. `version`, `updated_at`, and `updated_by` are server-controlled — the master overwrites them on commit.

## Test an alert without waiting for a real outage

```sh
qu alert test oncall
```

## File layout

A node's state lives under `$QUPTIME_DIR` (defaults to `/etc/quptime` when root, `~/.config/quptime` otherwise):

```
node.yaml      identity (NodeID, bind addr, port). Never replicated.
cluster.yaml   replicated state: peers, checks, alerts, version.
trust.yaml     local fingerprint trust store.
keys/          RSA private + public + self-signed cert.
```

The CLI talks to the local daemon over a unix socket at `$QUPTIME_SOCKET` (defaults to `/var/run/quptime/quptime.sock` when root, `$XDG_RUNTIME_DIR/quptime/quptime.sock` otherwise) — filesystem permissions guard it; no TLS on the local socket.
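Both locations honor the environment, which makes it easy to run a throwaway instance alongside a real one. For example (paths illustrative):

```sh
# scratch node that never touches /etc/quptime or the system socket
export QUPTIME_DIR=$HOME/qu-lab
export QUPTIME_SOCKET=$HOME/qu-lab/quptime.sock
qu init --advertise localhost:9901
qu serve
```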
## ICMP and capabilities

ICMP checks default to unprivileged UDP-mode pings so the daemon does not need root or `CAP_NET_RAW`. If you want classic raw ICMP, either run the daemon as root or grant the capability:

```sh
sudo setcap cap_net_raw=+ep ./qu
```

## CLI reference

```
qu init                           generate identity + keys
qu serve                          run the daemon
qu status                         quorum, master, check states
qu tui                            interactive dashboard
qu node add <host:port>           TOFU-add a peer
qu node list                      show peers + liveness
qu node remove <id>               remove from cluster + trust
qu check add http <name> <url>    [--expect 200] [--interval 30s] [--body-match str] [--alerts a,b]
qu check add tcp <name> <host:port>
qu check add icmp <name> <host>
qu check list
qu check remove <id>
qu alert add smtp <name>          --host … --port … --from … --to …
                                  [--user … --password … --starttls] [--default] [--subject … --body …]
qu alert add discord <name>       --webhook … [--default] [--body …]
qu alert list / remove / test
qu alert default <name> on|off    toggle default attachment to every check
qu trust list / remove
```

All `--interval` and `--timeout` flags accept Go duration syntax: `5s`, `1m30s`, `2h`, etc.

## Tests

```sh
go test ./...
go test -race ./...
```

Each internal package has unit tests; coverage hovers around 60–90% on the meaningful packages. The transport tests bring up real mTLS listeners over loopback, which exercises the cert pinning end-to-end.

## What's intentionally not here (v1)

- No web UI. The CLI is the only operator surface.
- No historical metrics or SLA reports — only the current aggregate state is kept in memory. Add SQLite later if you need graphs.
- No automatic key rotation. Re-init a node and re-trust if you need to roll its identity.
- No multi-tenant isolation. One cluster = one set of checks.

## Layout

```
cmd/qu/              entry point
internal/config/     on-disk file layout, ClusterConfig, NodeConfig
internal/crypto/     RSA keypair + self-signed cert + SPKI fingerprints
internal/trust/      fingerprint trust store
internal/transport/  mTLS listener/dialer, framed JSON-RPC
internal/quorum/     heartbeats + deterministic master election
internal/replicate/  master-routed mutations, version-gated replication
internal/checks/     HTTP/TCP/ICMP probers, scheduler, aggregator
internal/alerts/     SMTP + Discord dispatchers, message rendering
internal/daemon/     glue: wires every component + control socket
internal/cli/        cobra commands, the user-facing surface
internal/tui/        bubbletea dashboard (qu tui)
```
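Each directory above is a regular Go package, so a single component can be exercised in isolation with the standard toolchain, e.g.:

```sh
go test ./internal/quorum/ -race -v
```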