# Architecture

This page is the long-form companion to the diagram in the top-level
README. Read it if you need to reason about partitions, recovery,
upgrade ordering, or the consistency guarantees of `qu`.

## Components

A running `qu serve` is one process containing five long-lived
goroutines plus the listeners:

| Component      | Package              | Role                                                                    |
| -------------- | -------------------- | ----------------------------------------------------------------------- |
| Transport      | `internal/transport` | mTLS listener + dialer, length-prefixed JSON-RPC framing.               |
| Quorum manager | `internal/quorum`    | 1 Hz heartbeats, liveness tracking, deterministic master election.      |
| Replicator     | `internal/replicate` | Master-routed mutations, version-gated broadcast and pull.              |
| Scheduler      | `internal/checks`    | One goroutine per check; runs HTTP/TCP/ICMP probes on each node.        |
| Aggregator     | `internal/checks`    | Master-only. Folds per-node probe results into a cluster-wide verdict.  |
| Alert dispatch | `internal/alerts`    | Master-only. Renders templates and ships SMTP / Discord notifications.  |
| Control socket | `internal/daemon`    | Local-only unix socket; the CLI and TUI talk to the daemon through it.  |

Every node runs every component. Whether the master-only ones actually
*do* anything depends on the result of master election.

## Trust and transport

Inter-node traffic is TLS 1.3 with mutual authentication. There is **no
central CA**. Each node generates a self-signed RSA cert at `qu init`,
and the SPKI fingerprint of that cert is what other nodes pin against.

Two layers gate access:

1. **TLS layer** accepts any client cert. This avoids a chicken-and-egg
   problem during bootstrap — a brand-new node has no entry in anyone's
   trust store yet, so a strict TLS check would refuse the very first
   handshake.
2. **RPC dispatcher** rejects every method except `Join` for callers
   whose presented fingerprint is not in `trust.yaml`. An untrusted
   peer can knock on the door but cannot ask questions (see the sketch
   below).

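A minimal sketch of that second gate, assuming hypothetical names
(`trustStore`, `dispatch`); the real dispatcher lives in
`internal/transport` and carries more context:

```go
package transport

import "fmt"

// trustStore maps pinned SPKI fingerprints to node IDs (hypothetical
// shape; the real store is loaded from trust.yaml).
type trustStore map[string]string

// dispatch is the second gate: anything gets past TLS, but only trusted
// fingerprints may call methods other than Join.
func dispatch(ts trustStore, fingerprint, method string) error {
	if _, trusted := ts[fingerprint]; !trusted && method != "Join" {
		return fmt.Errorf("untrusted peer %s: only Join is allowed", fingerprint)
	}
	return nil // hand off to the actual method handler
}
```
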
`Join` itself is gated by the **cluster secret** — a pre-shared base64
string generated at `qu init` on the first node. Without it, an
attacker who can reach `:9901` cannot enrol themselves into the
cluster.

The local CLI talks to the daemon over a unix socket with `0600`
permissions; filesystem ACLs are the only authentication and no TLS is
used on that channel.

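For illustration, a local client reaches the daemon the same way as any
unix-socket service; the socket path below is a placeholder, not the
documented one:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Placeholder path; the daemon decides where the socket actually lives.
	conn, err := net.Dial("unix", "/run/quptime/control.sock")
	if err != nil {
		// 0600 permissions also reject non-owners at this step.
		fmt.Println("daemon unreachable:", err)
		return
	}
	defer conn.Close()
	// The CLI and TUI speak their RPC protocol over conn from here.
}
```
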
## The replicated state machine

`cluster.yaml` is the single replicated source of truth. It holds three
editable lists — `peers`, `checks`, `alerts` — plus three
server-controlled fields:

```yaml
version: 7              # monotonically increasing
updated_at: 2026-05-15T...
updated_by: <node-id>   # master that committed this version
peers: [...]
checks: [...]
alerts: [...]
```

### How mutations flow

1. The CLI (or the manual-edit watcher; see below) issues a mutation
   on the local daemon's control socket.
2. The daemon's replicator looks at the current quorum view:
   - If there is no quorum, the mutation fails loudly with
     `no quorum: refusing mutation`.
   - If this node is the master, apply locally and broadcast.
   - Otherwise, ship the mutation to the master via the
     `ProposeMutation` RPC and wait for the result.
3. The master holds the cluster lock, applies the mutation, bumps
   `version`, writes `cluster.yaml` atomically, and broadcasts the new
   snapshot to every peer via `ApplyClusterCfg`.
4. Each follower's `Replace` accepts the snapshot **only if**
   `incoming.Version > local.Version`. Older or equal versions are
   dropped silently.

The mutation kinds are enumerated in `internal/transport/messages.go`:
`add_check`, `remove_check`, `add_alert`, `remove_alert`, `add_peer`,
`remove_peer`, `replace_config`.

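A compressed sketch of steps 2 and 4, with hypothetical names
(`quorumView`, `routeMutation`, `shouldReplace`); the real replicator
in `internal/replicate` is more involved:

```go
package replicate

import "errors"

// quorumView is a hypothetical, simplified view of what the replicator
// consults in step 2.
type quorumView struct {
	HasQuorum bool
	IsMaster  bool
	MasterID  string
}

var errNoQuorum = errors.New("no quorum: refusing mutation")

// routeMutation decides where a mutation goes (step 2).
func routeMutation(v quorumView, apply func() error, forward func(masterID string) error) error {
	switch {
	case !v.HasQuorum:
		return errNoQuorum // fail loudly
	case v.IsMaster:
		return apply() // apply locally, bump version, broadcast (step 3)
	default:
		return forward(v.MasterID) // ProposeMutation RPC, wait for result
	}
}

// shouldReplace is the follower-side version gate from step 4.
func shouldReplace(incomingVersion, localVersion int) bool {
	return incomingVersion > localVersion // older or equal: drop silently
}
```
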
### Manual edits to `cluster.yaml`

Operators can `sudoedit /etc/quptime/cluster.yaml` on any node. Every
2 seconds the daemon hashes the file. When the on-disk hash diverges
from the last hash the daemon wrote, the new content is parsed and
forwarded to the master as a `replace_config` mutation. So a hand-edit
on a follower still ends up on the master, version-bumped, and
broadcast everywhere.

If the parse fails (invalid YAML), the daemon logs and pins the bad
hash so it doesn't loop. The operator's next valid save unblocks it.

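A simplified sketch of that watcher loop; the hash algorithm and the
function names here are assumptions, not the daemon's actual code:

```go
package daemon

import (
	"crypto/sha256"
	"os"
	"time"
)

// watchConfig polls the file every 2 seconds. lastHash tracks the hash of
// whatever this daemon last wrote or accepted (the real daemon refreshes
// it on its own writes too); badHash pins a known-bad edit so an
// unparseable file is reported once, not every tick.
func watchConfig(path string, parse func([]byte) error, propose func([]byte)) {
	var lastHash, badHash [32]byte
	for range time.Tick(2 * time.Second) {
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		h := sha256.Sum256(data)
		if h == lastHash || h == badHash {
			continue // our own write, or an already-reported bad edit
		}
		if err := parse(data); err != nil {
			badHash = h // log and pin; the next valid save unblocks
			continue
		}
		lastHash = h
		propose(data) // forwarded to the master as replace_config
	}
}
```
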
## Quorum and master election

Every node sends a heartbeat to every peer once per second. A peer is
**live** if a heartbeat (sent or received) was observed within the
last 4 seconds — long enough to survive three consecutive missed
beats, so a one-tick blip does not unseat the master.

**Quorum** is met when `len(live_peers) >= floor(N/2) + 1` where `N`
is the total peer count in `cluster.yaml`. Below quorum, the cluster
refuses every mutation; existing checks continue probing locally but no
state transitions are committed (the master is the only one who
aggregates, and there is no master).

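In code form, the two predicates look roughly like this (names are
hypothetical; the real logic lives in `internal/quorum`):

```go
package quorum

import "time"

const liveWindow = 4 * time.Second // survives three missed 1 Hz beats

// live reports whether a peer's last observed heartbeat is recent enough.
func live(lastBeat, now time.Time) bool {
	return now.Sub(lastBeat) <= liveWindow
}

// hasQuorum implements len(live_peers) >= floor(N/2) + 1.
// For N=3 that is 2 live peers; for N=5 it is 3.
func hasQuorum(livePeers, totalPeers int) bool {
	return livePeers >= totalPeers/2+1
}
```
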
**Master election** is deterministic with no negotiation step: among
the live members, the master is the one with the lexicographically
smallest `NodeID`. Every node that observes the same live set picks the
same master — so there is no split-brain window even during a partial
partition.

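A sketch of the election rule, ignoring the cooldown described in the
next subsection (names hypothetical):

```go
package quorum

import "slices" // Go 1.21+

// electMaster picks the lexicographically smallest live NodeID, or ""
// when the live set is empty (no quorum, no master). The real manager
// also applies the master cooldown before letting a returning peer
// displace an incumbent.
func electMaster(liveIDs []string) string {
	if len(liveIDs) == 0 {
		return ""
	}
	return slices.Min(liveIDs)
}
```
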
The `term` integer in `qu status` is bumped every time the elected
master changes (including transitions to and from "no master"). Use it
to spot flappy clusters.

### Master cooldown

The bare "lowest-live-NodeID wins" rule has one unpleasant edge: if the
primary master is also being monitored by `qu` itself (a TCP check on
its own `:9901`, say), a brief restart causes a master flap *and* a
state flap in lock-step. The new master sees the old master come back
on the next tick and immediately hands the role back, taking the
just-recovering node from `unknown` to `up` with no quiet period.

To absorb that, the quorum manager applies a **master cooldown**
(`DefaultMasterCooldown`, 2 minutes) before a peer with a lower NodeID
may displace the incumbent. The rules (a code sketch follows the
list):

- The cooldown timer starts on the **first heartbeat after a
  dead-after gap** — i.e. when a peer re-enters the live set after
  having aged out. Continuous heartbeats never restart it.
- A flap during the cooldown resets the timer; the returning peer
  must clear a full fresh window before taking over.
- The cooldown applies **only when an incumbent master exists**.
  Bootstrap and quorum-regained-from-empty elect the lowest-NodeID
  live peer immediately, because there is no role to protect.
- If the incumbent drops out of the live set, the cooldown is
  irrelevant — any live peer may take over without waiting.

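Those rules, condensed into one predicate with hypothetical names; the
real manager in `internal/quorum/manager.go` tracks the timer
internally:

```go
package quorum

import "time"

// mayDisplace reports whether a candidate may take the master role.
// cooldownStart is the first heartbeat seen after the candidate aged
// out of the live set, reset by any further flap.
func mayDisplace(candidateID, incumbentID string, incumbentLive bool,
	cooldownStart, now time.Time, cooldown time.Duration) bool {
	if incumbentID == "" || !incumbentLive {
		return true // no role to protect, or the incumbent is gone
	}
	if candidateID >= incumbentID {
		return false // only a lower NodeID can claim the role at all
	}
	return now.Sub(cooldownStart) >= cooldown // must clear a full window
}
```
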
The constant lives in `internal/quorum/manager.go`. Lower it for
faster fail-back at the cost of the self-monitoring flap described
above; raise it to give a recovering master longer to settle before
reclaiming the role.

## Catch-up when a node reconnects

This is the scenario most people ask about: node C is offline, the
master commits config version 7, node C comes back online. What
happens? (Step 3's trigger is sketched in code after the list.)

1. Node C's tick loop fires heartbeats every second regardless of its
   previous state. There is no backoff, no give-up.
2. Each heartbeat carries the sender's `Version`. Each response carries
   the responder's `Version`.
3. The first time C sees a peer reporting a higher version than its
   own, the version-observer fires and calls
   `replicator.PullFrom(peerID, addr)`.
4. `PullFrom` does a `GetClusterCfg` RPC against that peer and feeds
   the snapshot through `Replace`, which writes `cluster.yaml`
   atomically and refreshes the on-disk hash so the manual-edit
   watcher doesn't re-fire.
5. Within ~1 heartbeat C is byte-for-byte identical to the master.

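The trigger from step 3, sketched with hypothetical names:

```go
package replicate

// onPeerVersion is the catch-up trigger: any peer reporting a higher
// version than ours fires a pull. pullFrom stands in for
// replicator.PullFrom.
func onPeerVersion(peerID, addr string, peerVersion, localVersion int,
	pullFrom func(peerID, addr string) error) error {
	if peerVersion <= localVersion {
		return nil // in sync or ahead; nothing to do
	}
	// GetClusterCfg RPC, then Replace: writes cluster.yaml atomically
	// and refreshes the on-disk hash (step 4).
	return pullFrom(peerID, addr)
}
```
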
The same path catches a stale node up when a partition heals: the
minority side cannot mutate while split, so when it rejoins its
version is at most the majority's, and the pull fires as soon as a
peer reports anything newer.

There is one corner case worth knowing about: the pull only fires when
`peer_version > local_version`. Two nodes at the same version with
different content would silently diverge — but the design forbids
that (only the master mutates, and the master is the only one bumping
the version) unless somebody hand-edits `cluster.yaml` and also
manually sets `version:`. Don't do that.

## Why a check flips state

The aggregator runs on the master only. Followers' probe results are
shipped to the master via the `ReportResult` RPC; the master's own
probe results are submitted directly.

For each check, the aggregator keeps the latest result per node within
a freshness window (3× the check interval, minimum 30s). On each
incoming submission it counts OK vs not-OK across the fresh results
(the fold is sketched in code after this list):

- 0 fresh reports → `unknown`
- more OK than not-OK → `up`
- more not-OK than OK → `down`
- tie → `up` (a tie means as many nodes say yes as say no; biasing
  toward `up` avoids false alerts when nodes disagree transiently).

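A sketch of that fold, assuming a hypothetical `verdict` helper; the
real aggregator also tracks per-node freshness:

```go
package checks

// verdict folds fresh per-node results into the cluster-wide state.
// Each entry is one node's latest fresh result; true means OK.
func verdict(freshResults []bool) string {
	if len(freshResults) == 0 {
		return "unknown"
	}
	ok := 0
	for _, r := range freshResults {
		if r {
			ok++
		}
	}
	if ok*2 >= len(freshResults) { // majority OK, or a tie, biases up
		return "up"
	}
	return "down"
}
```
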
A state flip is **not** committed immediately. Hysteresis requires the
candidate state to hold for **two consecutive aggregate evaluations**
before the state transition fires and the alert dispatcher is called.
The threshold is the `HysteresisCount` constant in
`internal/checks/aggregator.go` — change it there if you want a
hair-trigger or a slower alert.

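A sketch of the hysteresis, with hypothetical names; the real tracker
lives next to `HysteresisCount` in the aggregator:

```go
package checks

const hysteresisCount = 2 // mirrors HysteresisCount in aggregator.go

// stateTracker holds one check's committed state plus the candidate
// state that has not yet cleared the hysteresis.
type stateTracker struct {
	committed string // last committed state
	candidate string // observed but not yet committed
	streak    int    // consecutive evaluations at candidate
}

// observe feeds one aggregate evaluation in and reports whether the
// committed state flipped (i.e. the alert dispatcher would be called).
func (t *stateTracker) observe(state string) bool {
	if state == t.committed {
		t.candidate, t.streak = "", 0 // back to steady state
		return false
	}
	if state != t.candidate {
		t.candidate, t.streak = state, 1 // new candidate, start counting
		return false
	}
	t.streak++
	if t.streak >= hysteresisCount {
		t.committed, t.candidate, t.streak = state, "", 0
		return true // transition fires
	}
	return false
}
```
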
If the master changes, the new master starts the per-check state from
`unknown` and rebuilds it as fresh results arrive. The first few
seconds after a re-election can therefore show `unknown` even for
checks that were `up` a moment ago.

## What `qu` does *not* do

These omissions are intentional in v1 and useful to know up front:

- **No persistent history.** Only the current aggregate state lives in
  memory. There are no graphs, no SLA reports. Add a sidecar (Prometheus
  exporter, SQLite logger) if you need them.
- **No automatic key rotation.** Re-init a node and re-trust it if you
  need to roll its identity. See [security.md](security.md).
- **No multi-tenant isolation.** One cluster = one set of checks =
  one alert tree.
- **No web UI.** The operator surface is `qu` (CLI), `qu tui`, and direct
  edits to `cluster.yaml`.
- **No automatic peer eviction on prolonged downtime.** A dead peer
  stays in `cluster.yaml` until an operator runs `qu node remove`,
  because that decision affects the quorum size and shouldn't happen
  silently.