Architecture

This page is the long-form companion to the diagram in the top-level README. Read it if you need to reason about partitions, recovery, upgrade ordering, or the consistency guarantees of qu.

Components

A running qu serve is one process containing five long-lived goroutines plus the listeners:

| Component | Package | Role |
| --- | --- | --- |
| Transport | internal/transport | mTLS listener + dialer, length-prefixed JSON-RPC framing. |
| Quorum manager | internal/quorum | 1 Hz heartbeats, liveness tracking, deterministic master election. |
| Replicator | internal/replicate | Master-routed mutations, version-gated broadcast and pull. |
| Scheduler | internal/checks | One goroutine per check; runs HTTP/TCP/ICMP probes on each node. |
| Aggregator | internal/checks | Master-only. Folds per-node probe results into a cluster-wide verdict. |
| Alert dispatch | internal/alerts | Master-only. Renders templates and ships SMTP / Discord notifications. |
| Control socket | internal/daemon | Local-only unix socket; the CLI and TUI talk to the daemon through it. |

Every node runs every component. Whether the master-only ones actually do anything depends on the result of master election.

Trust and transport

Inter-node traffic is TLS 1.3 with mutual authentication. There is no central CA. Each node generates a self-signed RSA cert at qu init and the SPKI fingerprint of that cert is what other nodes pin against.
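
An SPKI fingerprint is conventionally a hash (typically SHA-256) of the certificate's SubjectPublicKeyInfo. A minimal sketch of computing such a pin from a peer's presented certificate, using illustrative helper names rather than the actual internal/transport API:

```go
package sketch

import (
	"crypto/sha256"
	"crypto/tls"
	"crypto/x509"
	"encoding/hex"
	"fmt"
)

// spkiFingerprint returns the SHA-256 hash of the certificate's
// SubjectPublicKeyInfo, hex-encoded. This is the kind of value a peer would
// pin in its trust store; the helper names here are illustrative only.
func spkiFingerprint(cert *x509.Certificate) string {
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	return hex.EncodeToString(sum[:])
}

// peerFingerprint extracts the pin value from a completed TLS handshake.
func peerFingerprint(state tls.ConnectionState) (string, error) {
	if len(state.PeerCertificates) == 0 {
		return "", fmt.Errorf("peer presented no certificate")
	}
	return spkiFingerprint(state.PeerCertificates[0]), nil
}
```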

Two layers gate access:

  1. The TLS layer accepts any client cert. This avoids a chicken-and-egg problem during bootstrap — a brand-new node has no entry in anyone's trust store yet, so a strict TLS check would refuse the very first handshake.
  2. The RPC dispatcher rejects every method except Join for callers whose presented fingerprint is not in trust.yaml. So an untrusted peer can knock on the door but cannot ask questions; a minimal sketch of this gate follows below.
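
The dispatcher rule fits in a single function. The sketch below uses illustrative names, not the actual transport API:

```go
package sketch

import "errors"

// errUntrusted is returned for any method other than Join when the caller's
// fingerprint is not in the trust store. Names are illustrative.
var errUntrusted = errors.New("fingerprint not trusted")

type trustStore interface {
	IsTrusted(fingerprint string) bool
}

// authorize applies the dispatcher rule described above: untrusted callers
// may only invoke Join; every other method requires a pinned fingerprint.
func authorize(ts trustStore, method, callerFingerprint string) error {
	if method == "Join" {
		return nil // Join is gated separately by the cluster secret
	}
	if !ts.IsTrusted(callerFingerprint) {
		return errUntrusted
	}
	return nil
}
```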

Join itself is gated by the cluster secret — a pre-shared base64 string generated at qu init on the first node. Without it, an attacker who can reach :9901 cannot enrol themselves into the cluster.

The local CLI talks to the daemon over a unix socket with 0600 permissions; filesystem ACLs are the only authentication and no TLS is used on that channel.
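
For illustration, creating a unix listener with those permissions is only a few lines in Go; the function name below is hypothetical:

```go
package sketch

import (
	"net"
	"os"
)

// listenControlSocket creates the local control socket with 0600 permissions,
// so filesystem ACLs are the only gate. The helper name is illustrative.
func listenControlSocket(path string) (net.Listener, error) {
	_ = os.Remove(path) // clear a stale socket left by a previous run
	ln, err := net.Listen("unix", path)
	if err != nil {
		return nil, err
	}
	if err := os.Chmod(path, 0o600); err != nil {
		ln.Close()
		return nil, err
	}
	return ln, nil
}
```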

The replicated state machine

cluster.yaml is the single replicated source of truth. It holds three editable lists — peers, checks, alerts — plus three server-controlled fields:

version: 7                 # monotonically increasing
updated_at: 2026-05-15T...
updated_by: <node-id>      # master that committed this version
peers:  [...]
checks: [...]
alerts: [...]

How mutations flow

  1. The CLI (or the manual-edit watcher; see below) issues a mutation on the local daemon's control socket.
  2. The daemon's replicator looks at the current quorum view:
    • If there is no quorum, the mutation fails loudly with no quorum: refusing mutation.
    • If this node is the master, apply locally and broadcast.
    • Otherwise, ship the mutation to the master via the ProposeMutation RPC and wait for the result.
  3. The master holds the cluster lock, applies the mutation, bumps version, writes cluster.yaml atomically, and broadcasts the new snapshot to every peer via ApplyClusterCfg.
  4. Each follower's Replace accepts the snapshot only if incoming.Version > local.Version. Older or equal versions are dropped silently.

The mutation kinds are enumerated in internal/transport/messages.go: add_check, remove_check, add_alert, remove_alert, add_peer, remove_peer, replace_config.
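
Condensed, the routing decision in steps 2 and 3 is a three-way switch. The sketch below uses illustrative types and callback names, not the actual internal/replicate API:

```go
package sketch

import (
	"context"
	"errors"
)

// Mutation mirrors the kinds listed above; the surrounding types are
// illustrative, not the actual internal/replicate definitions.
type Mutation struct {
	Kind    string // e.g. "add_check", "remove_peer", "replace_config"
	Payload []byte
}

type quorumView struct {
	HasQuorum bool
	IsMaster  bool
	MasterID  string
}

var errNoQuorum = errors.New("no quorum: refusing mutation")

// route applies the decision described in steps 2 and 3: fail without quorum,
// apply-and-broadcast on the master, otherwise forward to the master.
func route(ctx context.Context, view quorumView, m Mutation,
	applyAndBroadcast func(context.Context, Mutation) error,
	proposeToMaster func(context.Context, string, Mutation) error,
) error {
	switch {
	case !view.HasQuorum:
		return errNoQuorum
	case view.IsMaster:
		return applyAndBroadcast(ctx, m)
	default:
		return proposeToMaster(ctx, view.MasterID, m)
	}
}
```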

Manual edits to cluster.yaml

Operators can sudoedit /etc/quptime/cluster.yaml on any node. Every 2 seconds the daemon hashes the file. When the on-disk hash diverges from the last hash the daemon wrote, the new content is parsed and forwarded to the master as a replace_config mutation. So a hand-edit on a follower still ends up on the master, version-bumped, and broadcast everywhere.

If the parse fails (invalid YAML), the daemon logs and pins the bad hash so it doesn't loop. The operator's next valid save unblocks it.
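
A rough sketch of that watcher loop, assuming SHA-256 hashing and illustrative callback names; the real implementation may differ in detail:

```go
package sketch

import (
	"crypto/sha256"
	"os"
	"time"
)

// watchConfig polls the file every two seconds and calls propose when the
// on-disk hash diverges from the last hash the daemon wrote. A revision that
// fails to parse is pinned so the watcher does not loop on it; the next valid
// save unblocks it. All names are illustrative.
func watchConfig(path string, lastWritten func() [32]byte,
	parse func([]byte) error, propose func([]byte) error) {

	var badHash [32]byte
	var havePinned bool

	for range time.Tick(2 * time.Second) {
		raw, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		h := sha256.Sum256(raw)
		if h == lastWritten() || (havePinned && h == badHash) {
			continue // unchanged, or a known-bad edit already logged
		}
		if err := parse(raw); err != nil {
			badHash, havePinned = h, true // log and pin the bad hash
			continue
		}
		havePinned = false
		_ = propose(raw) // forwarded to the master as replace_config
	}
}
```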

Quorum and master election

Every node sends a heartbeat to every peer once per second. A peer is live if a heartbeat (sent or received) was observed within the last 4 seconds — wide enough that it takes more than three consecutive missed beats to age a peer out, so a one-tick blip does not unseat the master.

Quorum is met when len(live_peers) >= floor(N/2) + 1 where N is the total peer count in cluster.yaml. Below quorum, the cluster refuses every mutation; existing checks continue probing locally but no state transitions are committed (the master is the only one who aggregates, and there is no master).

Master election is deterministic with no negotiation step: among the live members, the master is the one with the lexicographically smallest NodeID. Every node that observes the same live set picks the same master — so there is no split-brain window even during a partial partition.
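
Both rules are small enough to show inline; a sketch with illustrative signatures:

```go
package sketch

import "sort"

// hasQuorum implements len(live_peers) >= floor(N/2) + 1, where total is the
// peer count in cluster.yaml.
func hasQuorum(live, total int) bool {
	return live >= total/2+1
}

// electMaster picks the lexicographically smallest NodeID among live members;
// every node that observes the same live set picks the same master.
func electMaster(liveIDs []string) (masterID string, ok bool) {
	if len(liveIDs) == 0 {
		return "", false
	}
	ids := append([]string(nil), liveIDs...)
	sort.Strings(ids)
	return ids[0], true
}
```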

The term integer in qu status is bumped every time the elected master changes (including transitions to and from "no master"). Use it to spot flappy clusters.

Master cooldown

The bare "lowest-live-NodeID wins" rule has one unpleasant edge: if the primary master is also being monitored by qu itself (a TCP check on its own :9901, say), a brief restart causes a master flap and a state flap in lock-step. The new master sees the old master come back on the next tick and immediately hands the role back, taking the just-recovering node from unknown to up with no quiet period.

To absorb that, the quorum manager applies a master cooldown (DefaultMasterCooldown, 2 minutes) before a peer with a lower NodeID may displace the incumbent. The rules:

  • The cooldown timer starts on the first heartbeat after a dead-after gap — i.e. when a peer re-enters the live set after having aged out. Continuous heartbeats never restart it.
  • A flap during the cooldown resets the timer; the returning peer must clear a full fresh window before taking over.
  • The cooldown applies only when an incumbent master exists. Bootstrap and quorum-regained-from-empty elect the lowest-NodeID live peer immediately, because there is no role to protect.
  • If the incumbent drops out of the live set, the cooldown is irrelevant — any live peer may take over without waiting.

The constant lives in internal/quorum/manager.go. Lower it for faster fail-back at the cost of monitoring-self flap risk; raise it to give a recovering master longer to settle before reclaiming the role.
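
A sketch of the displacement decision under these rules, with illustrative names; the real logic in internal/quorum/manager.go tracks more state than this:

```go
package sketch

import "time"

// DefaultMasterCooldown mirrors the constant described above.
const DefaultMasterCooldown = 2 * time.Minute

// shouldDisplace decides whether candidate may take the master role from the
// incumbent. rejoinedAt is when the candidate re-entered the live set after
// aging out; the zero value means it has been continuously live. Names are
// illustrative only.
func shouldDisplace(candidate, incumbent string, incumbentLive bool,
	rejoinedAt, now time.Time) bool {

	if !incumbentLive {
		return true // no role to protect: any live peer may take over
	}
	if candidate >= incumbent {
		return false // only a lower NodeID can displace the incumbent
	}
	if rejoinedAt.IsZero() {
		return true // continuous heartbeats are not subject to the cooldown
	}
	return now.Sub(rejoinedAt) >= DefaultMasterCooldown
}
```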

Catch-up when a node reconnects

This is the scenario most people ask about: node C is offline, the master commits config version 7, node C comes back online. What happens?

  1. Node C's tick loop fires heartbeats every second regardless of its previous state. There is no backoff, no give-up.
  2. Each heartbeat carries the sender's Version. Each response carries the responder's Version.
  3. The first time C sees a peer reporting a higher version than its own, the version-observer fires and calls replicator.PullFrom(peerID, addr).
  4. PullFrom does a GetClusterCfg RPC against that peer and feeds the snapshot through Replace, which writes cluster.yaml atomically and refreshes the on-disk hash so the manual-edit watcher doesn't re-fire.
  5. Within ~1 heartbeat C is byte-for-byte identical to the master.

The same path catches a stale node up when the partition heals on the minority side: the minority side cannot mutate, so when it rejoins its version is strictly lower and the pull fires.

There is one corner case worth knowing about: the pull only fires when peer_version > local_version. Two nodes at the same version with different content would silently diverge — but the design forbids that (only the master mutates, and the master is the only one bumping the version) unless somebody hand-edits cluster.yaml and also manually sets version:. Don't do that.
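
The two version gates are simple comparisons; a sketch with illustrative types:

```go
package sketch

// snapshot stands in for a pulled or broadcast copy of cluster.yaml;
// the type is illustrative, not the actual internal/replicate definition.
type snapshot struct {
	Version int
	Raw     []byte
}

// needsPull is the observer condition that triggers a PullFrom against the
// peer that advertised a higher version in its heartbeat.
func needsPull(localVersion, advertisedVersion int) bool {
	return advertisedVersion > localVersion
}

// replace installs an incoming snapshot only if it is strictly newer;
// older or equal versions are dropped silently, as described above.
func replace(current *snapshot, incoming snapshot) bool {
	if incoming.Version <= current.Version {
		return false
	}
	*current = incoming
	return true
}
```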

Why a check flips state

The aggregator runs on the master only. Followers' probe results are shipped to the master via the ReportResult RPC; the master's own probe results are submitted directly.

For each check, the aggregator keeps the latest result per node within a freshness window (3× the check interval, minimum 30s). On each incoming submission it counts OK vs not-OK across the fresh results:

  • 0 fresh reports → unknown
  • more OK than not-OK → up
  • more not-OK than OK → down
  • tie → up (the simplest tie is one node saying yes and one saying no; biasing toward up avoids false alerts when nodes disagree transiently).

A state flip is not committed immediately: hysteresis requires the candidate state to hold for two consecutive aggregate evaluations before the transition fires and the alert dispatcher is called. The threshold is the HysteresisCount constant in internal/checks/aggregator.go — change it there if you want a hair-trigger or a slower alert.
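
A sketch of the verdict rules with illustrative types and a stand-in freshness window parameter; the hysteresis bookkeeping around it is omitted:

```go
package sketch

import "time"

type checkState string

const (
	stateUnknown checkState = "unknown"
	stateUp      checkState = "up"
	stateDown    checkState = "down"
)

// HysteresisCount mirrors the constant described above: a candidate state must
// hold for this many consecutive evaluations before a transition is committed.
const HysteresisCount = 2

// nodeResult is the latest probe result kept per node; names are illustrative.
type nodeResult struct {
	OK bool
	At time.Time
}

// verdict folds the fresh per-node results into a cluster-wide state:
// no fresh reports is unknown, a not-OK majority is down, and everything
// else (an OK majority or a tie) is up.
func verdict(results map[string]nodeResult, now time.Time, freshFor time.Duration) checkState {
	var ok, bad int
	for _, r := range results {
		if now.Sub(r.At) > freshFor {
			continue // stale: outside the freshness window
		}
		if r.OK {
			ok++
		} else {
			bad++
		}
	}
	switch {
	case ok == 0 && bad == 0:
		return stateUnknown
	case bad > ok:
		return stateDown
	default:
		return stateUp
	}
}
```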

If the master changes, the new master starts the per-check state from unknown and rebuilds it as fresh results arrive. The first few seconds after a re-election can therefore show unknown even for checks that were up a moment ago.

What qu does not do

These omissions are intentional in v1 and useful to know up front:

  • No persistent history. Only the current aggregate state lives in memory. There are no graphs, no SLA reports. Add a sidecar (Prometheus exporter, SQLite logger) if you need them.
  • No automatic key rotation. Re-init a node and re-trust if you need to roll its identity. See security.md.
  • No multi-tenant isolation. One cluster = one set of checks = one alert tree.
  • No web UI. Operator surface is qu (CLI), qu tui, and direct edits to cluster.yaml.
  • No automatic peer eviction on prolonged downtime. A dead peer stays in cluster.yaml until an operator runs qu node remove, because that decision affects the quorum size and shouldn't happen silently.