Updated docs, readme, & changelog
Container image / image (push) Successful in 1m40s

2026-05-15 07:36:01 +00:00
parent ed25e9ed68
commit 1e2e382867
5 changed files with 69 additions and 6 deletions
+15 -5
@@ -6,12 +6,22 @@ this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
## [Unreleased]
### Changed
- **Master election cooldown (2 min).** A returning peer with a
  lower NodeID no longer reclaims master the instant it reappears.
  It must stay continuously live for `DefaultMasterCooldown`
  (2 minutes) before displacing the incumbent. Bootstrap and
  quorum-regained-from-empty still elect immediately; the cooldown
  only protects an active incumbent. Fixes #3: a self-monitoring
  master (TCP check on its own `:9901`) would otherwise flap the
  role in lock-step with its own restart.
### Fixed
- #1 Previously up services are alerted as going back up if the master goes down.
  Ignore `unknown` -> `up` transitions during master election; still
  alert on `unknown` -> `down` by design.
## [v0.0.2] — 2026-05-15
+5 -1
@@ -94,7 +94,11 @@ the hysteresis that absorbs network blips.
Master election is deterministic: among the live members of the quorum,
the node with the lexicographically smallest NodeID wins. No
negotiation, no split-brain window. A 2-minute **master cooldown**
keeps the current master in place until a returning lower-NodeID peer
has been continuously live for the full window, so a self-monitoring
master that briefly drops doesn't flap the role back the instant it
reappears.
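The deterministic rule can be sketched in a few lines. This is an illustrative toy, not the project's actual code; the function and type names here are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
)

// electMaster sketches the rule described above: among the live
// members of the quorum, the lexicographically smallest NodeID wins.
// Returns false when no peer is live (no quorum, no master).
func electMaster(live []string) (string, bool) {
	if len(live) == 0 {
		return "", false
	}
	ids := append([]string(nil), live...) // don't mutate the caller's slice
	sort.Strings(ids)
	return ids[0], true
}

func main() {
	master, ok := electMaster([]string{"node-b", "node-c", "node-a"})
	fmt.Println(master, ok) // node-a true: same input, same winner, every time
}
```

Because every node sorts the same live set the same way, each arrives at the same answer independently, which is what removes the negotiation step.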
`cluster.yaml` is the single replicated source of truth (peers, checks,
alerts). Mutations from the CLI route through the master, which bumps a
+29
@@ -118,6 +118,35 @@ The `term` integer in `qu status` is bumped every time the elected
master changes (including transitions to and from "no master"). Use it
to spot flappy clusters.
### Master cooldown
The bare "lowest-live-NodeID wins" rule has one unpleasant edge: if the
primary master is also being monitored by `qu` itself (a TCP check on
its own `:9901`, say), a brief restart causes a master flap *and* a
state flap in lock-step. The new master sees the old master come back
on the next tick and immediately hands the role back, taking the
just-recovering node from `unknown` to `up` with no quiet period.
To absorb that, the quorum manager applies a **master cooldown**
(`DefaultMasterCooldown`, 2 minutes) before a peer with a lower NodeID
may displace the incumbent. The rules:
- The cooldown timer starts on the **first heartbeat after a
dead-after gap** — i.e. when a peer re-enters the live set after
having aged out. Continuous heartbeats never restart it.
- A flap during the cooldown resets the timer; the returning peer
must clear a full fresh window before taking over.
- The cooldown applies **only when an incumbent master exists**.
Bootstrap and quorum-regained-from-empty elect the lowest-NodeID
live peer immediately, because there is no role to protect.
- If the incumbent drops out of the live set, the cooldown is
irrelevant — any live peer may take over without waiting.
The constant lives in `internal/quorum/manager.go`. Lower it for
faster fail-back at the cost of monitoring-self flap risk; raise it
to give a recovering master longer to settle before reclaiming the
role.
## Catch-up when a node reconnects
This is the scenario most people ask about: node C is offline, the
+1
@@ -183,6 +183,7 @@ Options:
| `quorum` | `true` | `false` — no mutations, no alerts. |
| `master` | a NodeID | `(none — ...)` — quorum lost or election in flight. |
| `term` | slow growth | rapid growth → master flapping, network unstable. |
| `master` after a restart of the primary | unchanged for ~2 min, then bumps back | bumps back immediately → cooldown disabled or misconfigured. |
| `config ver` | identical across nodes | divergence → a node is stuck pulling. |
A simple cron sentinel on each node: A simple cron sentinel on each node:
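(The sentinel itself is outside this hunk. A minimal sketch of the shape such a script could take, assuming — hypothetically — that `qu status` prints a line like `quorum: true`; adjust the pattern to the real output:)

```shell
#!/bin/sh
# Hypothetical sentinel sketch: exit non-zero when quorum is lost so
# cron (via MAILTO or a wrapper) can raise the alarm. The expected
# output line "quorum: true" is an assumption, not the documented format.
status_ok() {
    printf '%s\n' "$1" | grep -q '^quorum: true'
}

# Under cron you would feed the live output, e.g.:
#   status_ok "$(qu status)" || notify-somehow "quorum lost on $(hostname)"
if status_ok "quorum: true
master: node-a
term: 7"; then
    echo "quorum healthy"
else
    echo "quorum LOST" >&2
    exit 1
fi
```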
+19
@@ -35,6 +35,25 @@ flapping. Causes:
- Heartbeat timeouts (default 4s) are too tight for your inter-node
  link. Rebuild with a higher `DefaultDeadAfter` if you need it.
## Primary master came back but the cluster hasn't switched to it
**What it means.** Working as designed. After a returning peer with a
lower NodeID rejoins, the quorum manager waits
`DefaultMasterCooldown` (2 minutes) before letting it displace the
incumbent. The window prevents a self-monitoring master from flapping
the role in lock-step with its own restart.
How to confirm:
- `qu status` on every node shows the same (current) master and a
steady `term` — not flapping. The lower-NodeID peer is in the live
set but not yet master.
- After ~2 minutes of continuous liveness, `term` bumps once and the
master switches to the lower-NodeID peer.
If you need a different window, change `DefaultMasterCooldown` in
`internal/quorum/manager.go` and rebuild.
## A check is stuck in `unknown`
**What it means.** The aggregator has no fresh reports for that check.