Updated docs, readme, & changelog

2026-05-15 07:36:01 +00:00
parent ed25e9ed68
commit 1e2e382867
5 changed files with 69 additions and 6 deletions
@@ -118,6 +118,35 @@ The `term` integer in `qu status` is bumped every time the elected
 master changes (including transitions to and from "no master"). Use it
 to spot flappy clusters.

+### Master cooldown
+
+The bare "lowest-live-NodeID wins" rule has one unpleasant edge: if the
+primary master is also being monitored by `qu` itself (a TCP check on
+its own `:9901`, say), a brief restart causes a master flap *and* a
+state flap in lock-step. The new master sees the old master come back
+on the next tick and immediately hands the role back, taking the
+just-recovering node from `unknown` to `up` with no quiet period.
+
+To absorb that, the quorum manager applies a **master cooldown**
+(`DefaultMasterCooldown`, 2 minutes) before a peer with a lower NodeID
+may displace the incumbent. The rules:
+
+- The cooldown timer starts on the **first heartbeat after a
+  dead-after gap** — i.e. when a peer re-enters the live set after
+  having aged out. Continuous heartbeats never restart it.
+- A flap during the cooldown resets the timer; the returning peer
+  must clear a full fresh window before taking over.
+- The cooldown applies **only when an incumbent master exists**.
+  Bootstrap and quorum-regained-from-empty elect the lowest-NodeID
+  live peer immediately, because there is no role to protect.
+- If the incumbent drops out of the live set, the cooldown is
+  irrelevant — any live peer may take over without waiting.
+
+The constant lives in `internal/quorum/manager.go`. Lower it for
+faster fail-back at the cost of a higher risk of self-monitoring
+flaps; raise it to give a recovering master longer to settle before
+it reclaims the role.
+
 ## Catch-up when a node reconnects

 This is the scenario most people ask about: node C is offline, the
@@ -183,6 +183,7 @@ Options:
 | `quorum`       | `true`         | `false` — no mutations, no alerts.                        |
 | `master`       | a NodeID       | `(none — ...)` — quorum lost or election in flight.       |
 | `term`         | slow growth    | rapid growth → master flapping, network unstable.         |
+| `master` after a restart of the primary | unchanged for ~2 min, then switches back | switches back immediately → cooldown disabled or misconfigured. |
 | `config ver`   | identical across nodes | divergence → a node is stuck pulling.             |

 A simple cron sentinel on each node:
@@ -35,6 +35,25 @@ flapping. Causes:
 - Heartbeat timeouts (default 4s) are too tight for your inter-node
  link. Rebuild with a higher `DefaultDeadAfter` if you need it.

+## Primary master came back but the cluster hasn't switched to it
+
+**What it means.** Working as designed. When a peer with a lower
+NodeID rejoins the live set, the quorum manager waits
+`DefaultMasterCooldown` (2 minutes) before letting it displace the
+incumbent. The window prevents a self-monitoring master from flapping
+the role in lock-step with its own restart.
+
+How to confirm:
+
+- `qu status` on every node shows the same (current) master and a
+  steady `term` — not flapping. The lower-NodeID peer is in the live
+  set but not yet master.
+- After ~2 minutes of continuous liveness, `term` bumps once and the
+  master switches to the lower-NodeID peer.
+
+If you need a different window, change `DefaultMasterCooldown` in
+`internal/quorum/manager.go` and rebuild.
+
 ## A check is stuck in `unknown`

 **What it means.** The aggregator has no fresh reports for that check.