diff --git a/CHANGELOG.md b/CHANGELOG.md
index 055b761..18f423c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,12 +6,22 @@ this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 
 ## [Unreleased]
 
-### Fixed
+### Changed
 
-- #3 Master going up in the same window as service going up moves unknown -> online to ignore alert
-Added a cooldown to the master election process.
-- #1 Previously up services are alerted as going back up if the master goes down
-Ignore `unknown` -> `online` transitions during master election cooldown.
+- **Master election cooldown (2 min).** A returning peer with a
+  lower NodeID no longer reclaims master the instant it reappears.
+  It must stay continuously live for `DefaultMasterCooldown`
+  (2 minutes) before displacing the incumbent. Bootstrap and
+  quorum-regained-from-empty still elect immediately; the cooldown
+  only protects an active incumbent. Fixes #3: a self-monitoring
+  master (TCP check on its own `:9901`) would otherwise flap the
+  role in lock-step with its own restart.
+
+### Fixed
+
+- #1 Previously up services are alerted as going back up if the master goes down.
+  Ignore `unknown` -> `up` transitions during master election; still
+  alert on `unknown` -> `down` by design.
 
 ## [v0.0.2] — 2026-05-15
 
diff --git a/README.md b/README.md
index 6e7db78..0b4f4e7 100644
--- a/README.md
+++ b/README.md
@@ -94,7 +94,11 @@ the hysteresis that absorbs network blips.
 
 Master election is deterministic: among the live members of the
 quorum, the node with the lexicographically smallest NodeID wins. No
-negotiation, no split-brain window.
+negotiation, no split-brain window. A 2-minute **master cooldown**
+keeps the current master in place until a returning lower-NodeID peer
+has been continuously live for the full window, so a self-monitoring
+master that briefly drops doesn't flap the role back the instant it
+reappears.
 
 `cluster.yaml` is the single replicated source of truth (peers, checks, alerts).
 Mutations from the CLI route through the master, which bumps a
diff --git a/docs/architecture.md b/docs/architecture.md
index 84f6a5f..0414aa6 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -118,6 +118,35 @@ The `term` integer in `qu status` is bumped every time the elected
 master changes (including transitions to and from "no master"). Use it
 to spot flappy clusters.
 
+### Master cooldown
+
+The bare "lowest-live-NodeID wins" rule has one unpleasant edge: if the
+primary master is also being monitored by `qu` itself (a TCP check on
+its own `:9901`, say), a brief restart causes a master flap *and* a
+state flap in lock-step. The new master sees the old master come back
+on the next tick and immediately hands the role back, taking the
+just-recovering node from `unknown` to `up` with no quiet period.
+
+To absorb that, the quorum manager applies a **master cooldown**
+(`DefaultMasterCooldown`, 2 minutes) before a peer with a lower NodeID
+may displace the incumbent. The rules:
+
+- The cooldown timer starts on the **first heartbeat after a
+  dead-after gap** — i.e. when a peer re-enters the live set after
+  having aged out. Continuous heartbeats never restart it.
+- A flap during the cooldown resets the timer; the returning peer
+  must clear a full fresh window before taking over.
+- The cooldown applies **only when an incumbent master exists**.
+  Bootstrap and quorum-regained-from-empty elect the lowest-NodeID
+  live peer immediately, because there is no role to protect.
+- If the incumbent drops out of the live set, the cooldown is
+  irrelevant — any live peer may take over without waiting.
+
+The constant lives in `internal/quorum/manager.go`. Lower it for
+faster fail-back at the cost of monitoring-self flap risk; raise it
+to give a recovering master longer to settle before reclaiming the
+role.
+
 ## Catch-up when a node reconnects
 
 This is the scenario most people ask about: node C is offline, the
diff --git a/docs/operations.md b/docs/operations.md
index 185c4db..c57daec 100644
--- a/docs/operations.md
+++ b/docs/operations.md
@@ -183,6 +183,7 @@ Options:
 | `quorum` | `true` | `false` — no mutations, no alerts. |
 | `master` | a NodeID | `(none — ...)` — quorum lost or election in flight. |
 | `term` | slow growth | rapid growth → master flapping, network unstable. |
+| `master` after a restart of the primary | unchanged for ~2 min, then bumps back | bumps back immediately → cooldown disabled or misconfigured. |
 | `config ver` | identical across nodes | divergence → a node is stuck pulling. |
 
 A simple cron sentinel on each node:
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index f9025ab..e4a72cd 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -35,6 +35,25 @@ flapping. Causes:
 - Heartbeat timeouts (default 4s) are too tight for your inter-node
   link. Rebuild with a higher `DefaultDeadAfter` if you need it.
 
+## Primary master came back but the cluster hasn't switched to it
+
+**What it means.** Working as designed. After a returning peer with a
+lower NodeID rejoins, the quorum manager waits
+`DefaultMasterCooldown` (2 minutes) before letting it displace the
+incumbent. The window prevents a self-monitoring master from flapping
+the role in lock-step with its own restart.
+
+How to confirm:
+
+- `qu status` on every node shows the same (current) master and a
+  steady `term` — not flapping. The lower-NodeID peer is in the live
+  set but not yet master.
+- After ~2 minutes of continuous liveness, `term` bumps once and the
+  master switches to the lower-NodeID peer.
+
+If you need a different window, change `DefaultMasterCooldown` in
+`internal/quorum/manager.go` and rebuild.
+
 ## A check is stuck in `unknown`
 
 **What it means.** The aggregator has no fresh reports for that check.