@@ -6,12 +6,22 @@ this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

## [Unreleased]

### Changed

- **Master election cooldown (2 min).** A returning peer with a
  lower NodeID no longer reclaims master the instant it reappears.
  It must stay continuously live for `DefaultMasterCooldown`
  (2 minutes) before displacing the incumbent. Bootstrap and
  quorum-regained-from-empty still elect immediately; the cooldown
  only protects an active incumbent. Fixes #3: a self-monitoring
  master (TCP check on its own `:9901`) would otherwise flap the
  role in lock-step with its own restart.

### Fixed

- #1 Previously up services are alerted as going back up if the master
  goes down. Ignore `unknown` -> `up` transitions during master
  election; still alert on `unknown` -> `down` by design.

## [v0.0.2] — 2026-05-15
@@ -94,7 +94,11 @@ the hysteresis that absorbs network blips.

Master election is deterministic: among the live members of the quorum,
the node with the lexicographically smallest NodeID wins. No
negotiation, no split-brain window. A 2-minute **master cooldown**
keeps the current master in place until a returning lower-NodeID peer
has been continuously live for the full window, so a self-monitoring
master that briefly drops doesn't flap the role back the instant it
reappears.
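
The deterministic rule fits in a few lines; a stand-alone sketch (the `electMaster` name and signature are illustrative, not the project's actual API):

```go
package main

import (
	"fmt"
	"sort"
)

// electMaster picks the lexicographically smallest NodeID among the
// live peers, mirroring the deterministic rule above. It returns ""
// when the live set is empty (no quorum, no master).
func electMaster(live []string) string {
	if len(live) == 0 {
		return ""
	}
	// Copy before sorting so the caller's slice is untouched.
	ids := append([]string(nil), live...)
	sort.Strings(ids)
	return ids[0]
}

func main() {
	// "node-a" wins regardless of input order.
	fmt.Println(electMaster([]string{"node-b", "node-a", "node-c"})) // node-a
}
```

Because every node evaluates the same pure function over the same live set, no negotiation round is needed.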

`cluster.yaml` is the single replicated source of truth (peers, checks,
alerts). Mutations from the CLI route through the master, which bumps a
@@ -118,6 +118,35 @@ The `term` integer in `qu status` is bumped every time the elected

master changes (including transitions to and from "no master"). Use it
to spot flappy clusters.

### Master cooldown

The bare "lowest-live-NodeID wins" rule has one unpleasant edge: if the
primary master is also being monitored by `qu` itself (a TCP check on
its own `:9901`, say), a brief restart causes a master flap *and* a
state flap in lock-step. The new master sees the old master come back
on the next tick and immediately hands the role back, taking the
just-recovering node from `unknown` to `up` with no quiet period.

To absorb that, the quorum manager applies a **master cooldown**
(`DefaultMasterCooldown`, 2 minutes) before a peer with a lower NodeID
may displace the incumbent. The rules:

- The cooldown timer starts on the **first heartbeat after a
  dead-after gap** — i.e. when a peer re-enters the live set after
  having aged out. Continuous heartbeats never restart it.
- A flap during the cooldown resets the timer; the returning peer
  must clear a full fresh window before taking over.
- The cooldown applies **only when an incumbent master exists**.
  Bootstrap and quorum-regained-from-empty elect the lowest-NodeID
  live peer immediately, because there is no role to protect.
- If the incumbent drops out of the live set, the cooldown is
  irrelevant — any live peer may take over without waiting.

The constant lives in `internal/quorum/manager.go`. Lower it for
faster fail-back at the cost of monitoring-self flap risk; raise it
to give a recovering master longer to settle before reclaiming the
role.
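
The changelog's companion fix (ignore `unknown` -> `up` during the cooldown, still alert on `unknown` -> `down`) is likewise a small predicate. A sketch with illustrative names, not `qu`'s actual types:

```go
package main

import "fmt"

// shouldAlert reports whether a check-state transition should page.
// During the master cooldown an unknown -> up transition is the
// just-recovered-master artifact and is suppressed; unknown -> down
// is real news and always alerts.
func shouldAlert(prev, next string, inCooldown bool) bool {
	if prev == next {
		return false // no transition, nothing to report
	}
	if inCooldown && prev == "unknown" && next == "up" {
		return false // suppressed by design
	}
	return true
}

func main() {
	fmt.Println(shouldAlert("unknown", "up", true))   // false: suppressed
	fmt.Println(shouldAlert("unknown", "down", true)) // true: alerts by design
}
```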
## Catch-up when a node reconnects
This is the scenario most people ask about: node C is offline, the
@@ -183,6 +183,7 @@ Options:

| `quorum` | `true` | `false` — no mutations, no alerts. |
| `master` | a NodeID | `(none — ...)` — quorum lost or election in flight. |
| `term` | slow growth | rapid growth → master flapping, network unstable. |
| `master` after a restart of the primary | unchanged for ~2 min, then bumps back | bumps back immediately → cooldown disabled or misconfigured. |
| `config ver` | identical across nodes | divergence → a node is stuck pulling. |

A simple cron sentinel on each node:

@@ -35,6 +35,25 @@ flapping. Causes:

- Heartbeat timeouts (default 4s) are too tight for your inter-node
  link. Rebuild with a higher `DefaultDeadAfter` if you need it.

## Primary master came back but the cluster hasn't switched to it

**What it means.** Working as designed. After a returning peer with a
lower NodeID rejoins, the quorum manager waits
`DefaultMasterCooldown` (2 minutes) before letting it displace the
incumbent. The window prevents a self-monitoring master from flapping
the role in lock-step with its own restart.

How to confirm:

- `qu status` on every node shows the same (current) master and a
  steady `term` — not flapping. The lower-NodeID peer is in the live
  set but not yet master.
- After ~2 minutes of continuous liveness, `term` bumps once and the
  master switches to the lower-NodeID peer.

If you need a different window, change `DefaultMasterCooldown` in
`internal/quorum/manager.go` and rebuild.
## A check is stuck in `unknown`
**What it means.** The aggregator has no fresh reports for that check.