4 Commits

Author SHA1 Message Date
Axodouble ea30dbb895 Updated changelog for actual v0.1.0
Container image / image (push) Successful in 1m52s
Release / release (push) Successful in 2m1s
2026-05-15 07:37:44 +00:00
Axodouble 1e2e382867 Updated docs, readme, & changelog
Container image / image (push) Successful in 1m40s
2026-05-15 07:36:01 +00:00
Axodouble ed25e9ed68 Fix #3 by adding a cooldown to the master election process
Container image / image (push) Successful in 1m40s
2026-05-15 07:32:15 +00:00
Axodouble c55482664c Fixed install script and socket path not working on older install script, socketpath now honors runtime directory
Container image / image (push) Successful in 1m39s
2026-05-15 07:17:28 +00:00
9 changed files with 302 additions and 9 deletions
+20
@@ -4,6 +4,25 @@ All notable changes to this project are documented here. The format
follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and
this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [v0.1.0] — 2026-05-15
### Changed
- **Master election cooldown (2 min).** A returning peer with a
lower NodeID no longer reclaims master the instant it reappears.
It must stay continuously live for `DefaultMasterCooldown`
(2 minutes) before displacing the incumbent. Bootstrap and
quorum-regained-from-empty still elect immediately; the cooldown
only protects an active incumbent. Fixes #3: a self-monitoring
master (TCP check on its own `:9901`) would otherwise flap the
role in lock-step with its own restart.
### Fixed
- #1: services that were already up were re-alerted as coming back up
  whenever the master went down. `unknown` -> `up` transitions during
  master election are now ignored; `unknown` -> `down` still alerts by design.
## [v0.0.2] — 2026-05-15
### Fixed
@@ -93,3 +112,4 @@ Initial public release.
Planned for a future release.
[v0.0.1]: https://git.cer.sh/axodouble/quptime/releases/tag/v0.0.1
[v0.1.0]: https://git.cer.sh/axodouble/quptime/releases/tag/v0.1.0
+5 -1
@@ -94,7 +94,11 @@ the hysteresis that absorbs network blips.
Master election is deterministic: among the live members of the quorum,
the node with the lexicographically smallest NodeID wins. No
negotiation, no split-brain window. A 2-minute **master cooldown**
keeps the current master in place until a returning lower-NodeID peer
has been continuously live for the full window, so a self-monitoring
master that briefly drops doesn't flap the role back the instant it
reappears.
`cluster.yaml` is the single replicated source of truth (peers, checks,
alerts). Mutations from the CLI route through the master, which bumps a
+29
@@ -118,6 +118,35 @@ The `term` integer in `qu status` is bumped every time the elected
master changes (including transitions to and from "no master"). Use it
to spot flappy clusters.
### Master cooldown
The bare "lowest-live-NodeID wins" rule has one unpleasant edge: if the
primary master is also being monitored by `qu` itself (a TCP check on
its own `:9901`, say), a brief restart causes a master flap *and* a
state flap in lock-step. The new master sees the old master come back
on the next tick and immediately hands the role back, taking the
just-recovering node from `unknown` to `up` with no quiet period.
To absorb that, the quorum manager applies a **master cooldown**
(`DefaultMasterCooldown`, 2 minutes) before a peer with a lower NodeID
may displace the incumbent. The rules:
- The cooldown timer starts on the **first heartbeat after a
dead-after gap** — i.e. when a peer re-enters the live set after
having aged out. Continuous heartbeats never restart it.
- A flap during the cooldown resets the timer; the returning peer
must clear a full fresh window before taking over.
- The cooldown applies **only when an incumbent master exists**.
Bootstrap and quorum-regained-from-empty elect the lowest-NodeID
live peer immediately, because there is no role to protect.
- If the incumbent drops out of the live set, the cooldown is
irrelevant — any live peer may take over without waiting.
The constant lives in `internal/quorum/manager.go`. Lower it for
faster fail-back at the cost of monitoring-self flap risk; raise it
to give a recovering master longer to settle before reclaiming the
role.
## Catch-up when a node reconnects
This is the scenario most people ask about: node C is offline, the
+1
@@ -183,6 +183,7 @@ Options:
| `quorum` | `true` | `false` — no mutations, no alerts. |
| `master` | a NodeID | `(none — ...)` — quorum lost or election in flight. |
| `term` | slow growth | rapid growth → master flapping, network unstable. |
| `master` after a restart of the primary | unchanged for ~2 min, then bumps back | bumps back immediately → cooldown disabled or misconfigured. |
| `config ver` | identical across nodes | divergence → a node is stuck pulling. |
A simple cron sentinel on each node:
+19
@@ -35,6 +35,25 @@ flapping. Causes:
- Heartbeat timeouts (default 4s) are too tight for your inter-node
  link. Rebuild with a higher `DefaultDeadAfter` if you need it.
## Primary master came back but the cluster hasn't switched to it
**What it means.** Working as designed. After a peer with a lower
NodeID rejoins, the quorum manager waits
`DefaultMasterCooldown` (2 minutes) before letting it displace the
incumbent. The window prevents a self-monitoring master from flapping
the role in lock-step with its own restart.
How to confirm:
- `qu status` on every node shows the same (current) master and a
steady `term` — not flapping. The lower-NodeID peer is in the live
set but not yet master.
- After ~2 minutes of continuous liveness, `term` bumps once and the
master switches to the lower-NodeID peer.
If you need a different window, change `DefaultMasterCooldown` in
`internal/quorum/manager.go` and rebuild.
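A quick way to watch the hand-back from a shell (a sketch only — the
`master` and `term` field names come from the `qu status` fields
described elsewhere in this change; exact output formatting may differ):

```sh
# Poll status every 5s. During the cooldown the master and term stay
# flat; when the lower-NodeID peer takes over, term bumps exactly once.
watch -n 5 'qu status | grep -E "master|term"'
```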
## A check is stuck in `unknown`
**What it means.** The aggregator has no fresh reports for that check.
+23 -1
@@ -175,6 +175,21 @@ fi
install -d -o "$SERVICE_USER" -g "$SERVICE_GROUP" -m 0750 "$DATA_DIR"
# Reassert ownership on the dir's contents. Two cases this catches:
# - re-running the installer over a previous install where the
# service user/group changed
# - the operator ran `qu init` or `qu serve` as root once (easy
# mistake: `sudo qu init` is shorter than the documented
# `sudo -u quptime qu init`). When the daemon runs as root its
# DataDir() resolves to /etc/quptime, so any files it writes land
# here owned by root:root mode 0600 — the systemd service then
# fails with `open node.yaml: permission denied`.
# chown -R only changes ownership, not perms, so file modes set by
# the daemon (0600 for node.yaml, 0700 for keys/) are preserved.
if [ -n "$(ls -A "$DATA_DIR" 2>/dev/null)" ]; then
chown -R "$SERVICE_USER:$SERVICE_GROUP" "$DATA_DIR"
fi
echo "> writing $SERVICE_FILE"
cat > "$SERVICE_FILE" <<'EOF'
[Unit]
@@ -252,11 +267,18 @@ Next steps:
# On follower nodes, also set the shared join secret:
# Environment=QUPTIME_CLUSTER_SECRET=<paste from first node>
b) Or run \`qu init\` once explicitly. IMPORTANT: run as the
${SERVICE_USER} user, not root — otherwise node.yaml lands
owned by root and the service can't read it on start.
sudo -u ${SERVICE_USER} QUPTIME_DIR=${DATA_DIR} \\
qu init --advertise <this-host>:9901
If you already ran it as root and the service is failing
with "permission denied" on node.yaml, repair with:
sudo chown -R ${SERVICE_USER}:${SERVICE_GROUP} ${DATA_DIR}
2. Start the service:
sudo systemctl start ${SERVICE_NAME}
+25
@@ -16,6 +16,7 @@ import (
"errors" "errors"
"os" "os"
"path/filepath" "path/filepath"
"strings"
)
// Default file names. Callers should always go through DataDir() so an
@@ -55,10 +56,34 @@ func DataDir() string {
}
// SocketPath returns the unix socket used for local CLI ↔ daemon control.
//
// Resolution order:
// 1. $QUPTIME_SOCKET — explicit operator override
// 2. $RUNTIME_DIRECTORY — set by systemd when the unit declares
// RuntimeDirectory=quptime. This is the path that matters in
// practice: with User=quptime + PrivateTmp=true, the daemon's
// /tmp is namespaced and invisible to the root CLI shell, so a
// /tmp fallback yields "no such file" even though the daemon is
// happily listening. Anchoring on $RUNTIME_DIRECTORY puts the
// socket at /run/quptime/quptime.sock, which is the same inode
// the root-CLI default (/var/run/quptime/…) reaches via the
// /var/run → /run symlink.
// 3. /var/run/quptime/… when euid is 0 (CLI side, packaged installs)
// 4. $XDG_RUNTIME_DIR/quptime/… for user-mode installs
// 5. /tmp/quptime-<user>/… as a last resort
func SocketPath() string {
if v := os.Getenv("QUPTIME_SOCKET"); v != "" {
return v
}
if v := os.Getenv("RUNTIME_DIRECTORY"); v != "" {
// systemd may pass multiple colon-separated entries when more
// than one RuntimeDirectory= is declared. Ours is single, but
// be defensive in case a future unit adds more.
if i := strings.IndexByte(v, ':'); i >= 0 {
v = v[:i]
}
return filepath.Join(v, SocketName)
}
if os.Geteuid() == 0 {
return "/var/run/quptime/" + SocketName
}
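As a usage aside (an assumed invocation, not part of this diff):
resolution step 1 means the explicit override always wins, so a
non-root shell — which would otherwise fall through to the user-mode
paths — can still reach the daemon's socket directly:

```sh
# Assumed one-off invocation: QUPTIME_SOCKET bypasses every other
# resolution step and points the CLI at the daemon's socket.
QUPTIME_SOCKET=/run/quptime/quptime.sock qu status
```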
+54 -2
@@ -34,6 +34,12 @@ import (
const (
DefaultHeartbeatInterval = 1 * time.Second
DefaultDeadAfter = 4 * time.Second
// DefaultMasterCooldown is the grace period a returning peer must
// stay continuously live before it's allowed to displace the
// currently-elected master. Without it, a self-monitoring master
// that briefly drops would reclaim the role immediately on return
// and disrupt anything watching its TCP port.
DefaultMasterCooldown = 2 * time.Minute
)
// VersionObserver is invoked whenever a heartbeat exchange reveals
@@ -50,11 +56,13 @@ type Manager struct {
heartbeatInterval time.Duration
deadAfter time.Duration
masterCooldown time.Duration
mu sync.RWMutex
term uint64
masterID string
lastSeen map[string]time.Time // peerID -> last contact (sent or recv)
liveSince map[string]time.Time // peerID -> start of current liveness streak
addrOf map[string]string // peerID -> advertise addr (last known)
observer VersionObserver
@@ -70,7 +78,9 @@ func New(selfID string, cluster *config.ClusterConfig, client *transport.Client)
client: client,
heartbeatInterval: DefaultHeartbeatInterval,
deadAfter: DefaultDeadAfter,
masterCooldown: DefaultMasterCooldown,
lastSeen: map[string]time.Time{},
liveSince: map[string]time.Time{},
addrOf: map[string]string{},
}
}
@@ -242,7 +252,15 @@ func (m *Manager) tick(ctx context.Context) {
func (m *Manager) markLive(id string) {
m.mu.Lock()
now := time.Now()
prev, ok := m.lastSeen[id]
// A peer entering its first liveness streak — or returning after
// the dead-after window expired — resets liveSince. Subsequent
// heartbeats within the streak leave it untouched.
if !ok || now.Sub(prev) > m.deadAfter {
m.liveSince[id] = now
}
m.lastSeen[id] = now
m.mu.Unlock()
}
@@ -276,7 +294,41 @@ func (m *Manager) recomputeMaster() {
var newMaster string
if len(live) >= quorum && len(live) > 0 {
// Without an incumbent the cluster is bootstrapping or
// has just regained quorum, so elect immediately — there's
// nothing to protect from a handoff.
if m.masterID == "" {
newMaster = live[0]
} else {
newMaster = m.masterID
now := time.Now()
incumbentLive := false
for _, id := range live {
if id == m.masterID {
incumbentLive = true
break
}
}
// If the incumbent is no longer live, any live peer
// may take over without waiting.
if !incumbentLive {
newMaster = live[0]
} else {
// Incumbent is live. A peer with a lower NodeID
// may only displace it after it has stayed
// continuously live for masterCooldown.
for _, id := range live {
if id >= m.masterID {
break // sorted ascending — nobody lower left
}
since, ok := m.liveSince[id]
if ok && now.Sub(since) >= m.masterCooldown {
newMaster = id
break
}
}
}
}
}
if newMaster != m.masterID {
m.term++
+121
@@ -119,6 +119,127 @@ func TestDeadAfterEvictsStaleLiveness(t *testing.T) {
}
}
// heartbeatLoop simulates the production heartbeat cadence — calling
// markLive for the given peers more frequently than deadAfter, so a
// peer that's "live throughout" never has its liveSince reset by the
// dead-after gap heuristic. It returns once dur has elapsed.
func heartbeatLoop(t *testing.T, m *Manager, dur time.Duration, peers ...string) {
t.Helper()
deadline := time.Now().Add(dur)
interval := m.deadAfter / 4
if interval < time.Millisecond {
interval = time.Millisecond
}
for time.Now().Before(deadline) {
for _, p := range peers {
m.markLive(p)
}
m.recomputeMaster()
time.Sleep(interval)
}
}
func TestReturningLowerIDWaitsForCooldown(t *testing.T) {
_, m := threeNode("b")
m.deadAfter = 80 * time.Millisecond
m.masterCooldown = 200 * time.Millisecond
// Bootstrap: all three live, "a" elected.
m.markLive("a")
m.markLive("b")
m.markLive("c")
m.recomputeMaster()
if m.Master() != "a" {
t.Fatalf("initial master=%q want a", m.Master())
}
// "a" drops — only b/c heartbeat. Long enough to age a out and let
// b take over.
heartbeatLoop(t, m, 120*time.Millisecond, "b", "c")
if m.Master() != "b" {
t.Fatalf("after a-drop master=%q want b", m.Master())
}
// "a" returns. Verify b stays master for less than the cooldown.
heartbeatLoop(t, m, 120*time.Millisecond, "a", "b", "c")
if m.Master() != "b" {
t.Errorf("mid-cooldown master=%q want b", m.Master())
}
// Past the cooldown, a reclaims master.
heartbeatLoop(t, m, 120*time.Millisecond, "a", "b", "c")
if m.Master() != "a" {
t.Errorf("after cooldown master=%q want a", m.Master())
}
}
func TestCooldownResetsOnFlap(t *testing.T) {
_, m := threeNode("b")
m.deadAfter = 80 * time.Millisecond
m.masterCooldown = 200 * time.Millisecond
m.markLive("a")
m.markLive("b")
m.markLive("c")
m.recomputeMaster()
// a drops, b becomes master.
heartbeatLoop(t, m, 120*time.Millisecond, "b", "c")
if m.Master() != "b" {
t.Fatalf("master=%q want b", m.Master())
}
// a returns briefly, then drops again before cooldown elapses.
heartbeatLoop(t, m, 100*time.Millisecond, "a", "b", "c")
if m.Master() != "b" {
t.Fatalf("during first cooldown master=%q want b", m.Master())
}
heartbeatLoop(t, m, 120*time.Millisecond, "b", "c") // a ages out again
if m.Master() != "b" {
t.Fatalf("after a-reflap master=%q want b", m.Master())
}
// a returns for the second time — cooldown restarts here.
// Wait less than a full cooldown — b should still be master.
heartbeatLoop(t, m, 100*time.Millisecond, "a", "b", "c")
if m.Master() != "b" {
t.Errorf("partway through fresh cooldown master=%q want b", m.Master())
}
// Past the full fresh cooldown, a takes over.
heartbeatLoop(t, m, 150*time.Millisecond, "a", "b", "c")
if m.Master() != "a" {
t.Errorf("after fresh cooldown master=%q want a", m.Master())
}
}
func TestNewMasterAfterQuorumLossIgnoresCooldown(t *testing.T) {
_, m := threeNode("b")
m.deadAfter = 50 * time.Millisecond
m.masterCooldown = 1 * time.Hour // would block election if applied
// Bootstrap into no-master state by letting all peers age out.
m.markLive("a")
m.markLive("b")
m.markLive("c")
m.recomputeMaster()
time.Sleep(80 * time.Millisecond)
m.markLive("b")
m.recomputeMaster()
if m.Master() != "" {
t.Fatalf("master=%q want empty (quorum lost)", m.Master())
}
// Quorum regained — incumbent is empty, election must be immediate.
m.markLive("a")
m.markLive("b")
m.recomputeMaster()
if m.Master() != "a" {
t.Errorf("post-recovery master=%q want a (no cooldown when empty)", m.Master())
}
}
func TestVersionObserverFiresOnHigherVersion(t *testing.T) {
cluster := &config.ClusterConfig{Version: 2}
m := New("a", cluster, nil)