Compare commits

9 commits

| SHA1 |
|---|
| a1d74cf36d |
| f60b0a0609 |
| ea30dbb895 |
| 1e2e382867 |
| ed25e9ed68 |
| c55482664c |
| 3c85caabcf |
| 8638ab5432 |
| a11b31f160 |
@@ -4,6 +4,57 @@ All notable changes to this project are documented here. The format
follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and
this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [v0.1.1] — 2026-05-15

### Changed

- **`install.sh` now repairs data-dir permissions on every run.**
  Re-running the installer reasserts the canonical ownership
  (`quptime:quptime`) and modes across `/etc/quptime/` — `0750` on
  the dir, `0700` on `keys/`, `0600` on `node.yaml`, `cluster.yaml`,
  `trust.yaml`, and `keys/private.pem`, `0644` on `keys/public.pem`
  and `keys/cert.pem`. Makes the installer the one-step recovery
  path when something has tampered with modes (e.g. a stray
  `chmod -R`, a backup restore, or an accidental `sudo qu init`
  that left files owned by root). Unknown files in the dir are left
  alone.

### Fixed

- **CLI socket lookup as the daemon user.** `sudo -u quptime qu …`
  no longer fails with `dial daemon socket /tmp/quptime-quptime/…:
  no such file or directory` while the system daemon is running.
  `config.SocketPath()` now probes the canonical systemd location
  (`/run/quptime/quptime.sock`, then `/var/run/quptime/quptime.sock`)
  regardless of euid before falling back to per-user paths, so the
  CLI reaches the daemon's socket even when `sudo` has stripped
  `RUNTIME_DIRECTORY` and `XDG_RUNTIME_DIR` from the environment.

## [v0.1.0] — 2026-05-15

### Changed

- **Master election cooldown (2 min).** A returning peer with a
  lower NodeID no longer reclaims master the instant it reappears.
  It must stay continuously live for `DefaultMasterCooldown`
  (2 minutes) before displacing the incumbent. Bootstrap and
  quorum-regained-from-empty still elect immediately; the cooldown
  only protects an active incumbent. Fixes #3: a self-monitoring
  master (TCP check on its own `:9901`) would otherwise flap the
  role in lock-step with its own restart.

### Fixed

- #1: services that were already up were re-alerted as coming back
  up whenever the master went down. `unknown` -> `up` transitions are
  now ignored during master election; `unknown` -> `down` still
  alerts, by design.

## [v0.0.2] — 2026-05-15

### Fixed

- The text template field in the TUI did not support newlines, so
  multi-line templates rendered as a single line and lost their
  formatting. Fixed by changing the field to a textarea and letting
  the `enter` key insert newlines.

## [v0.0.1] — 2026-05-15

Initial public release.

@@ -87,3 +138,5 @@ Initial public release.

Planned for a future release.

[v0.0.1]: https://git.cer.sh/axodouble/quptime/releases/tag/v0.0.1
[v0.1.0]: https://git.cer.sh/axodouble/quptime/releases/tag/v0.1.0
[v0.1.1]: https://git.cer.sh/axodouble/quptime/releases/tag/v0.1.1
@@ -94,7 +94,11 @@ the hysteresis that absorbs network blips.

Master election is deterministic: among the live members of the quorum,
the node with the lexicographically smallest NodeID wins. No
negotiation, no split-brain window.
negotiation, no split-brain window. A 2-minute **master cooldown**
keeps the current master in place until a returning lower-NodeID peer
has been continuously live for the full window, so a self-monitoring
master that briefly drops doesn't flap the role back the instant it
reappears.

`cluster.yaml` is the single replicated source of truth (peers, checks,
alerts). Mutations from the CLI route through the master, which bumps a
@@ -118,6 +118,35 @@ The `term` integer in `qu status` is bumped every time the elected
master changes (including transitions to and from "no master"). Use it
to spot flappy clusters.

### Master cooldown

The bare "lowest-live-NodeID wins" rule has one unpleasant edge: if the
primary master is also being monitored by `qu` itself (a TCP check on
its own `:9901`, say), a brief restart causes a master flap *and* a
state flap in lock-step. The new master sees the old master come back
on the next tick and immediately hands the role back, taking the
just-recovering node from `unknown` to `up` with no quiet period.

To absorb that, the quorum manager applies a **master cooldown**
(`DefaultMasterCooldown`, 2 minutes) before a peer with a lower NodeID
may displace the incumbent. The rules:

- The cooldown timer starts on the **first heartbeat after a
  dead-after gap** — i.e. when a peer re-enters the live set after
  having aged out. Continuous heartbeats never restart it.
- A flap during the cooldown resets the timer; the returning peer
  must clear a full fresh window before taking over.
- The cooldown applies **only when an incumbent master exists**.
  Bootstrap and quorum-regained-from-empty elect the lowest-NodeID
  live peer immediately, because there is no role to protect.
- If the incumbent drops out of the live set, the cooldown is
  irrelevant — any live peer may take over without waiting.

The constant lives in `internal/quorum/manager.go`. Lower it for
faster fail-back at the cost of monitoring-self flap risk; raise it
to give a recovering master longer to settle before reclaiming the
role.
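
Condensed, the displacement decision looks roughly like the sketch below. It is distilled from `recomputeMaster()` in `internal/quorum/manager.go` (shown in full later in this compare), not a verbatim copy; `live` is sorted ascending by NodeID and `liveSince` tracks the start of each peer's current liveness streak.

```go
package quorumsketch

import "time"

// pickMaster condenses the election rule: no incumbent means elect
// immediately; a live incumbent is only displaced by a lower NodeID
// that has stayed continuously live for the full cooldown.
func pickMaster(incumbent string, live []string,
	liveSince map[string]time.Time, cooldown time.Duration, now time.Time) string {
	if len(live) == 0 {
		return "" // quorum handling happens before this point
	}
	if incumbent == "" {
		return live[0] // bootstrap or quorum regained from empty
	}
	incumbentLive := false
	for _, id := range live {
		if id == incumbent {
			incumbentLive = true
			break
		}
	}
	if !incumbentLive {
		return live[0] // incumbent aged out; nothing to protect
	}
	for _, id := range live {
		if id >= incumbent {
			break // sorted ascending; no lower IDs remain
		}
		if since, ok := liveSince[id]; ok && now.Sub(since) >= cooldown {
			return id // cooldown cleared, lower NodeID takes over
		}
	}
	return incumbent
}
```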

## Catch-up when a node reconnects

This is the scenario most people ask about: node C is offline, the
@@ -70,6 +70,15 @@ What it does:

   `/etc/systemd/system/quptime.service` (hardened — matches the unit
   in [systemd.md](deployment/systemd.md)). Enables but does not start
   the service, so you can configure identity before first boot.
5. Repairs ownership and modes under `/etc/quptime/` to the canonical
   layout (`0750` on the dir, `0700` on `keys/`, `0600` on
   `node.yaml` / `cluster.yaml` / `trust.yaml` / `keys/private.pem`,
   `0644` on `keys/public.pem` / `keys/cert.pem`). This makes the
   installer idempotent for permission damage — if something
   tightened or loosened modes (a stray `chmod -R`, a misguided
   backup restore, an accidental `sudo qu init`), re-running
   `install.sh` puts everything back without touching the contents
   of those files.
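
These modes mirror what the daemon itself enforces when it writes its files (the installer comments point at `internal/config` and `internal/crypto`). A rough sketch of that convention, using a hypothetical `writeSecret` helper rather than anything from the repo:

```go
package configsketch

import (
	"os"
	"path/filepath"
)

// writeSecret illustrates the mode convention install.sh reasserts:
// 0750 on the data dir itself and 0600 on secret files such as
// node.yaml or cluster.yaml. (keys/ and the public key material get
// their own 0700 / 0644 modes, per the layout above.)
func writeSecret(dataDir, name string, data []byte) error {
	if err := os.MkdirAll(dataDir, 0o750); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dataDir, name), data, 0o600)
}
```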

## Build from source

@@ -183,6 +183,7 @@ Options:

| `quorum` | `true` | `false` — no mutations, no alerts. |
| `master` | a NodeID | `(none — ...)` — quorum lost or election in flight. |
| `term` | slow growth | rapid growth → master flapping, network unstable. |
| `master` after a restart of the primary | unchanged for ~2 min, then switches back | switches back immediately → cooldown disabled or misconfigured. |
| `config ver` | identical across nodes | divergence → a node is stuck pulling. |

A simple cron sentinel on each node:

+25 -2
@@ -35,6 +35,25 @@ flapping. Causes:

- Heartbeat timeouts (default 4s) are too tight for your inter-node
  link. Rebuild with a higher `DefaultDeadAfter` if you need it.

## Primary master came back but the cluster hasn't switched to it

**What it means.** Working as designed. After a returning peer with a
lower NodeID rejoins, the quorum manager waits
`DefaultMasterCooldown` (2 minutes) before letting it displace the
incumbent. The window prevents a self-monitoring master from flapping
the role in lock-step with its own restart.

How to confirm:

- `qu status` on every node shows the same (current) master and a
  steady `term` — not flapping. The lower-NodeID peer is in the live
  set but not yet master.
- After ~2 minutes of continuous liveness, `term` bumps once and the
  master switches to the lower-NodeID peer.

If you need a different window, change `DefaultMasterCooldown` in
`internal/quorum/manager.go` and rebuild.
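
For convenience, the relevant declaration (it also appears in the quorum manager hunk later in this compare), with the trade-off noted inline:

```go
package quorum

import "time"

const (
	DefaultHeartbeatInterval = 1 * time.Second
	DefaultDeadAfter         = 4 * time.Second
	// Shorten for faster fail-back to the primary (at the cost of
	// re-opening the self-monitoring flap window); lengthen to give
	// a recovering master more time to settle before it reclaims
	// the role.
	DefaultMasterCooldown = 2 * time.Minute
)
```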

## A check is stuck in `unknown`

**What it means.** The aggregator has no fresh reports for that check.
@@ -153,7 +172,9 @@ still see this error, the most likely causes are:

- The data directory is read-only or owned by a different user — the
  bootstrap can't write `node.yaml`. Fix permissions on
  `$QUPTIME_DIR`.
  `$QUPTIME_DIR`. The fastest fix on a standard install is just to
  re-run `install.sh` — it reasserts the canonical ownership and
  modes on the whole tree without touching your config.
- Something else removed `node.yaml` mid-run (a config-management
  tool, a misconfigured volume). Re-run `qu serve` and it will
  rebuild from env, or run `qu init` manually with the flags you
@@ -178,7 +199,9 @@ load private key: ...

Permissions on `keys/private.pem` are wrong — should be 0600 and owned
by the daemon user. Fix and restart.
by the daemon user. Fix and restart. Re-running `install.sh` on a
standard install is the easiest path: it repairs ownership and modes
on the entire data dir.

## Probes look much slower than expected

+65 -1
@@ -175,6 +175,63 @@ fi

install -d -o "$SERVICE_USER" -g "$SERVICE_GROUP" -m 0750 "$DATA_DIR"

# Repair ownership and permissions on the data dir's contents. Catches:
#   - re-running the installer over a previous install where the
#     service user/group changed.
#   - the operator ran `qu init` or `qu serve` as root once (easy
#     mistake: `sudo qu init` is shorter than the documented
#     `sudo -u quptime qu init`). When the daemon runs as root its
#     DataDir() resolves to /etc/quptime, so any files it writes land
#     owned by root:root — the systemd service then fails with
#     `open node.yaml: permission denied`.
#   - someone or something (a stray `chmod -R`, a misguided backup
#     restore) tightened or loosened modes. Re-running the installer
#     should be enough to get back to a working baseline.
# The canonical layout (mirrors the modes the daemon writes itself
# in internal/config and internal/crypto):
#   /etc/quptime/                 quptime:quptime 0750
#   /etc/quptime/keys/            quptime:quptime 0700
#   /etc/quptime/node.yaml        quptime:quptime 0600
#   /etc/quptime/cluster.yaml     quptime:quptime 0600
#   /etc/quptime/trust.yaml       quptime:quptime 0600
#   /etc/quptime/keys/private.pem quptime:quptime 0600
#   /etc/quptime/keys/public.pem  quptime:quptime 0644
#   /etc/quptime/keys/cert.pem    quptime:quptime 0644
# The runtime dir /var/run/quptime is owned by systemd via
# RuntimeDirectory= and rebuilt at each service start, so we leave it
# alone.
repair_perms() {
    # Always reset the top-level dir mode — `install -d` only sets it
    # on creation, not on re-run.
    chown "$SERVICE_USER:$SERVICE_GROUP" "$DATA_DIR"
    chmod 0750 "$DATA_DIR"

    # Reassert ownership across the whole tree in one pass.
    if [ -n "$(ls -A "$DATA_DIR" 2>/dev/null)" ]; then
        chown -R "$SERVICE_USER:$SERVICE_GROUP" "$DATA_DIR"
    fi

    # keys/ is a directory with its own tighter mode.
    if [ -d "$DATA_DIR/keys" ]; then
        chmod 0700 "$DATA_DIR/keys"
    fi

    # Each known file gets its canonical mode if it exists. We don't
    # create anything that isn't already there — that's `qu init`'s
    # job — and we don't touch unknown files an operator may have
    # parked in the dir.
    local f
    for f in node.yaml cluster.yaml trust.yaml keys/private.pem; do
        [ -f "$DATA_DIR/$f" ] && chmod 0600 "$DATA_DIR/$f"
    done
    for f in keys/public.pem keys/cert.pem; do
        [ -f "$DATA_DIR/$f" ] && chmod 0644 "$DATA_DIR/$f"
    done
}

repair_perms
echo "> reasserted ownership ($SERVICE_USER:$SERVICE_GROUP) and modes under $DATA_DIR"

echo "> writing $SERVICE_FILE"
cat > "$SERVICE_FILE" <<'EOF'
[Unit]
@@ -252,11 +309,18 @@ Next steps:

        # On follower nodes, also set the shared join secret:
        # Environment=QUPTIME_CLUSTER_SECRET=<paste from first node>

     b) Or run \`qu init\` once explicitly:
     b) Or run \`qu init\` once explicitly. IMPORTANT: run as the
        ${SERVICE_USER} user, not root — otherwise node.yaml lands
        owned by root and the service can't read it on start.

        sudo -u ${SERVICE_USER} QUPTIME_DIR=${DATA_DIR} \\
            qu init --advertise <this-host>:9901

        If you already ran it as root and the service is failing
        with "permission denied" on node.yaml, repair with:

        sudo chown -R ${SERVICE_USER}:${SERVICE_GROUP} ${DATA_DIR}

  2. Start the service:

        sudo systemctl start ${SERVICE_NAME}

@@ -25,12 +25,20 @@ type discordPayload struct {

}

// sendDiscord posts msg.Subject + body to the configured webhook URL.
// When the alert has a custom BodyTemplate, the rendered body is shipped
// verbatim — the operator has opted out of the default subject header
// and code-block wrapping in favour of their own formatting.
func sendDiscord(a *config.Alert, msg Message) error {
	if a.DiscordWebhook == "" {
		return errors.New("discord webhook url not set")
	}

	content := msg.Subject + "\n```\n" + msg.Body + "\n```"
	var content string
	if a.BodyTemplate != "" {
		content = msg.Body
	} else {
		content = msg.Subject + "\n```\n" + msg.Body + "\n```"
	}
	raw, err := json.Marshal(discordPayload{Content: content})
	if err != nil {
		return err
@@ -27,7 +27,7 @@ func New(cluster *config.ClusterConfig, selfID string, logger *log.Logger) *Disp

// OnTransition is wired as checks.TransitionFn.
func (d *Dispatcher) OnTransition(check *config.Check, from, to checks.State, snap checks.Snapshot) {
	if to == checks.StateUnknown {
	if !shouldAlert(from, to) {
		return
	}
	alerts := d.cluster.EffectiveAlertsFor(check)
@@ -77,6 +77,25 @@ func (d *Dispatcher) Test(alertID string) error {

	return d.dispatchOne(alert, msg)
}

// shouldAlert decides whether a committed state transition warrants
// firing the configured alert channels.
//
// A fresh master's aggregator starts every check at StateUnknown, so
// the first successful evaluation always commits Unknown→Up. Without
// filtering, every master failover (or daemon restart) would spam an
// "is now UP" alert for every healthy check. We treat Unknown→Up as a
// silent cold start; real recoveries (Down→Up) and any transition to
// Down still alert.
func shouldAlert(from, to checks.State) bool {
	if to == checks.StateUnknown {
		return false
	}
	if from == checks.StateUnknown && to == checks.StateUp {
		return false
	}
	return true
}

func (d *Dispatcher) dispatchOne(a *config.Alert, msg Message) error {
	switch a.Type {
	case config.AlertSMTP:
@@ -0,0 +1,30 @@

package alerts

import (
	"testing"

	"git.cer.sh/axodouble/quptime/internal/checks"
)

func TestShouldAlertFiltersColdStartUp(t *testing.T) {
	cases := []struct {
		name string
		from checks.State
		to   checks.State
		want bool
	}{
		{"cold start to up (master failover / daemon restart)", checks.StateUnknown, checks.StateUp, false},
		{"cold start to down still alerts", checks.StateUnknown, checks.StateDown, true},
		{"real recovery alerts", checks.StateDown, checks.StateUp, true},
		{"regression alerts", checks.StateUp, checks.StateDown, true},
		{"stale (up to unknown) suppressed", checks.StateUp, checks.StateUnknown, false},
		{"stale (down to unknown) suppressed", checks.StateDown, checks.StateUnknown, false},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			if got := shouldAlert(c.from, c.to); got != c.want {
				t.Errorf("shouldAlert(%s→%s) = %v, want %v", c.from, c.to, got, c.want)
			}
		})
	}
}
@@ -16,6 +16,7 @@ import (

	"errors"
	"os"
	"path/filepath"
	"strings"
)

// Default file names. Callers should always go through DataDir() so an
@@ -55,10 +56,47 @@ func DataDir() string {

}

// SocketPath returns the unix socket used for local CLI ↔ daemon control.
//
// Resolution order:
//  1. $QUPTIME_SOCKET — explicit operator override.
//  2. $RUNTIME_DIRECTORY — set by systemd when the unit declares
//     RuntimeDirectory=quptime. This is the path the daemon uses
//     when run under the packaged unit: /run/quptime/quptime.sock.
//  3. The canonical system socket path — /run/quptime/quptime.sock —
//     if it exists. This catches the CLI side regardless of who is
//     invoking it: `sudo -u quptime qu status` strips RUNTIME_DIRECTORY
//     and XDG_RUNTIME_DIR, so without this probe the CLI falls all
//     the way through to /tmp/quptime-<user>/… and reports "no such
//     file" even while the daemon is happily listening.
//  4. /var/run/quptime/… when euid is 0 (CLI side, packaged installs
//     on systems where /var/run isn't a symlink to /run).
//  5. $XDG_RUNTIME_DIR/quptime/… for user-mode installs.
//  6. /tmp/quptime-<user>/… as a last resort.
func SocketPath() string {
	if v := os.Getenv("QUPTIME_SOCKET"); v != "" {
		return v
	}
	if v := os.Getenv("RUNTIME_DIRECTORY"); v != "" {
		// systemd may pass multiple colon-separated entries when more
		// than one RuntimeDirectory= is declared. Ours is single, but
		// be defensive in case a future unit adds more.
		if i := strings.IndexByte(v, ':'); i >= 0 {
			v = v[:i]
		}
		return filepath.Join(v, SocketName)
	}
	// If a system-managed daemon is already listening, route there
	// regardless of euid. Without this, `sudo -u quptime qu …` can't
	// find the socket the daemon (also running as quptime) created
	// via RuntimeDirectory=.
	for _, p := range []string{
		"/run/quptime/" + SocketName,
		"/var/run/quptime/" + SocketName,
	} {
		if _, err := os.Stat(p); err == nil {
			return p
		}
	}
	if os.Geteuid() == 0 {
		return "/var/run/quptime/" + SocketName
	}
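
In practice the CLI side just dials whatever `SocketPath()` returns. A minimal sketch of a control-client call, assuming a hypothetical `dialControl` helper rather than quoting the repo's own client code (as an `internal` package, `config` is only importable from inside the quptime module):

```go
package clientsketch

import (
	"fmt"
	"net"

	"git.cer.sh/axodouble/quptime/internal/config"
)

// dialControl resolves the daemon's control socket once and dials it.
// SocketPath() does all the probing described above, so this works
// under the packaged unit, under `sudo -u quptime`, and for per-user
// installs alike.
func dialControl() (net.Conn, error) {
	path := config.SocketPath()
	conn, err := net.Dial("unix", path)
	if err != nil {
		return nil, fmt.Errorf("dial daemon socket %s: %w", path, err)
	}
	return conn, nil
}
```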

@@ -34,6 +34,12 @@ import (

const (
	DefaultHeartbeatInterval = 1 * time.Second
	DefaultDeadAfter         = 4 * time.Second
	// DefaultMasterCooldown is the grace period a returning peer must
	// stay continuously live before it's allowed to displace the
	// currently-elected master. Without it, a self-monitoring master
	// that briefly drops would reclaim the role immediately on return
	// and disrupt anything watching its TCP port.
	DefaultMasterCooldown = 2 * time.Minute
)
// VersionObserver is invoked whenever a heartbeat exchange reveals
|
||||
@@ -50,12 +56,14 @@ type Manager struct {
|
||||
|
||||
heartbeatInterval time.Duration
|
||||
deadAfter time.Duration
|
||||
masterCooldown time.Duration
|
||||
|
||||
mu sync.RWMutex
|
||||
term uint64
|
||||
masterID string
|
||||
lastSeen map[string]time.Time // peerID -> last contact (sent or recv)
|
||||
addrOf map[string]string // peerID -> advertise addr (last known)
|
||||
mu sync.RWMutex
|
||||
term uint64
|
||||
masterID string
|
||||
lastSeen map[string]time.Time // peerID -> last contact (sent or recv)
|
||||
liveSince map[string]time.Time // peerID -> start of current liveness streak
|
||||
addrOf map[string]string // peerID -> advertise addr (last known)
|
||||
|
||||
observer VersionObserver
|
||||
}
|
||||
@@ -70,7 +78,9 @@ func New(selfID string, cluster *config.ClusterConfig, client *transport.Client)

		client:            client,
		heartbeatInterval: DefaultHeartbeatInterval,
		deadAfter:         DefaultDeadAfter,
		masterCooldown:    DefaultMasterCooldown,
		lastSeen:          map[string]time.Time{},
		liveSince:         map[string]time.Time{},
		addrOf:            map[string]string{},
	}
}
@@ -242,7 +252,15 @@ func (m *Manager) tick(ctx context.Context) {

func (m *Manager) markLive(id string) {
	m.mu.Lock()
	m.lastSeen[id] = time.Now()
	now := time.Now()
	prev, ok := m.lastSeen[id]
	// A peer entering its first liveness streak — or returning after
	// the dead-after window expired — resets liveSince. Subsequent
	// heartbeats within the streak leave it untouched.
	if !ok || now.Sub(prev) > m.deadAfter {
		m.liveSince[id] = now
	}
	m.lastSeen[id] = now
	m.mu.Unlock()
}
@@ -276,7 +294,41 @@ func (m *Manager) recomputeMaster() {

	var newMaster string
	if len(live) >= quorum && len(live) > 0 {
		newMaster = live[0] // lowest NodeID wins
		// Without an incumbent the cluster is bootstrapping or
		// has just regained quorum, so elect immediately — there's
		// nothing to protect from a handoff.
		if m.masterID == "" {
			newMaster = live[0]
		} else {
			newMaster = m.masterID
			now := time.Now()
			incumbentLive := false
			for _, id := range live {
				if id == m.masterID {
					incumbentLive = true
					break
				}
			}
			// If the incumbent is no longer live, any live peer
			// may take over without waiting.
			if !incumbentLive {
				newMaster = live[0]
			} else {
				// Incumbent is live. A peer with a lower NodeID
				// may only displace it after it has stayed
				// continuously live for masterCooldown.
				for _, id := range live {
					if id >= m.masterID {
						break // sorted ascending — nobody lower left
					}
					since, ok := m.liveSince[id]
					if ok && now.Sub(since) >= m.masterCooldown {
						newMaster = id
						break
					}
				}
			}
		}
	}
	if newMaster != m.masterID {
		m.term++
@@ -119,6 +119,127 @@ func TestDeadAfterEvictsStaleLiveness(t *testing.T) {

	}
}

// heartbeatLoop simulates the production heartbeat cadence — calling
// markLive for the given peers more frequently than deadAfter, so a
// peer that's "live throughout" never has its liveSince reset by the
// dead-after gap heuristic. It returns once dur has elapsed.
func heartbeatLoop(t *testing.T, m *Manager, dur time.Duration, peers ...string) {
	t.Helper()
	deadline := time.Now().Add(dur)
	interval := m.deadAfter / 4
	if interval < time.Millisecond {
		interval = time.Millisecond
	}
	for time.Now().Before(deadline) {
		for _, p := range peers {
			m.markLive(p)
		}
		m.recomputeMaster()
		time.Sleep(interval)
	}
}

func TestReturningLowerIDWaitsForCooldown(t *testing.T) {
	_, m := threeNode("b")
	m.deadAfter = 80 * time.Millisecond
	m.masterCooldown = 200 * time.Millisecond

	// Bootstrap: all three live, "a" elected.
	m.markLive("a")
	m.markLive("b")
	m.markLive("c")
	m.recomputeMaster()
	if m.Master() != "a" {
		t.Fatalf("initial master=%q want a", m.Master())
	}

	// "a" drops — only b/c heartbeat. Long enough to age a out and let
	// b take over.
	heartbeatLoop(t, m, 120*time.Millisecond, "b", "c")
	if m.Master() != "b" {
		t.Fatalf("after a-drop master=%q want b", m.Master())
	}

	// "a" returns. Verify b stays master for less than the cooldown.
	heartbeatLoop(t, m, 120*time.Millisecond, "a", "b", "c")
	if m.Master() != "b" {
		t.Errorf("mid-cooldown master=%q want b", m.Master())
	}

	// Past the cooldown, a reclaims master.
	heartbeatLoop(t, m, 120*time.Millisecond, "a", "b", "c")
	if m.Master() != "a" {
		t.Errorf("after cooldown master=%q want a", m.Master())
	}
}

func TestCooldownResetsOnFlap(t *testing.T) {
	_, m := threeNode("b")
	m.deadAfter = 80 * time.Millisecond
	m.masterCooldown = 200 * time.Millisecond

	m.markLive("a")
	m.markLive("b")
	m.markLive("c")
	m.recomputeMaster()

	// a drops, b becomes master.
	heartbeatLoop(t, m, 120*time.Millisecond, "b", "c")
	if m.Master() != "b" {
		t.Fatalf("master=%q want b", m.Master())
	}

	// a returns briefly, then drops again before cooldown elapses.
	heartbeatLoop(t, m, 100*time.Millisecond, "a", "b", "c")
	if m.Master() != "b" {
		t.Fatalf("during first cooldown master=%q want b", m.Master())
	}
	heartbeatLoop(t, m, 120*time.Millisecond, "b", "c") // a ages out again
	if m.Master() != "b" {
		t.Fatalf("after a-reflap master=%q want b", m.Master())
	}

	// a returns for the second time — cooldown restarts here.
	// Wait less than a full cooldown — b should still be master.
	heartbeatLoop(t, m, 100*time.Millisecond, "a", "b", "c")
	if m.Master() != "b" {
		t.Errorf("partway through fresh cooldown master=%q want b", m.Master())
	}

	// Past the full fresh cooldown, a takes over.
	heartbeatLoop(t, m, 150*time.Millisecond, "a", "b", "c")
	if m.Master() != "a" {
		t.Errorf("after fresh cooldown master=%q want a", m.Master())
	}
}

func TestNewMasterAfterQuorumLossIgnoresCooldown(t *testing.T) {
	_, m := threeNode("b")
	m.deadAfter = 50 * time.Millisecond
	m.masterCooldown = 1 * time.Hour // would block election if applied

	// Bootstrap into no-master state by letting all peers age out.
	m.markLive("a")
	m.markLive("b")
	m.markLive("c")
	m.recomputeMaster()
	time.Sleep(80 * time.Millisecond)
	m.markLive("b")
	m.recomputeMaster()
	if m.Master() != "" {
		t.Fatalf("master=%q want empty (quorum lost)", m.Master())
	}

	// Quorum regained — incumbent is empty, election must be immediate.
	m.markLive("a")
	m.markLive("b")
	m.recomputeMaster()
	if m.Master() != "a" {
		t.Errorf("post-recovery master=%q want a (no cooldown when empty)", m.Master())
	}
}

func TestVersionObserverFiresOnHigherVersion(t *testing.T) {
	cluster := &config.ClusterConfig{Version: 2}
	m := New("a", cluster, nil)
||||
Reference in New Issue
Block a user