7 Commits

Author SHA1 Message Date
Axodouble ea30dbb895 Updated changelog for actual v0.1.0
Container image / image (push) Successful in 1m52s
Release / release (push) Successful in 2m1s
2026-05-15 07:37:44 +00:00
Axodouble 1e2e382867 Updated docs, readme, & changelog
Container image / image (push) Successful in 1m40s
2026-05-15 07:36:01 +00:00
Axodouble ed25e9ed68 Fix #3 by adding a cooldown to the master election process
Container image / image (push) Successful in 1m40s
2026-05-15 07:32:15 +00:00
Axodouble c55482664c Fixed install script and socket path not working on older installs; socket path now honors the runtime directory
Container image / image (push) Successful in 1m39s
2026-05-15 07:17:28 +00:00
Axodouble 3c85caabcf Fix Previously up services are alerted as going back up if the master goes down #1
Container image / image (push) Successful in 1m45s
Release / release (push) Successful in 1m44s
This gets rid of the alert on unknown -> up; unknown -> down will still alert by design.
2026-05-15 07:01:29 +00:00
Axodouble 8638ab5432 Updated formatting for discord messages
Container image / image (push) Successful in 1m45s
2026-05-15 06:55:43 +00:00
Axodouble a11b31f160 Updated changelog
Container image / image (push) Successful in 1m39s
2026-05-15 06:44:18 +00:00
12 changed files with 367 additions and 11 deletions
+26
View File
@@ -4,6 +4,31 @@ All notable changes to this project are documented here. The format
follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and
this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [v0.1.0] — 2026-05-15
### Changed
- **Master election cooldown (2 min).** A returning peer with a
lower NodeID no longer reclaims master the instant it reappears.
It must stay continuously live for `DefaultMasterCooldown`
(2 minutes) before displacing the incumbent. Bootstrap and
quorum-regained-from-empty still elect immediately; the cooldown
only protects an active incumbent. Fixes #3: a self-monitoring
master (TCP check on its own `:9901`) would otherwise flap the
role in lock-step with its own restart.
### Fixed
- #1: previously-up services were alerted as going back up when the master went down.
Ignore `unknown` -> `up` transitions during master election; still
alert on `unknown` -> `down` by design.
## [v0.0.2] — 2026-05-15
### Fixed
- The text template field in the TUI did not support newlines, so multi-line templates rendered as a single line and lost their formatting. Fixed by changing the field into a textarea and escaping the `enter` key so it inserts newlines.
## [v0.0.1] — 2026-05-15
Initial public release.
@@ -87,3 +112,4 @@ Initial public release.
Planned for a future release.
[v0.0.1]: https://git.cer.sh/axodouble/quptime/releases/tag/v0.0.1
[v0.1.0]: https://git.cer.sh/axodouble/quptime/releases/tag/v0.1.0
+5 -1
View File
@@ -94,7 +94,11 @@ the hysteresis that absorbs network blips.
Master election is deterministic: among the live members of the quorum,
the node with the lexicographically smallest NodeID wins. No
-negotiation, no split-brain window.
negotiation, no split-brain window. A 2-minute **master cooldown**
keeps the current master in place until a returning lower-NodeID peer
has been continuously live for the full window, so a self-monitoring
master that briefly drops doesn't flap the role back the instant it
reappears.
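For illustration, the election rule condenses to a few lines of Go (a sketch only — `electMaster` and its package are invented names; the real selection, with the cooldown layered on top, lives in `recomputeMaster` in `internal/quorum/manager.go`):

```go
package quorumsketch

import "sort"

// electMaster sketches the deterministic rule: among the live quorum
// members, the lexicographically smallest NodeID wins outright.
func electMaster(live []string) string {
	if len(live) == 0 {
		return "" // quorum lost: no master
	}
	sort.Strings(live)
	return live[0]
}
```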
`cluster.yaml` is the single replicated source of truth (peers, checks,
alerts). Mutations from the CLI route through the master, which bumps a
+29
View File
@@ -118,6 +118,35 @@ The `term` integer in `qu status` is bumped every time the elected
master changes (including transitions to and from "no master"). Use it
to spot flappy clusters.
### Master cooldown
The bare "lowest-live-NodeID wins" rule has one unpleasant edge: if the
primary master is also being monitored by `qu` itself (a TCP check on
its own `:9901`, say), a brief restart causes a master flap *and* a
state flap in lock-step. The new master sees the old master come back
on the next tick and immediately hands the role back, taking the
just-recovering node from `unknown` to `up` with no quiet period.
To absorb that, the quorum manager applies a **master cooldown**
(`DefaultMasterCooldown`, 2 minutes) before a peer with a lower NodeID
may displace the incumbent. The rules:
- The cooldown timer starts on the **first heartbeat after a
dead-after gap** — i.e. when a peer re-enters the live set after
having aged out. Continuous heartbeats never restart it.
- A flap during the cooldown resets the timer; the returning peer
must clear a full fresh window before taking over.
- The cooldown applies **only when an incumbent master exists**.
Bootstrap and quorum-regained-from-empty elect the lowest-NodeID
live peer immediately, because there is no role to protect.
- If the incumbent drops out of the live set, the cooldown is
irrelevant — any live peer may take over without waiting.
The constant lives in `internal/quorum/manager.go`. Lower it for
faster fail-back at the cost of self-monitoring flap risk; raise it
to give a recovering master longer to settle before reclaiming the
role.
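Condensed into code, the rules above read roughly as follows (a sketch under invented names — `pickMaster` is not the real API; the production logic is `recomputeMaster` in `internal/quorum/manager.go`):

```go
package quorumsketch

import (
	"slices"
	"time"
)

// pickMaster restates the cooldown rules. live is sorted ascending by
// NodeID; liveSince records the start of each peer's current liveness
// streak (reset whenever a peer returns after a dead-after gap).
func pickMaster(incumbent string, live []string,
	liveSince map[string]time.Time, now time.Time,
	cooldown time.Duration) string {
	if len(live) == 0 {
		return "" // quorum lost
	}
	if incumbent == "" {
		return live[0] // bootstrap / regained-from-empty: elect immediately
	}
	if !slices.Contains(live, incumbent) {
		return live[0] // incumbent dead: nothing to protect
	}
	for _, id := range live {
		if id >= incumbent {
			break // sorted ascending — nobody lower left
		}
		if since, ok := liveSince[id]; ok && now.Sub(since) >= cooldown {
			return id // lower peer has cleared a full window
		}
	}
	return incumbent
}
```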
## Catch-up when a node reconnects
This is the scenario most people ask about: node C is offline, the
+1
View File
@@ -183,6 +183,7 @@ Options:
| `quorum` | `true` | `false` — no mutations, no alerts. |
| `master` | a NodeID | `(none — ...)` — quorum lost or election in flight. |
| `term` | slow growth | rapid growth → master flapping, network unstable. |
| `master` after a restart of the primary | unchanged for ~2 min, then bumps back | bumps back immediately → cooldown disabled or misconfigured. |
| `config ver` | identical across nodes | divergence → a node is stuck pulling. |
A simple cron sentinel on each node:
+19
View File
@@ -35,6 +35,25 @@ flapping. Causes:
- Heartbeat timeouts (default 4s) are too tight for your inter-node
link. Rebuild with a higher `DefaultDeadAfter` if you need it.
## Primary master came back but the cluster hasn't switched to it
**What it means.** Working as designed. After a peer with a lower
NodeID rejoins, the quorum manager waits
`DefaultMasterCooldown` (2 minutes) before letting it displace the
incumbent. The window prevents a self-monitoring master from flapping
the role in lock-step with its own restart.
How to confirm:
- `qu status` on every node shows the same (current) master and a
steady `term` — not flapping. The lower-NodeID peer is in the live
set but not yet master.
- After ~2 minutes of continuous liveness, `term` bumps once and the
master switches to the lower-NodeID peer.
If you need a different window, change `DefaultMasterCooldown` in
`internal/quorum/manager.go` and rebuild.
## A check is stuck in `unknown`
**What it means.** The aggregator has no fresh reports for that check.
+23 -1
View File
@@ -175,6 +175,21 @@ fi
install -d -o "$SERVICE_USER" -g "$SERVICE_GROUP" -m 0750 "$DATA_DIR"
# Reassert ownership on the dir's contents. Two cases this catches:
# - re-running the installer over a previous install where the
# service user/group changed
# - the operator ran `qu init` or `qu serve` as root once (easy
# mistake: `sudo qu init` is shorter than the documented
# `sudo -u quptime qu init`). When the daemon runs as root its
# DataDir() resolves to /etc/quptime, so any files it writes land
# here owned by root:root mode 0600 — the systemd service then
# fails with `open node.yaml: permission denied`.
# chown -R only changes ownership, not perms, so file modes set by
# the daemon (0600 for node.yaml, 0700 for keys/) are preserved.
if [ -n "$(ls -A "$DATA_DIR" 2>/dev/null)" ]; then
chown -R "$SERVICE_USER:$SERVICE_GROUP" "$DATA_DIR"
fi
echo "> writing $SERVICE_FILE"
cat > "$SERVICE_FILE" <<'EOF'
[Unit]
@@ -252,11 +267,18 @@ Next steps:
# On follower nodes, also set the shared join secret:
# Environment=QUPTIME_CLUSTER_SECRET=<paste from first node>
-b) Or run \`qu init\` once explicitly:
b) Or run \`qu init\` once explicitly. IMPORTANT: run as the
${SERVICE_USER} user, not root — otherwise node.yaml lands
owned by root and the service can't read it on start.
sudo -u ${SERVICE_USER} QUPTIME_DIR=${DATA_DIR} \\
qu init --advertise <this-host>:9901
If you already ran it as root and the service is failing
with "permission denied" on node.yaml, repair with:
sudo chown -R ${SERVICE_USER}:${SERVICE_GROUP} ${DATA_DIR}
2. Start the service:
sudo systemctl start ${SERVICE_NAME}
+9 -1
View File
@@ -25,12 +25,20 @@ type discordPayload struct {
}
// sendDiscord posts msg.Subject + body to the configured webhook URL.
// When the alert has a custom BodyTemplate, the rendered body is shipped
// verbatim — the operator has opted out of the default subject header
// and code-block wrapping in favour of their own formatting.
func sendDiscord(a *config.Alert, msg Message) error {
if a.DiscordWebhook == "" {
return errors.New("discord webhook url not set")
}
-content := msg.Subject + "\n```\n" + msg.Body + "\n```"
var content string
if a.BodyTemplate != "" {
content = msg.Body
} else {
content = msg.Subject + "\n```\n" + msg.Body + "\n```"
}
raw, err := json.Marshal(discordPayload{Content: content})
if err != nil {
return err
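For a concrete sense of the two content shapes, a standalone sketch (the subject and body strings are invented; the payload struct is inlined as a map):

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	subject, body := "check web is DOWN", "tcp connect: connection refused"

	// Default formatting: subject header plus code-block wrapping.
	withWrap := subject + "\n```\n" + body + "\n```"
	// With a custom BodyTemplate, the rendered body ships verbatim.
	verbatim := body

	for _, content := range []string{withWrap, verbatim} {
		raw, _ := json.Marshal(map[string]string{"content": content})
		fmt.Println(string(raw))
	}
}
```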
+20 -1
View File
@@ -27,7 +27,7 @@ func New(cluster *config.ClusterConfig, selfID string, logger *log.Logger) *Dispatcher {
// OnTransition is wired as checks.TransitionFn.
func (d *Dispatcher) OnTransition(check *config.Check, from, to checks.State, snap checks.Snapshot) {
-if to == checks.StateUnknown {
if !shouldAlert(from, to) {
return
}
alerts := d.cluster.EffectiveAlertsFor(check)
@@ -77,6 +77,25 @@ func (d *Dispatcher) Test(alertID string) error {
return d.dispatchOne(alert, msg)
}
// shouldAlert decides whether a committed state transition warrants
// firing the configured alert channels.
//
// A fresh master's aggregator starts every check at StateUnknown, so
// the first successful evaluation always commits Unknown→Up. Without
// filtering, every master failover (or daemon restart) would spam an
// "is now UP" alert for every healthy check. We treat Unknown→Up as a
// silent cold start; real recoveries (Down→Up) and any transition to
// Down still alert.
func shouldAlert(from, to checks.State) bool {
if to == checks.StateUnknown {
return false
}
if from == checks.StateUnknown && to == checks.StateUp {
return false
}
return true
}
func (d *Dispatcher) dispatchOne(a *config.Alert, msg Message) error {
switch a.Type {
case config.AlertSMTP:
+30
View File
@@ -0,0 +1,30 @@
package alerts
import (
"testing"
"git.cer.sh/axodouble/quptime/internal/checks"
)
func TestShouldAlertFiltersColdStartUp(t *testing.T) {
cases := []struct {
name string
from checks.State
to checks.State
want bool
}{
{"cold start to up (master failover / daemon restart)", checks.StateUnknown, checks.StateUp, false},
{"cold start to down still alerts", checks.StateUnknown, checks.StateDown, true},
{"real recovery alerts", checks.StateDown, checks.StateUp, true},
{"regression alerts", checks.StateUp, checks.StateDown, true},
{"stale (up to unknown) suppressed", checks.StateUp, checks.StateUnknown, false},
{"stale (down to unknown) suppressed", checks.StateDown, checks.StateUnknown, false},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
if got := shouldAlert(c.from, c.to); got != c.want {
t.Errorf("shouldAlert(%s→%s) = %v, want %v", c.from, c.to, got, c.want)
}
})
}
}
+25
View File
@@ -16,6 +16,7 @@ import (
"errors"
"os"
"path/filepath"
"strings"
)
// Default file names. Callers should always go through DataDir() so an
@@ -55,10 +56,34 @@ func DataDir() string {
}
// SocketPath returns the unix socket used for local CLI ↔ daemon control.
//
// Resolution order:
// 1. $QUPTIME_SOCKET — explicit operator override
// 2. $RUNTIME_DIRECTORY — set by systemd when the unit declares
// RuntimeDirectory=quptime. This is the path that matters in
// practice: with User=quptime + PrivateTmp=true, the daemon's
// /tmp is namespaced and invisible to the root CLI shell, so a
// /tmp fallback yields "no such file" even though the daemon is
// happily listening. Anchoring on $RUNTIME_DIRECTORY puts the
// socket at /run/quptime/quptime.sock, which is the same inode
// the root-CLI default (/var/run/quptime/…) reaches via the
// /var/run → /run symlink.
// 3. /var/run/quptime/… when euid is 0 (CLI side, packaged installs)
// 4. $XDG_RUNTIME_DIR/quptime/… for user-mode installs
// 5. /tmp/quptime-<user>/… as a last resort
func SocketPath() string {
if v := os.Getenv("QUPTIME_SOCKET"); v != "" {
return v
}
if v := os.Getenv("RUNTIME_DIRECTORY"); v != "" {
// systemd may pass multiple colon-separated entries when more
// than one RuntimeDirectory= is declared. Ours is single, but
// be defensive in case a future unit adds more.
if i := strings.IndexByte(v, ':'); i >= 0 {
v = v[:i]
}
return filepath.Join(v, SocketName)
}
if os.Geteuid() == 0 {
return "/var/run/quptime/" + SocketName
}
+59 -7
View File
@@ -34,6 +34,12 @@ import (
const (
DefaultHeartbeatInterval = 1 * time.Second
DefaultDeadAfter = 4 * time.Second
// DefaultMasterCooldown is the grace period a returning peer must
// stay continuously live before it's allowed to displace the
// currently-elected master. Without it, a self-monitoring master
// that briefly drops would reclaim the role immediately on return
// and disrupt anything watching its TCP port.
DefaultMasterCooldown = 2 * time.Minute
)
// VersionObserver is invoked whenever a heartbeat exchange reveals
@@ -50,12 +56,14 @@ type Manager struct {
heartbeatInterval time.Duration
deadAfter time.Duration
masterCooldown time.Duration
-mu sync.RWMutex
-term uint64
-masterID string
-lastSeen map[string]time.Time // peerID -> last contact (sent or recv)
-addrOf map[string]string // peerID -> advertise addr (last known)
mu sync.RWMutex
term uint64
masterID string
lastSeen map[string]time.Time // peerID -> last contact (sent or recv)
liveSince map[string]time.Time // peerID -> start of current liveness streak
addrOf map[string]string // peerID -> advertise addr (last known)
observer VersionObserver
}
@@ -70,7 +78,9 @@ func New(selfID string, cluster *config.ClusterConfig, client *transport.Client) *Manager {
client: client,
heartbeatInterval: DefaultHeartbeatInterval,
deadAfter: DefaultDeadAfter,
masterCooldown: DefaultMasterCooldown,
lastSeen: map[string]time.Time{},
liveSince: map[string]time.Time{},
addrOf: map[string]string{},
}
}
@@ -242,7 +252,15 @@ func (m *Manager) tick(ctx context.Context) {
func (m *Manager) markLive(id string) {
m.mu.Lock()
-m.lastSeen[id] = time.Now()
now := time.Now()
prev, ok := m.lastSeen[id]
// A peer entering its first liveness streak — or returning after
// the dead-after window expired — resets liveSince. Subsequent
// heartbeats within the streak leave it untouched.
if !ok || now.Sub(prev) > m.deadAfter {
m.liveSince[id] = now
}
m.lastSeen[id] = now
m.mu.Unlock()
}
@@ -276,7 +294,41 @@ func (m *Manager) recomputeMaster() {
var newMaster string
if len(live) >= quorum && len(live) > 0 {
-newMaster = live[0] // lowest NodeID wins
// Without an incumbent the cluster is bootstrapping or
// has just regained quorum, so elect immediately — there's
// nothing to protect from a handoff.
if m.masterID == "" {
newMaster = live[0]
} else {
newMaster = m.masterID
now := time.Now()
incumbentLive := false
for _, id := range live {
if id == m.masterID {
incumbentLive = true
break
}
}
// If the incumbent is no longer live, any live peer
// may take over without waiting.
if !incumbentLive {
newMaster = live[0]
} else {
// Incumbent is live. A peer with a lower NodeID
// may only displace it after it has stayed
// continuously live for masterCooldown.
for _, id := range live {
if id >= m.masterID {
break // sorted ascending — nobody lower left
}
since, ok := m.liveSince[id]
if ok && now.Sub(since) >= m.masterCooldown {
newMaster = id
break
}
}
}
}
}
if newMaster != m.masterID {
m.term++
+121
View File
@@ -119,6 +119,127 @@ func TestDeadAfterEvictsStaleLiveness(t *testing.T) {
}
}
// heartbeatLoop simulates the production heartbeat cadence — calling
// markLive for the given peers more frequently than deadAfter, so a
// peer that's "live throughout" never has its liveSince reset by the
// dead-after gap heuristic. It returns once dur has elapsed.
func heartbeatLoop(t *testing.T, m *Manager, dur time.Duration, peers ...string) {
t.Helper()
deadline := time.Now().Add(dur)
interval := m.deadAfter / 4
if interval < time.Millisecond {
interval = time.Millisecond
}
for time.Now().Before(deadline) {
for _, p := range peers {
m.markLive(p)
}
m.recomputeMaster()
time.Sleep(interval)
}
}
func TestReturningLowerIDWaitsForCooldown(t *testing.T) {
_, m := threeNode("b")
m.deadAfter = 80 * time.Millisecond
m.masterCooldown = 200 * time.Millisecond
// Bootstrap: all three live, "a" elected.
m.markLive("a")
m.markLive("b")
m.markLive("c")
m.recomputeMaster()
if m.Master() != "a" {
t.Fatalf("initial master=%q want a", m.Master())
}
// "a" drops — only b/c heartbeat. Long enough to age a out and let
// b take over.
heartbeatLoop(t, m, 120*time.Millisecond, "b", "c")
if m.Master() != "b" {
t.Fatalf("after a-drop master=%q want b", m.Master())
}
// "a" returns. Verify b stays master for less than the cooldown.
heartbeatLoop(t, m, 120*time.Millisecond, "a", "b", "c")
if m.Master() != "b" {
t.Errorf("mid-cooldown master=%q want b", m.Master())
}
// Past the cooldown, a reclaims master.
heartbeatLoop(t, m, 120*time.Millisecond, "a", "b", "c")
if m.Master() != "a" {
t.Errorf("after cooldown master=%q want a", m.Master())
}
}
func TestCooldownResetsOnFlap(t *testing.T) {
_, m := threeNode("b")
m.deadAfter = 80 * time.Millisecond
m.masterCooldown = 200 * time.Millisecond
m.markLive("a")
m.markLive("b")
m.markLive("c")
m.recomputeMaster()
// a drops, b becomes master.
heartbeatLoop(t, m, 120*time.Millisecond, "b", "c")
if m.Master() != "b" {
t.Fatalf("master=%q want b", m.Master())
}
// a returns briefly, then drops again before cooldown elapses.
heartbeatLoop(t, m, 100*time.Millisecond, "a", "b", "c")
if m.Master() != "b" {
t.Fatalf("during first cooldown master=%q want b", m.Master())
}
heartbeatLoop(t, m, 120*time.Millisecond, "b", "c") // a ages out again
if m.Master() != "b" {
t.Fatalf("after a-reflap master=%q want b", m.Master())
}
// a returns for the second time — cooldown restarts here.
// Wait less than a full cooldown — b should still be master.
heartbeatLoop(t, m, 100*time.Millisecond, "a", "b", "c")
if m.Master() != "b" {
t.Errorf("partway through fresh cooldown master=%q want b", m.Master())
}
// Past the full fresh cooldown, a takes over.
heartbeatLoop(t, m, 150*time.Millisecond, "a", "b", "c")
if m.Master() != "a" {
t.Errorf("after fresh cooldown master=%q want a", m.Master())
}
}
func TestNewMasterAfterQuorumLossIgnoresCooldown(t *testing.T) {
_, m := threeNode("b")
m.deadAfter = 50 * time.Millisecond
m.masterCooldown = 1 * time.Hour // would block election if applied
// Bootstrap into no-master state by letting all peers age out.
m.markLive("a")
m.markLive("b")
m.markLive("c")
m.recomputeMaster()
time.Sleep(80 * time.Millisecond)
m.markLive("b")
m.recomputeMaster()
if m.Master() != "" {
t.Fatalf("master=%q want empty (quorum lost)", m.Master())
}
// Quorum regained — incumbent is empty, election must be immediate.
m.markLive("a")
m.markLive("b")
m.recomputeMaster()
if m.Master() != "a" {
t.Errorf("post-recovery master=%q want a (no cooldown when empty)", m.Master())
}
}
func TestVersionObserverFiresOnHigherVersion(t *testing.T) {
cluster := &config.ClusterConfig{Version: 2}
m := New("a", cluster, nil)