# Security

The trust model in one page. Read this before deciding where to put
`qu` and who can talk to it.

## What `qu` is trying to defend against

- **Eavesdropping on cluster traffic.** Defended: TLS 1.3 only,
  fingerprint-pinned per peer.
- **MITM on the cluster's inter-node link.** Defended: TLS 1.3 with
  out-of-band fingerprint verification at `qu node add`.
- **A random internet host enrolling itself as a peer.** Defended:
  pre-shared cluster secret on every `Join`.
- **A compromised peer issuing forged cluster-config mutations.** Not
  defended. A peer trusted enough to be in `cluster.yaml.peers` can
  propose mutations through the master. Treat membership as a
  privilege.
- **A compromised peer becoming master.** Election is deterministic on
  the smallest live `NodeID`, so a compromised peer can become master
  if its `NodeID` sorts first. The master can rewrite `cluster.yaml`
  arbitrarily. This is the worst-case blast radius from one compromised
  node.
- **DoS by handshake flood.** Not directly defended at the application
  layer. The TLS stack accepts anyone's handshake; rate-limiting belongs
  at the firewall — see [public-internet.md](deployment/public-internet.md).

## The three secrets on disk

| Secret | What it is | Loss impact |
| ------ | ---------- | ----------- |
| `keys/private.pem` | RSA private key, this node's identity. | Anyone with it can impersonate this node. |
| `node.yaml.cluster_secret` | Pre-shared base64 string. | Anyone with it can `Join` the cluster. |
| `trust.yaml.entries[].cert_pem` | Other peers' public certs (not secrets, but they enable mTLS). | Loss only forces re-trust. |

The first two are real secrets and live under `0600` permissions in
the data directory. Back them up; never commit them; never paste them
in chat.

## TLS handshake step by step

For every inter-node call:

1. Caller dials the peer on its `advertise` address.
2. TLS 1.3 handshake. Both sides present their self-signed leaf cert.
3. The caller's `VerifyPeerCertificate` (set in
   `internal/transport/tls.go`) computes the SPKI fingerprint of the
   server's cert and compares it against `trust.yaml`. If the caller
   knows which `NodeID` it expected, a strict verifier ensures the
   fingerprint matches *that specific* entry — not just any trusted
   peer.
4. The server's TLS layer accepts any client cert (`RequireAnyClientCert`,
   `InsecureSkipVerify: true`) because trust is enforced one layer up.
5. The RPC dispatcher reads the client's cert, computes its
   fingerprint, and looks it up in the server's `trust.yaml`. If no
   entry exists, only the `Join` method is permitted.
6. `Join` performs a constant-time comparison of the inbound
   `ClusterSecret` against `node.yaml.cluster_secret`. Mismatch →
   refusal.

So:

- An adversary who gets your **public** cert can't impersonate you.
- An adversary who gets your **fingerprint** can't impersonate you.
- An adversary who gets your **private key** *can* impersonate you to
  any peer that trusts your fingerprint.

## The TOFU step

`qu node add <host:port>` runs a one-shot insecure dial against the
target (the only place `InsecureBootstrapConfig` is used in the
codebase, see `internal/transport/tls.go:91`). It fetches the
remote's cert, prints the fingerprint, and asks for confirmation.

This is **identical** to SSH's first-connection prompt. The operator
must verify the fingerprint out of band — by running `qu status` on
the remote side, or by reading `keys/cert.pem` directly, or via a
known-good distribution channel.

If you skip verification, you trust the network at that moment. If
the network was MITM'd at exactly that moment, you trust the
attacker. After the prompt, the cert is pinned and the window closes.

## Cluster secret rotation

There is no built-in command to rotate the cluster secret. The hard
part isn't generating a new one — it's distributing it consistently
across every node. The pragmatic recipe:

1. Generate a new secret on one node and copy it to every other node.
2. Update `node.yaml.cluster_secret` on every node (manual edit).
3. Restart each daemon one at a time, verifying quorum returns
   between restarts.

Rotation only protects future `Join` calls, not anything else. If you
suspect the old secret has been seen by an adversary, also assume any
peer that was added during the leaked window is compromised, and
re-init those peers from scratch.

## Identity rotation

To roll a node's RSA keypair (e.g., the private key was on a laptop
that got stolen):

```sh
# On the compromised node:
sudo systemctl stop quptime
sudo rm -rf /etc/quptime
sudo -u quptime qu init \
  --advertise this-host.example.com:9901 \
  --secret '<existing cluster secret>'
sudo systemctl start quptime

# On a surviving healthy node:
sudo -u quptime qu node remove <old-node-id>   # evict the old identity
sudo -u quptime qu node add this-host.example.com:9901
```

The new `node_id` is a fresh UUID; the old one is gone for good. Any
historical references to it (e.g., the `updated_by` field on past
versions of `cluster.yaml`) are cosmetic.

## What the local control socket protects

`$XDG_RUNTIME_DIR/quptime/quptime.sock` (or `/var/run/quptime/...`) is
the channel the CLI uses to talk to the local daemon. It's `0600`
permissioned and authenticated solely by filesystem ACLs — no TLS, no
secrets in the protocol.

Anyone with read and write access to the socket can:

- Propose cluster mutations (relayed to the master).
- Read full cluster state, including `cluster.yaml`.
- Trigger test alerts.

So: don't put the daemon's user in a group that other unprivileged
users share. The default systemd setup with a dedicated `quptime`
user gets this right.

## Hardening checklist

- [ ] Dedicated `quptime` system user.
- [ ] Data directory owned by that user, mode `0750`.
- [ ] `keys/private.pem` mode `0600`.
- [ ] `node.yaml` mode `0600`.
- [ ] systemd unit uses `ProtectSystem=strict`, `NoNewPrivileges=true`,
      and the rest of the hardening directives in
      [systemd.md](deployment/systemd.md).
- [ ] If `:9901` is internet-reachable, firewall allow-list to peer
      IPs or use an overlay — see [public-internet.md](deployment/public-internet.md)
      and [tailscale.md](deployment/tailscale.md).
- [ ] Cluster secret generated by `qu init` (not chosen by a human),
      stored in your secret manager.
- [ ] Backups of `keys/` and `node.yaml` are encrypted at rest.