Executor upgrades are normally fully automatic. You request the upgrade in the UI (Upgrade an executor), the executor self-applies on its next sync, and the new version reports back. The infrastructure side has nothing to do. This page covers the cases where automation needs help: stuck rollouts, failed binary swaps, and verification mismatches.
What “self-applies” actually does
For both backends, the trigger is the same: the control plane writes a pendingUpgrade blob onto the Executor row, and the executor sees it on its next syncExecutor.
Agent
- Download the new binary into a sibling temp file under the same directory.
- Verify SHA-256 against the spec.
- Verify a minisign signature against trusted public keys baked into the running binary (not configurable at runtime).
- Hardlink the current binary as <current>.prev for the systemd OnFailure= rollback unit.
- Atomic rename() over the current binary.
- systemctl restart novacula-agent.service.
If the new binary keeps failing and exhausts the restart budget (StartLimitBurst=3 over StartLimitIntervalSec=120), systemd invokes novacula-agent-rollback.service, which moves .prev back into place and restarts the agent. No human action required.
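The verify-then-swap part of that sequence can be sketched in shell. This is an illustration under stated assumptions, not the agent's actual code: the function names and paths are hypothetical, and the minisign check and the service restart appear only as comments.

```shell
# Hypothetical sketch of "verify, keep .prev, atomically replace".

# sha256_of FILE: print the hex digest (sha256sum on Linux, shasum elsewhere).
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# swap_binary CURRENT STAGED EXPECTED_SHA256
# STAGED must sit in the same directory as CURRENT so that mv is a single
# rename(2) on one filesystem; that is what makes the replacement atomic.
swap_binary() (
  current="$1"; staged="$2"; expected="$3"

  actual="$(sha256_of "$staged")" || exit 1
  if [ "$actual" != "$expected" ]; then
    echo "sha256 mismatch: refusing swap" >&2
    exit 1
  fi

  # (signature verification would run here, before anything is replaced)

  chmod +x "$staged" || exit 1
  ln -f "$current" "$current.prev" || exit 1   # hardlink the known-good binary
  mv -f "$staged" "$current"                   # atomic replace of CURRENT
  # next step in the real flow: systemctl restart novacula-agent.service
)
```

Because .prev is a hardlink made before the rename, it keeps pointing at the old binary's inode after the new file takes over the name.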
Operator
- Probe the registry to confirm the target image tag exists and is pullable.
- Patch the operator’s own Deployment to the new image.
- A self-update monitor watches the rollout. If it sees ImagePullBackOff or CrashLoopBackOff on the new replicaset within the monitor window, it patches the image back to the prior tag.
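The monitor's rollback decision can be pictured as a small predicate. This is a sketch, not the operator's code: the function name is hypothetical, and the window length is passed in as a parameter because this page does not state its value.

```shell
# should_roll_back REASON ELAPSED_S WINDOW_S
# Returns 0 (roll back) only for the two failure reasons named above, and
# only while the monitor window is still open. WINDOW_S is an assumption.
should_roll_back() {
  case "$1" in
    ImagePullBackOff|CrashLoopBackOff) [ "$2" -le "$3" ] ;;
    *) return 1 ;;
  esac
}

# To see the waiting reason of the operator pods yourself:
#   kubectl get pods -n novacula-system \
#     -o jsonpath='{.items[*].status.containerStatuses[*].state.waiting.reason}'
```

The second argument matters as much as the first: a failure that shows up after the window closes is not rolled back automatically.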
Failure modes and recovery
Stuck pendingUpgrade (executor never picks it up)
Symptom: you requested the upgrade, the UI shows pendingUpgrade for >1 minute, no progress.
- Is the executor online? Check the executor row’s status. If it’s offline, fix that first: it can’t apply an upgrade if it can’t sync.
- Is the spec stale? pendingUpgrade blobs have a TTL (PENDING_UPGRADE_TTL_MS). If you sat on the request for too long, the executor sees it as stale and ignores it. Re-request from the UI.
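The staleness check amounts to comparing the request's age against the TTL. Here is a minimal sketch, assuming the blob carries a request timestamp in milliseconds. The function name is hypothetical, and both the TTL value and the strict comparison are assumptions, since this page specifies neither.

```shell
# is_stale REQUESTED_AT_MS NOW_MS TTL_MS
# Returns 0 when the request is older than the TTL.
is_stale() {
  [ $(( $2 - $1 )) -gt "$3" ]
}

# Example with a made-up TTL of 15 minutes (900000 ms):
#   now_ms=$(( $(date +%s) * 1000 ))
#   is_stale "$requested_at_ms" "$now_ms" 900000 && echo "re-request from the UI"
```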
Agent rolled back automatically — what now?
If the systemd rollback unit fired, the agent is back on the previous binary and online. The control plane will see the version mismatch and clear pendingUpgrade. You need to:
- Read the failing-version logs: journalctl -u novacula-agent --since '5 minutes ago' --no-pager.
- File the failure with the release notes / changelog before re-attempting.
- If you want to retry, re-issue the upgrade — the rollback didn’t disable the path forward.
Agent didn’t roll back (binary swap succeeded but won’t start)
The OnFailure= rollback only fires if the flap budget is exhausted. If the new binary refuses to start once and exits cleanly enough that systemd doesn’t count it as a failure (rare but possible), you’re stuck on the new binary.
Manual recovery:
The .prev file is left behind by every successful self-update; it’s the last known-good binary.
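A minimal sketch of that manual recovery, mirroring what the rollback unit does (move .prev back, then restart). The function name is hypothetical, and the binary path is a placeholder for wherever your install keeps the agent.

```shell
# restore_prev BINARY_PATH
# Moves BINARY_PATH.prev back over BINARY_PATH; fails if there is no .prev.
restore_prev() (
  current="$1"
  if [ ! -f "$current.prev" ]; then
    echo "no $current.prev to restore" >&2
    exit 1
  fi
  mv -f "$current.prev" "$current"
)

# Typical use, as root (the path is a placeholder):
#   systemctl stop novacula-agent.service
#   restore_prev /path/to/novacula-agent
#   systemctl start novacula-agent.service
```

After the restore there is no .prev left until the next successful self-update creates one.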
Operator stuck in ImagePullBackOff and the monitor didn’t roll back
Symptom: kubectl get pods -n novacula-system shows the new pod stuck pulling, the old pod has been removed (or scaled down), and the executor is offline.
- The monitor only rolls back during a defined window after the patch. If the failure didn’t manifest in time, the monitor exits and you take over.
- After you restore the prior image by hand, the control plane clears pendingUpgrade, and you’re back to where you started.
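One way to take over by hand is kubectl's own rollout history. The sketch below assumes the Deployment still has its previous revision; the Deployment name in the usage line is a guess and should be replaced with the real one. The KUBECTL variable exists only so the function can be dry-run with echo.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# rollback_operator NAMESPACE DEPLOYMENT
# Reverts the Deployment to its previous revision and waits for the rollout.
rollback_operator() {
  "$KUBECTL" rollout undo "deployment/$2" -n "$1" &&
  "$KUBECTL" rollout status "deployment/$2" -n "$1" --timeout=120s
}

# Usage (the Deployment name is hypothetical):
#   rollback_operator novacula-system novacula-operator
# Dry run, printing the commands instead of executing them:
#   KUBECTL=echo rollback_operator novacula-system novacula-operator
```

If the revision history is not available, patching the image back to the known prior tag with kubectl set image achieves the same result.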
SHA-256 or signature mismatch
The agent will refuse the swap and write an error. The control plane keeps pendingUpgrade until TTL. This means either:
- The release artifact was re-uploaded with a different binary (legitimate but rare — file with the platform team).
- Something tampered with the artifact in transit (treat as compromise).
When to skip automatic upgrades altogether
For tightly change-controlled environments, you can:
- Pin the operator image tag directly in values.yaml and never request UI-driven upgrades. Run helm upgrade on your own schedule.
- For the agent, package your own binary distribution and systemctl restart from your config-management tooling.
The executor still calls syncExecutor on its normal cadence; it just won’t pick up pending upgrades because none are issued. The downside is you lose the signed-rollout-with-rollback safety net the platform provides.