Commits
Previously, `handleJetstreamEvent` saved the time-based cursor after
every event regardless of whether `applyCommit` succeeded. That is fine
for permanently bad records (malformed JSON, schema violations) where
replaying achieves nothing, but wrong for transient infra failures
(SQLite busy, store closed during shutdown, disk full): the cursor
would advance past a perfectly good event and silently drop the
membership or repo row that backs it, with no way to recover short of
a manual replay.

`applyCommit` now distinguishes the two classes via a new
`badRecordError` wrapper. JSON decode failures in `applySpindleMember`,
`applyRepo`, and `applyRepoCollaborator` are wrapped with
`badRecord(...)` so they remain cursor-advancing. Everything else
returned from `applyCommit` is treated as transient:
`handleJetstreamEvent` logs it, returns the error to the scheduler, and
skips `SaveCursor` so the next reconnect (which already rewinds by
`jetstreamRewind`) will redeliver and retry.
LookupBuildkiteBuildByTuple sorted on created_at, an RFC3339Nano
text column. Lexical comparison of nanosecond timestamps is not
reliable: time.Format trims trailing zeros, so an instant on the
exact second renders as '...:00Z' while one nanosecond later
renders as '...:00.000000001Z' and lex-sorts before it. The
practical effect was that /logs could resolve the wrong run for
a workflow that had been triggered more than once.
Add a created_unix_ns INTEGER column to buildkite_builds, populate
it from time.Now().UnixNano() on insert, and switch the lookup to
ORDER BY created_unix_ns DESC with created_at and build_number as
deterministic tiebreakers for legacy rows that pre-date the column.

The migration path is covered: an additive ALTER widens existing
databases, and a one-shot Go-side backfill parses each row's
created_at and writes the corresponding UnixNano. Rows whose text
fails to parse are left at the default 0 so a single corrupt row
cannot wedge startup. New tests in store_migrate_test.go open a
hand-crafted pre-migration database through openStore and assert
the upgrade is correct, idempotent, and tolerant of bad data.
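The per-row conversion in the backfill could look something like this. It is a sketch only: `backfillUnixNano` is a hypothetical name, and it assumes the stored text was written with time.RFC3339Nano as described above.

```go
package main

import (
	"fmt"
	"time"
)

// backfillUnixNano is a hypothetical helper: it converts a stored
// created_at string into the value for created_unix_ns. Text that does
// not parse yields 0, matching the column default, so one corrupt row
// cannot abort the migration.
func backfillUnixNano(createdAt string) int64 {
	t, err := time.Parse(time.RFC3339Nano, createdAt)
	if err != nil {
		return 0
	}
	return t.UnixNano()
}

func main() {
	fmt.Println(backfillUnixNano("1970-01-01T00:00:01Z"))           // 1000000000
	fmt.Println(backfillUnixNano("1970-01-01T00:00:01.000000001Z")) // 1000000001
	fmt.Println(backfillUnixNano("not a timestamp"))                // 0
}
```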
Spawn previously stored cfg.Org on the buildkite_builds row, leaving
it empty when the workflow YAML didn't set tack.buildkite.org and
relying on the read path to fall back to the provider's defaultOrg.
That coupling let historical lookups drift: if defaultOrg ever
changed, log fetches and webhook joins for older builds would silently
target the wrong organisation.

Jetstream cursors are time-based and the upstream docs explicitly note
that exact-boundary replay across a disconnect is not guaranteed
gapless. Resuming from the precise saved TimeUS could therefore drop
events that straddle the reconnect window.

On every (re)connect, subtract a fixed jetstreamRewind (5s) from the
loaded cursor before handing it to ConnectAndRead, clamping at zero so
a tiny saved cursor can't go negative. The replayed events are safe to
re-apply: applyCommit dispatches only to UPSERTs and DELETEs keyed on
(did, rkey), so duplicates collapse into the same row state.
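The rewind-and-clamp step amounts to a few lines. In this sketch the `rewindCursor` name is hypothetical, and the cursor is assumed to be Unix microseconds, as the TimeUS field name suggests.

```go
package main

import (
	"fmt"
	"time"
)

// jetstreamRewind is the fixed overlap replayed on every (re)connect.
const jetstreamRewind = 5 * time.Second

// rewindCursor is a hypothetical helper: it moves a saved cursor (Unix
// microseconds) back by jetstreamRewind and clamps the result at zero.
func rewindCursor(cursorUS int64) int64 {
	rewound := cursorUS - jetstreamRewind.Microseconds()
	if rewound < 0 {
		return 0
	}
	return rewound
}

func main() {
	fmt.Println(rewindCursor(1_700_000_000_000_000)) // 1699999995000000
	fmt.Println(rewindCursor(3_000_000))             // 0 (clamped)
}
```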
The /logs and /events handlers wrote frames with conn.WriteMessage
and never set a write deadline. A client that stopped reading but
kept the TCP connection open could fill the kernel send buffer and
park the handler goroutine on a write forever, leaking the request
context, the broker subscription, and the log producer.

Add a wsWriteWait constant (10s) and call SetWriteDeadline before
every WriteMessage in the logs drain loop and in streamEvents. The
keep-alive ping and the final close frame already used WriteControl,
which takes a deadline argument directly; raise their bounds from 1s
to wsWriteWait for consistency. A stuck peer now fails the next write
within ~10s and the handler unwinds cleanly.
A workflow can override the spindle's default Buildkite organisation
via `tack.buildkite.org`, but `BuildkiteBuildRef` didn't carry the org
field. Spawn used the override for `CreateBuild` and then dropped it,
so `Logs` always recomputed org := p.defaultOrg and any cross-org
workflow's /logs request 404'd against the wrong organisation.

Extend TestBuildkiteSpawnWorkflowConfig to assert the org survives
the round-trip via LookupBuildkiteBuildByTuple.
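One way to picture the fix is a ref that records the organisation chosen at spawn time, so the read path never recomputes it. Everything in this sketch other than the `BuildkiteBuildRef` name and the existence of an org field is an assumption, including the `effectiveOrg` helper and the other struct fields.

```go
package main

import "fmt"

// BuildkiteBuildRef identifies a spawned build. Only the Org field is
// taken from the commit message; Pipeline and BuildNumber are
// hypothetical placeholders.
type BuildkiteBuildRef struct {
	Org         string
	Pipeline    string
	BuildNumber int
}

// effectiveOrg is a hypothetical helper: it picks the workflow's
// override when set and the provider default otherwise.
func effectiveOrg(override, defaultOrg string) string {
	if override != "" {
		return override
	}
	return defaultOrg
}

func main() {
	// Resolved once at spawn time and stored with the build.
	ref := BuildkiteBuildRef{
		Org:         effectiveOrg("other-org", "default-org"),
		Pipeline:    "ci",
		BuildNumber: 7,
	}

	// A later log fetch reads ref.Org instead of the current default.
	fmt.Println(ref.Org)                         // other-org
	fmt.Println(effectiveOrg("", "default-org")) // default-org
}
```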
Adds an `extraServiceConfig` option to the NixOS module that is
merged into the systemd service's `serviceConfig` after the
module's defaults. This lets operators set arbitrary `[Service]`
settings, most notably resource limits like `MemoryMax` and
`CPUQuota`, without needing to fork the module, and also lets
them override any of the defaults we set out of the box (e.g.
to relax a sandboxing knob).

Implemented as `attrsOf unspecified` merged with `//` so the
user's attrs win on conflict.