Commits
The consumer code in jetstream.go mixed Tack-specific concerns
(collections, applyCommit, store cursor) with generic firehose
mechanics: configuring the upstream client, looping reconnects,
rewinding time-based cursors, persisting cursor progress, and
distinguishing permanent bad-record failures from transient
handler failures. None of those are specific to Tack, and they are
the parts most likely to be reused by future jetstream consumers.
Move the generic mechanics into a new internal/jetstream package.
Previously, observing a `sh.tangled.repo` record whose spindle field
matched our hostname was enough to enroll a new knot subscription, no
matter who published the record.
`reconcileKnot` now consults a new `IsAuthorizedActor` helper before
calling `AddKnot`: the repo's publisher DID must be the spindle owner
or have been vouched for by a `sh.tangled.spindle.member` record
whose own publisher is the owner.
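
The rule can be sketched as a pure function. This is only an illustration of the check described above, not the actual `IsAuthorizedActor` helper: the `memberRecord` type, the lowercase `isAuthorizedActor` name, and the in-memory slice standing in for the store are all assumptions.

```go
package main

import "fmt"

// memberRecord is a hypothetical in-memory view of a
// sh.tangled.spindle.member record.
type memberRecord struct {
	Publisher string // DID that published the member record
	Subject   string // DID being vouched for
}

// isAuthorizedActor reports whether actor is the spindle owner or is the
// subject of a member record that the owner itself published.
func isAuthorizedActor(owner, actor string, members []memberRecord) bool {
	if actor == owner {
		return true
	}
	for _, m := range members {
		// A member record only counts when its own publisher is the owner.
		if m.Publisher == owner && m.Subject == actor {
			return true
		}
	}
	return false
}

func main() {
	members := []memberRecord{
		{Publisher: "did:plc:owner", Subject: "did:plc:alice"},
		{Publisher: "did:plc:mallory", Subject: "did:plc:mallory"}, // self-vouched, ignored
	}
	fmt.Println(isAuthorizedActor("did:plc:owner", "did:plc:owner", members))
	fmt.Println(isAuthorizedActor("did:plc:owner", "did:plc:alice", members))
	fmt.Println(isAuthorizedActor("did:plc:owner", "did:plc:mallory", members))
}
```
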
VerifySignature previously accepted any cryptographically valid
"timestamp=<unix>,signature=<hex>" header regardless of how old the
timestamp was. An attacker who captured a single signed delivery
could replay it indefinitely, creating duplicate status events and
unbounded growth in the events table.
Reject signatures whose timestamp is more than MaxSignatureAge (5
minutes) from the local clock in either direction. The symmetric
bound also defeats implausibly future-dated stamps that would
otherwise mint a long replay window. The clock is read through a
package-level timeNow var so tests can pin it deterministically; the
existing fixed-timestamp test now stubs the clock and a new stale
case covers the rejection path.
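
A minimal sketch of the symmetric freshness check, assuming the header's timestamp has already been parsed into Unix seconds. `MaxSignatureAge` and `timeNow` are named in the text above; `checkFreshness`, `errStaleSignature`, and `pinClock` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// MaxSignatureAge is the 5-minute bound described above.
const MaxSignatureAge = 5 * time.Minute

// timeNow is a package-level indirection so tests can pin the clock.
var timeNow = time.Now

// errStaleSignature is a hypothetical sentinel error for this sketch.
var errStaleSignature = errors.New("signature timestamp outside allowed window")

// checkFreshness rejects timestamps that are more than MaxSignatureAge
// away from the local clock, whether in the past or in the future.
func checkFreshness(unixSeconds int64) error {
	skew := timeNow().Sub(time.Unix(unixSeconds, 0))
	if skew < 0 {
		skew = -skew
	}
	if skew > MaxSignatureAge {
		return errStaleSignature
	}
	return nil
}

// pinClock is a hypothetical test helper that freezes timeNow.
func pinClock(unixSeconds int64) {
	timeNow = func() time.Time { return time.Unix(unixSeconds, 0) }
}

func main() {
	pinClock(1_700_000_000)
	fmt.Println(checkFreshness(1_700_000_000))       // same instant
	fmt.Println(checkFreshness(1_700_000_000 - 600)) // 10 minutes old
	fmt.Println(checkFreshness(1_700_000_000 + 600)) // 10 minutes ahead
}
```

Taking the absolute value of the skew is what makes the bound symmetric: a stamp dated ten minutes ahead is rejected for the same reason as one that is ten minutes old.
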
Previously `knot.go` executed every `sh.tangled.pipeline` event the
moment it arrived, ignoring the `spindle_members` and `repos`
tables that `jetstream.go` has been mirroring from the AT Proto
firehose.
The knot consumer now consults `store.AuthorizePipelineActor`
before dispatching a trigger. The check has two gates: the
trigger's repo must have published a `sh.tangled.repo` record
naming us as its spindle on the knot the event arrived from, and
the publisher of that repo record must be either the spindle
owner or a subject the owner vouched for via
`sh.tangled.spindle.member`.
Buildkite webhooks can land before Spawn's goroutine has persisted
the mapping from build UUID to (knot, pipeline_rkey, workflow). The
window is small but real: CreateBuild returns, Buildkite fires
build.scheduled, the webhook handler runs LookupBuildkiteBuildByUUID
and gets nothing, and the event is dropped on the floor forever.
HandleWebhook now reconstructs the ref from the build's meta_data
when the lookup misses. We already attach tack:knot, tack:pipeline_rkey,
and tack:workflow at CreateBuild time, so a Buildkite-originated
webhook for one of our builds always carries enough information to
recover the tuple. Org and pipeline slug come from the payload's
top-level organization and embedded build.pipeline objects.
The reconstructed ref is opportunistically inserted via the existing
ON CONFLICT DO UPDATE, so subsequent webhooks and any /logs request
hit the cache instead of redoing the meta_data dance. If the
authoritative Spawn-side insert lands afterwards it just refreshes
the row.
Builds without our tack:* meta_data still no-op, preserving the
'foreign build sharing this webhook URL' behavior. WebhookPayload
gains an Organization field so the org slug is available without
poking at Build.Pipeline.
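
The fallback can be sketched as follows, assuming the build's meta_data arrives as a flat string map. The three `tack:*` keys come from the description above; the `buildRef` type and `refFromMetaData` function are hypothetical stand-ins for whatever the real handler uses.

```go
package main

import "fmt"

// buildRef is a hypothetical stand-in for the (knot, pipeline_rkey,
// workflow) tuple that Spawn normally persists.
type buildRef struct {
	Knot         string
	PipelineRkey string
	Workflow     string
}

// refFromMetaData rebuilds the tuple from the tack:* keys attached at
// CreateBuild time. It returns false when any key is missing, which is
// the case for foreign builds that share the webhook URL.
func refFromMetaData(meta map[string]string) (buildRef, bool) {
	ref := buildRef{
		Knot:         meta["tack:knot"],
		PipelineRkey: meta["tack:pipeline_rkey"],
		Workflow:     meta["tack:workflow"],
	}
	if ref.Knot == "" || ref.PipelineRkey == "" || ref.Workflow == "" {
		return buildRef{}, false
	}
	return ref, true
}

func main() {
	ours := map[string]string{
		"tack:knot":          "knot.example.com",
		"tack:pipeline_rkey": "3kabc",
		"tack:workflow":      "ci.yml",
	}
	fmt.Println(refFromMetaData(ours))
	fmt.Println(refFromMetaData(map[string]string{"other": "value"}))
}
```
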
Previously, `handleJetstreamEvent` saved the time-based cursor after
every event regardless of whether `applyCommit` succeeded. That is fine
for permanently bad records (malformed JSON, schema violations) where
replaying achieves nothing, but wrong for transient infra failures
(SQLite busy, store closed during shutdown, disk full): the cursor
would advance past a perfectly good event and silently drop the
membership or repo row that backs it, with no way to recover short of
a manual replay.
`applyCommit` now distinguishes the two classes via a new
`badRecordError` wrapper. JSON decode failures in `applySpindleMember`,
`applyRepo`, and `applyRepoCollaborator` are wrapped with
`badRecord(...)` so they remain cursor-advancing. Everything else
returned from `applyCommit` is treated as transient:
`handleJetstreamEvent` logs it, returns the error to the scheduler, and
skips `SaveCursor` so the next reconnect (which already rewinds by
`jetstreamRewind`) will redeliver and retry.
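
A sketch of the classification, assuming the wrapper is a pointer type that implements `Unwrap`. `badRecordError` and `badRecord` are named above; `shouldAdvanceCursor`, `applyRecord`, and `errBusy` are hypothetical and exist only to make the example self-contained.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// badRecordError marks a failure that replaying the event cannot fix.
type badRecordError struct{ err error }

func (e *badRecordError) Error() string { return "bad record: " + e.err.Error() }
func (e *badRecordError) Unwrap() error { return e.err }

// badRecord wraps err so callers can classify it with errors.As.
func badRecord(err error) error { return &badRecordError{err: err} }

// errBusy is a hypothetical transient store failure.
var errBusy = errors.New("database is locked")

// shouldAdvanceCursor reports whether the cursor may be saved: on
// success, or when the failure is a permanently bad record.
func shouldAdvanceCursor(err error) bool {
	if err == nil {
		return true
	}
	var bad *badRecordError
	return errors.As(err, &bad)
}

// applyRecord is a stand-in for applyCommit: decode failures are wrapped
// as bad records, while storeErr models a failure from the store.
func applyRecord(raw []byte, storeErr error) error {
	var rec struct {
		Subject string `json:"subject"`
	}
	if err := json.Unmarshal(raw, &rec); err != nil {
		return badRecord(err)
	}
	return storeErr
}

func main() {
	fmt.Println(shouldAdvanceCursor(applyRecord([]byte(`{"subject":"did:plc:x"}`), nil)))
	fmt.Println(shouldAdvanceCursor(applyRecord([]byte(`{not json`), nil)))
	fmt.Println(shouldAdvanceCursor(applyRecord([]byte(`{}`), errBusy)))
}
```

Treating every unwrapped error as transient is the conservative default: a failure class nobody anticipated results in a retry rather than a silently skipped event.
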
LookupBuildkiteBuildByTuple sorted on created_at, an RFC3339Nano
text column. Lexical comparison of nanosecond timestamps is not
reliable: the RFC3339Nano layout trims trailing zeros, so an instant
on the exact second renders as '...:00Z' while one nanosecond later
renders as '...:00.000000001Z' and lex-sorts before it, because '.'
precedes 'Z' in byte order. The
practical effect was that /logs could resolve the wrong run for
a workflow that had been triggered more than once.
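
The ordering inversion is easy to reproduce with the standard library alone. The date below is arbitrary and `formatPair` is a hypothetical helper for the demonstration.

```go
package main

import (
	"fmt"
	"time"
)

// formatPair renders an instant on the exact second and the instant one
// nanosecond later, both with the RFC3339Nano layout.
func formatPair() (earlier, later string) {
	t0 := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	t1 := t0.Add(time.Nanosecond)
	return t0.Format(time.RFC3339Nano), t1.Format(time.RFC3339Nano)
}

func main() {
	earlier, later := formatPair()
	fmt.Println(earlier) // 2024-01-01T12:00:00Z
	fmt.Println(later)   // 2024-01-01T12:00:00.000000001Z
	// String comparison puts the later instant first.
	fmt.Println(later < earlier) // true
}
```
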
Add a created_unix_ns INTEGER column to buildkite_builds, populate
it from time.Now().UnixNano() on insert, and switch the lookup to
ORDER BY created_unix_ns DESC with created_at and build_number as
deterministic tiebreakers for legacy rows that pre-date the column.
The migration path is covered: an additive ALTER widens existing
databases, and a one-shot Go-side backfill parses each row's
created_at and writes the corresponding UnixNano. Rows whose text
fails to parse are left at the default 0 so a single corrupt row
cannot wedge startup. New tests in store_migrate_test.go open a
hand-crafted pre-migration database through openStore and assert
the upgrade is correct, idempotent, and tolerant of bad data.
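
The per-row conversion in the backfill might look like the sketch below. `createdUnixNS` is a hypothetical name, and the surrounding SQL (the ALTER and the row iteration) is omitted.

```go
package main

import (
	"fmt"
	"time"
)

// createdUnixNS converts a stored created_at string to nanoseconds since
// the Unix epoch. Text that fails to parse maps to 0, the column default,
// so one corrupt row cannot abort the migration.
func createdUnixNS(createdAt string) int64 {
	t, err := time.Parse(time.RFC3339Nano, createdAt)
	if err != nil {
		return 0
	}
	return t.UnixNano()
}

func main() {
	a := createdUnixNS("2024-01-01T12:00:00Z")
	b := createdUnixNS("2024-01-01T12:00:00.000000001Z")
	fmt.Println(b-a == 1)                    // integer order matches time order
	fmt.Println(createdUnixNS("not a time")) // corrupt text falls back to the default
}
```
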
Spawn previously stored cfg.Org on the buildkite_builds row, leaving
it empty when the workflow YAML didn't set tack.buildkite.org and
relying on the read path to fall back to the provider's defaultOrg.
That coupling let historical lookups drift: if defaultOrg ever
changed, log fetches and webhook joins for older builds would silently
target the wrong organisation.
Jetstream cursors are time-based and the upstream docs explicitly note
that exact-boundary replay across a disconnect is not guaranteed
gapless. Resuming from the precise saved TimeUS could therefore drop
events that straddle the reconnect window.
On every (re)connect, subtract a fixed jetstreamRewind (5s) from the
loaded cursor before handing it to ConnectAndRead, clamping at zero so
a tiny saved cursor can't go negative. The replayed events are safe to
re-apply: applyCommit dispatches only to UPSERTs and DELETEs keyed on
(did, rkey), so duplicates collapse into the same row state.
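
The rewind itself is a clamped subtraction. A sketch, assuming the cursor is stored as Unix microseconds (as the TimeUS name suggests); `rewindCursor` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"time"
)

// jetstreamRewind is the fixed 5s rewind described above.
const jetstreamRewind = 5 * time.Second

// rewindCursor subtracts jetstreamRewind from a saved cursor expressed in
// Unix microseconds, clamping at zero so the result is never negative.
func rewindCursor(savedTimeUS int64) int64 {
	rewound := savedTimeUS - jetstreamRewind.Microseconds()
	if rewound < 0 {
		return 0
	}
	return rewound
}

func main() {
	fmt.Println(rewindCursor(1_700_000_000_000_000)) // ordinary cursor, moved back 5s
	fmt.Println(rewindCursor(1_000))                 // tiny cursor clamps to 0
}
```
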
The /logs and /events handlers wrote frames with conn.WriteMessage
and never set a write deadline. A client that stopped reading but
kept the TCP connection open could fill the kernel send buffer and
park the handler goroutine on a write forever, leaking the request
context, the broker subscription, and the log producer.
Add a wsWriteWait constant (10s) and call SetWriteDeadline before
every WriteMessage in the logs drain loop and in streamEvents. The
keep-alive ping and the final close frame already used WriteControl,
which takes a deadline argument directly; raise their bounds from 1s
to wsWriteWait for consistency. A stuck peer now fails the next write
within ~10s and the handler unwinds cleanly.
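
The pattern can be sketched against a small interface instead of a real connection. The `frameWriter` method set is assumed to match the WebSocket connection type used by the handlers; `writeFrame`, `textMessage`, and `recordingConn` are hypothetical and exist only so the example runs without a network.

```go
package main

import (
	"fmt"
	"time"
)

// wsWriteWait bounds every frame write, as described above.
const wsWriteWait = 10 * time.Second

// textMessage is a placeholder frame type for this sketch.
const textMessage = 1

// frameWriter is the subset of a WebSocket connection the sketch needs.
type frameWriter interface {
	SetWriteDeadline(t time.Time) error
	WriteMessage(messageType int, data []byte) error
}

// writeFrame arms a fresh deadline and then writes a single frame, so a
// peer that has stopped reading fails the write instead of blocking it.
func writeFrame(c frameWriter, messageType int, data []byte) error {
	if err := c.SetWriteDeadline(time.Now().Add(wsWriteWait)); err != nil {
		return err
	}
	return c.WriteMessage(messageType, data)
}

// recordingConn is a fake that counts writes issued without a freshly
// armed deadline. It clears the flag after each write purely to check
// that the caller re-arms every time; a real deadline is an absolute
// time and stays set until changed.
type recordingConn struct {
	armed     bool
	writes    int
	unguarded int
}

func (r *recordingConn) SetWriteDeadline(t time.Time) error {
	r.armed = !t.IsZero()
	return nil
}

func (r *recordingConn) WriteMessage(messageType int, data []byte) error {
	if !r.armed {
		r.unguarded++
	}
	r.writes++
	r.armed = false
	return nil
}

func main() {
	c := &recordingConn{}
	for _, line := range []string{"one", "two", "three"} {
		if err := writeFrame(c, textMessage, []byte(line)); err != nil {
			fmt.Println("write failed:", err)
			return
		}
	}
	fmt.Println(c.writes, c.unguarded)
}
```
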
A workflow can override the spindle's default Buildkite organisation
via `tack.buildkite.org`, but `BuildkiteBuildRef` didn't carry the org
field. Spawn used the override for `CreateBuild` and then dropped it,
so `Logs` always recomputed org := p.defaultOrg and any cross-org
workflow's /logs request 404'd against the wrong organisation.
Extend TestBuildkiteSpawnWorkflowConfig to assert the org survives
the round-trip via LookupBuildkiteBuildByTuple.
Adds an `extraServiceConfig` option to the NixOS module that is
merged into the systemd service's `serviceConfig` after the
module's defaults. This lets operators set arbitrary `[Service]`
settings, most notably resource limits like `MemoryMax` and
`CPUQuota`, without needing to fork the module, and also lets
them override any of the defaults we set out of the box (e.g.
to relax a sandboxing knob).
Implemented as `attrsOf unspecified` merged with `//` so the
user's attrs win on conflict.