# spindle microVM engine

This document describes the architecture of the microVM engine for spindle. In
short, it lets spindle spin up microVM guests and implements a guest
[agent protocol](../../agentproto) for communicating with those guests (via the
[shuttle](../../../shuttle) implementation of that protocol). It does some
fairly simple resource budgeting, optionally sets up cgroups to better enforce
resource limits, and hardens the VM's network access. It also integrates with
the Nix cache: any paths built in the VM get pushed to a Nix cache by spindle
(if one is configured). The runner is abstracted behind an interface; right now
only the QEMU microVM implementation is supported, but others (e.g.
Firecracker) can slot in later.

Currently two kinds of images are supported:

- NixOS images: these allow configuration such as `dependencies`, `services`,
  `virtualisation`, `registry`, `caches` in the workflow file itself. The guest
  agent will build the configuration (or, if it's cached, spindle will send the
  store path for realization) and activate it before any workflow steps are
  run.
- Non-NixOS images: this is mainly just Alpine for now, but can be anything
  else. NixOS-style workflow-level configuration isn't supported with these. If
  Nix exists inside the image (like in our Alpine image), it will still be able
  to make use of the spindle cache.

(For testing, you can run `bash spindle/engines/microvm/test-spindle-microvm.sh`
from the repo root. This tests the Alpine and NixOS images, along with features
like whether Docker works, whether the public internet is reachable, and so on.)

## Image builds

Image builds are currently done via Nix:

- For NixOS, we use [microvm.nix](https://github.com/microvm-nix/microvm.nix)
  and layer our own configs on top; see [here](../../../nix/microvm).
- For Alpine, we have a small-ish Nix definition that fetches the kernel,
  initrd, and kernel modules; sets up the init script that configures the VM
  proper; copies dependencies (like `nix` or `git`) into a rootfs; and creates
  a squashfs from it.

This does not mean it *has* to be done via Nix: as long as your images are what
spindle expects, they should work. That is:

- a guest agent is present inside the image and gets started when the image
  boots,
- the `spindle-workflow` user exists,
- and the work directory is configured (`/workspace`).

## Image discovery

Each built image ships with a `spec.json` next to its artifacts. This spec
describes everything needed to run the image: the kernel, initrd and read-only
store disk paths, boot args, memory/vCPU sizing, the shell used for workflow
steps, writable volumes, network interfaces, and runner-specific config (machine
type, CPU, extra args for QEMU). NixOS images also carry a `baseConfigHash`
identifying the base configuration baked into the image.

An image lives in the configured image directory either as a directory
containing a `spec.json` (alongside the kernel/initrd/store-disk artifacts) or,
for a self-contained spec, as a flat `<name>.json` file. An operator keeping
multiple arches side by side can name them `<name>-<arch>` (e.g. `nixos-x86_64`,
`alpine-aarch64`); that arch suffix is just part of the name, not something
resolution infers.

A workflow names an image with the top-level `image` key (falling back to
`SPINDLE_MICROVM_PIPELINES_DEFAULT_IMAGE` if unset). The name is matched
literally: we look for `<name>` (a directory with a `spec.json`), then
`<name>.json`. Resolution depends only on the name and what is on disk, never on
the host, so the same workflow resolves identically on every spindle. If, for
example, an operator wants `nixos` to work, they can symlink `nixos` to
`nixos-x86_64`.
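
The lookup order described above can be sketched in a few lines. This is a
minimal illustration in Python, not the engine's code; `resolve_image` and its
parameters are hypothetical names chosen for the example.

```python
from pathlib import Path
from typing import Optional


def resolve_image(image_dir: Path, name: str) -> Optional[Path]:
    """Return the spec path for `name`, or None if no image matches."""
    # First candidate: a directory (or a symlink to one) containing spec.json.
    dir_spec = image_dir / name / "spec.json"
    if dir_spec.is_file():
        return dir_spec
    # Second candidate: a flat, self-contained <name>.json file.
    flat_spec = image_dir / f"{name}.json"
    if flat_spec.is_file():
        return flat_spec
    # No arch suffix is ever guessed: the name either matches or it doesn't.
    return None
```

Because symlinks are followed, an operator-created `nixos` symlink pointing at
`nixos-x86_64` resolves through the first branch without any special casing.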

The spec is validated at resolve time (required fields, positive sizes etc.),
and right before launch we also check the referenced files actually exist on
disk and that the host has the commands we need: `mkfs.ext4` for volume
formatting, plus whatever the selected runner requires. For QEMU that's the QEMU
binary for the spec's arch, `/dev/vhost-vsock`, `/dev/kvm` (if KVM is enabled),
and the `ip`, `mount`, `slirp4netns`, `unshare` toolchain when the image has
network interfaces.
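
As a rough sketch of the two phases (resolve-time validation, then a pre-launch
host check), something like the following would do. The field names
(`kernel`, `initrd`, `memoryMiB`, `vcpus`) are assumptions made up for the
example; the real `spec.json` schema may differ.

```python
import shutil
from typing import Any, Dict, List

# Hypothetical field names, for illustration only.
REQUIRED_FIELDS = ("kernel", "initrd", "memoryMiB", "vcpus")
SIZE_FIELDS = ("memoryMiB", "vcpus")


def validate_spec(spec: Dict[str, Any]) -> List[str]:
    """Resolve-time checks: required fields present, sizes positive."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in spec]
    for field in SIZE_FIELDS:
        if field in spec:
            value = spec[field]
            if not isinstance(value, int) or value <= 0:
                errors.append(f"{field} must be a positive integer")
    return errors


def missing_host_commands(commands: List[str]) -> List[str]:
    """Pre-launch check: which of the required commands are not on PATH."""
    return [c for c in commands if shutil.which(c) is None]
```

For an image with network interfaces, the pre-launch call might look like
`missing_host_commands(["mkfs.ext4", "ip", "mount", "slirp4netns", "unshare"])`,
with a non-empty result aborting the launch.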

## microVM lifecycle

```mermaid
flowchart LR
    Init["InitWorkflow<br/><small>parse manifest, resolve image, build steps</small>"]
    Acquire["AcquireWorkflowSlot<br/><small>queue until resources fit budget</small>"]
    Setup["SetupWorkflow<br/><small>proxies, VM, agent handshake</small>"]
    Run["RunStep ×N<br/><small>exec via agent</small>"]
    Destroy["DestroyWorkflow<br/><small>drain cache, poweroff, cleanup</small>"]

    Init --> Acquire --> Setup --> Run --> Destroy
```

While a workflow is running, things look like this (everything inside the cgroup
box is what gets resource-limited):

```mermaid
flowchart LR
    subgraph Host["spindle host"]
        Hub["agent hub"]
        ReadProxy["read cache proxy"]
        UploadProxy["upload cache proxy"]
        subgraph Cgroup["per-workflow cgroup"]
            QEMU["qemu"]
            Slirp["slirp4netns"]
        end
    end

    subgraph Guest["guest"]
        Agent["guest agent"]
    end

    Agent -->|"vsock"| Hub
    Agent -->|substitutions| ReadProxy
    Agent -->|built paths| UploadProxy
    QEMU --- Guest
    Slirp -->|outbound only| Internet["the internet"]
    ReadProxy --> Substituters["upstream caches"]
    UploadProxy --> NixCache["spindle nix cache"]
```

`InitWorkflow` parses the workflow manifest, resolves the image, and assembles
the step list: the clone step first, then (for NixOS images with a workflow
config) a "NixOS config activation" system step, then the user steps. Before any
of this actually runs, the workflow has to acquire a slot from the resource
scheduler. Each image declares its memory/vCPUs/disk, and workflows queue until
their request fits within the configured budget. The scheduler is
work-conserving with aging and per-user fairness, so one user submitting a pile
of jobs won't starve everyone else, and slots don't sit idle while there's
queued work that fits in the budget.
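
To make "work-conserving with aging and per-user fairness" concrete, here is
one possible admission policy, sketched in Python. It is not the scheduler's
actual algorithm: the `Job` type, `pick_next`, the scoring formula and the
`aging_per_second` constant are all assumptions for the sake of the example
(disk is left out to keep it short).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    user: str
    memory_mib: int
    vcpus: int
    queued_at: float  # seconds on any monotonic clock


def pick_next(queue: List[Job], free_memory_mib: int, free_vcpus: int,
              running_per_user: Dict[str, int], now: float,
              aging_per_second: float = 0.01) -> Optional[Job]:
    """Pick the queued job to admit next, or None if nothing fits."""
    # Work-conserving: consider every job that fits, not just the queue head.
    fitting = [j for j in queue
               if j.memory_mib <= free_memory_mib and j.vcpus <= free_vcpus]
    if not fitting:
        return None

    def priority(job: Job) -> float:
        # Fairness: users with more running jobs sort later.
        # Aging: time spent waiting steadily offsets that penalty.
        waited = now - job.queued_at
        return running_per_user.get(job.user, 0) - aging_per_second * waited

    return min(fitting, key=priority)
```

With this scoring, a user who already has three running workflows loses to a
newcomer, but a job that has waited long enough eventually wins regardless of
how busy its owner is.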

### Configuration

Setup allocates a random vsock CID for the guest and registers it with the agent
hub, which listens on a single host vsock port. Incoming agent connections are
matched to workflows by CID; anything with an unknown CID is dropped. Setup then
creates a per-workflow work directory and starts three host-side proxies that
the guest reaches over vsock: a read cache proxy (fronting the configured Nix
substituters plus any workflow-level `caches`), an upload cache proxy (for
pushing paths built in the guest to spindle's cache), and a DNS proxy that
resolves through the host's resolver and filters out answers containing
private/special-purpose addresses.
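
The CID bookkeeping can be pictured as a small registry. The sketch below is
illustrative only: `AgentHub` and its methods are hypothetical names, and the
actual socket handling is left out. The one hard fact it relies on is that
vsock reserves CIDs 0 to 2 and uses `0xFFFFFFFF` as the wildcard.

```python
import secrets
from typing import Dict, Optional

# CIDs 0-2 are reserved by vsock (hypervisor, local, host); guests start at 3.
MIN_GUEST_CID = 3
MAX_GUEST_CID = 2**32 - 2  # 0xFFFFFFFF is VMADDR_CID_ANY


class AgentHub:
    """Maps guest CIDs to workflow IDs; unknown CIDs are rejected."""

    def __init__(self) -> None:
        self._workflows: Dict[int, str] = {}

    def register(self, workflow_id: str) -> int:
        # Retry until we draw a CID that no running workflow is using.
        while True:
            cid = MIN_GUEST_CID + secrets.randbelow(MAX_GUEST_CID - MIN_GUEST_CID + 1)
            if cid not in self._workflows:
                self._workflows[cid] = workflow_id
                return cid

    def match(self, peer_cid: int) -> Optional[str]:
        """Workflow for an incoming connection, or None to drop it."""
        return self._workflows.get(peer_cid)

    def unregister(self, cid: int) -> None:
        self._workflows.pop(cid, None)
```

A listener on the hub's vsock port would call `match()` with the peer CID of
each accepted connection and close the connection when it returns `None`.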

Then the VM itself. Writable volumes from the spec are created as sparse files
and formatted as ext4; the store disk is attached read-only. QEMU runs with
`-sandbox on`, `-nodefaults`, no display/monitor, etc., with serial output going
to a log file and a QMP socket for control.

For network hardening: if the image has network interfaces, QEMU doesn't run in
the host network namespace at all. We `unshare` into fresh user/net/mount
namespaces, and a small wrapper script inside the namespace bind-mounts a
`resolv.conf` that disables QEMU's slirp DNS, then adds blackhole routes for
every special-use IPv4/IPv6 range (RFC 6890, so private networks, link-local,
loopback, CGNAT, multicast, ULAs and so on) before exec'ing QEMU. `slirp4netns`
(with `--disable-host-loopback`, sandbox and seccomp enabled) then provides
outbound connectivity for the namespace. The guest's `/etc/resolv.conf` points
at shuttle on localhost; shuttle forwards DNS packets over vsock to the
host-side DNS proxy. The guest sits behind a second layer of QEMU user-mode
networking inside that namespace, so guest traffic can only ever reach the
outside world, never the host or anything on its local networks.
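
To give a feel for the blackhole-route step, the sketch below builds the
`ip route add blackhole` invocations for a handful of ranges. The range list is
an illustrative subset, not the engine's full list, and `blackhole_commands` is
a hypothetical helper; the real wrapper is a script run inside the namespace.

```python
import ipaddress
from typing import List

# Illustrative subset of RFC 6890 special-purpose ranges (assumed, incomplete).
SPECIAL_RANGES = [
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # private networks
    "169.254.0.0/16",  # IPv4 link-local
    "127.0.0.0/8",     # loopback
    "100.64.0.0/10",   # CGNAT
    "224.0.0.0/4",     # IPv4 multicast
    "fc00::/7",        # IPv6 ULAs
    "fe80::/10",       # IPv6 link-local
]


def blackhole_commands(ranges: List[str]) -> List[List[str]]:
    """Build one `ip route add blackhole` argv per range."""
    commands = []
    for cidr in ranges:
        net = ipaddress.ip_network(cidr)  # raises ValueError on a bad range
        family = "-6" if net.version == 6 else "-4"
        commands.append(["ip", family, "route", "add", "blackhole", str(net)])
    return commands
```

Each argv could then be run with `subprocess.run(argv, check=True)` inside the
unshared network namespace, before QEMU is exec'd.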

Optionally, the whole thing (QEMU and slirp4netns) is placed in a per-workflow
cgroup with memory, swap and pids limits, so the budget above is actually
enforced and not just bookkeeping. That also lets us detect, for example, when
the cgroup OOM-kills the VM and report it as such instead of as a generic
crash. The spindle supervisor itself also gets a cgroup with a protected
`memory.min`, so under host memory pressure it's the workflows that get
OOM-killed first, not spindle.
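
With cgroup v2, both halves of this (setting limits and spotting an OOM kill)
come down to reading and writing a few files in the cgroup directory. The
sketch below assumes the per-workflow cgroup directory already exists and that
cgroup v2 is in use; the function names are made up for the example.

```python
from pathlib import Path


def apply_limits(cgroup_dir: Path, memory_bytes: int, swap_bytes: int,
                 max_pids: int) -> None:
    """Write cgroup v2 limits into an existing per-workflow cgroup directory."""
    (cgroup_dir / "memory.max").write_text(f"{memory_bytes}\n")
    (cgroup_dir / "memory.swap.max").write_text(f"{swap_bytes}\n")
    (cgroup_dir / "pids.max").write_text(f"{max_pids}\n")


def oom_kill_count(cgroup_dir: Path) -> int:
    """Read the `oom_kill` counter from cgroup v2's memory.events file."""
    for line in (cgroup_dir / "memory.events").read_text().splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill":
            return int(value)
    return 0
```

When QEMU exits unexpectedly, a non-zero `oom_kill_count()` is enough to tell
an OOM kill apart from any other kind of crash.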

### Boot - run - death

Once QEMU is up, we poll the QMP socket until it accepts a connection and
reports the guest as running, then wait for the guest agent to send a handshake
message over vsock from the expected CID. The agent reports its protocol and
versions, and spindle sends it the job ID, trusted cache public keys, and the
cache/DNS proxy ports.

First the activation step is run (if on a NixOS image and the workflow is
configured with anything): spindle sends the user config (or a cached toplevel
store path, if we've built this exact base + config combo before), and the agent
builds and activates it before the user steps run. Afterwards, each step is sent
as an exec request (`$shell -lc <command>` as an unprivileged workflow user in
`/workspace/repo`, with workflow/step environment and unlocked secrets), and
stdout/stderr stream back as messages until an exit message arrives. Timeouts
are cooperative: we derive a deadline from the workflow timeout and ship it to
the guest, with a little grace on the host side so the guest gets to report the
timeout itself. While a step runs we also watch for the VM crashing; if it does,
we tail the serial (and QEMU) logs into the step's stderr so you get something
more useful than "guest agent connection lost: EOF".
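
The exec request and the cooperative timeout are both simple to express. In the
sketch below, the helper names and the five-second grace period are assumptions
for illustration; the source only says the host allows "a little grace".

```python
from typing import List, Tuple


def exec_argv(shell: str, command: str) -> List[str]:
    """argv for a step: the image's shell, run as a login shell."""
    return [shell, "-lc", command]


def step_deadlines(workflow_started: float, workflow_timeout: float,
                   host_grace: float = 5.0) -> Tuple[float, float]:
    """Return (guest_deadline, host_deadline) as absolute timestamps."""
    guest_deadline = workflow_started + workflow_timeout
    # The host waits slightly longer so the guest can report the timeout first.
    return guest_deadline, guest_deadline + host_grace
```

The guest deadline is what gets shipped in the exec request; the host deadline
is only a backstop for a guest that never answers.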

Teardown is the same whether the workflow succeeded, failed or timed out: drain
the guest's pending Nix cache uploads, ask the agent to power off and wait for
QEMU to exit (falling back to QMP `system_powerdown` and finally a kill if it
doesn't), then close the proxies and remove the work directory. For non-HTTP
upload targets, the host-side import already happened synchronously when the
guest committed each narinfo, so there is no second host-side cache drain step
at teardown.
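
The power-off fallback chain is an escalation loop: try the gentlest request,
wait, and move on if QEMU is still alive. A minimal sketch follows, with the
actual agent, QMP and process calls injected as callables since their real
signatures are not shown here; `shutdown_vm` and the ten-second wait are
hypothetical.

```python
from typing import Callable, List, Tuple


def shutdown_vm(steps: List[Tuple[str, Callable[[], None]]],
                wait_for_exit: Callable[[float], bool],
                kill: Callable[[], None],
                timeout_per_step: float = 10.0) -> str:
    """Try each graceful step in order; kill only if none makes QEMU exit."""
    for name, request in steps:
        try:
            request()
        except Exception:
            continue  # e.g. the agent connection is already gone; escalate
        if wait_for_exit(timeout_per_step):
            return name
    kill()
    return "kill"
```

A caller would pass something like
`[("agent", agent_poweroff), ("qmp", qmp_system_powerdown)]` and log the
returned name, which records how far the escalation had to go.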

### Nix cache

The two host-side cache proxies are how the guest talks to spindle's Nix cache
without ever needing credentials or direct network access; like the agent hub,
they are reached from the guest over vsock.

The read proxy fronts the configured substituters plus any workflow-level
`caches`. When the guest needs to realize a store path, it asks the proxy, which
queries the read caches concurrently and returns the first successful response;
a 404 only wins if every upstream returns 404.
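
The "first success wins, 404 only if everyone says 404" rule can be sketched
with a thread pool. This is an illustration under assumptions, not the proxy's
implementation: upstreams are modelled as plain callables, and answering 502
for mixed failures is a guess, since the source does not say what happens then.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Tuple

Response = Tuple[int, bytes]  # (HTTP status, body)


def first_success(path: str,
                  upstreams: List[Callable[[str], Response]]) -> Response:
    """Query all upstreams concurrently and return the first 200 response."""
    statuses = []
    # Note: leaving the `with` block still waits for slower upstreams to
    # finish; a real proxy would cancel the losers instead.
    with ThreadPoolExecutor(max_workers=max(1, len(upstreams))) as pool:
        futures = [pool.submit(fetch, path) for fetch in upstreams]
        for future in as_completed(futures):
            try:
                status, body = future.result()
            except Exception:
                statuses.append(502)  # assumed: a failed upstream counts as 502
                continue
            if status == 200:
                return status, body
            statuses.append(status)
    # A 404 only wins when every upstream said 404.
    if statuses and all(s == 404 for s in statuses):
        return 404, b""
    return 502, b""
```

Keeping 404 strict matters here: a single flaky upstream should not convince
the guest that a path is missing everywhere and trigger a needless rebuild.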

The upload proxy goes the other way: paths built inside the guest are pushed to
spindle's configured upload cache (if any) so the next workflow that needs them
doesn't rebuild. Paths already present on any configured read cache are skipped.

For `http://` and `https://` upload targets the proxy just reverse-proxies the
guest's binary-cache upload traffic to the configured remote cache, while still
answering narinfo existence checks across the upload target plus the read
caches.

For `ssh://`, `ssh-ng://`, `daemon`, and `local` targets, spindle implements the
small HTTP binary-cache upload surface itself. It stages uploaded `nar/` objects
and narinfos under the workflow workdir, validates the narinfo, then treats the
narinfo upload as the commit point: once `<hash>.narinfo` is written, spindle
runs:

```bash
nix copy \
  --from file://<staging-dir> \
  --to <target-store> \
  --no-check-sigs \
  --substitute-on-destination \
  <store-path>
```

That copy is synchronous. If it fails, spindle removes the staged narinfo again
so future `GET`/`HEAD <hash>.narinfo` requests do not falsely dedupe a path that
never made it to the destination store. The guest still only ever sees the same
HTTP binary-cache upload protocol over vsock; it never gets direct access to
SSH credentials or the destination store itself.
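
The commit-and-rollback behaviour around the narinfo can be sketched as
follows. The copy itself is injected as a callable (for instance a wrapper
around the `nix copy` invocation above); `commit_narinfo` and its parameters
are hypothetical names, and narinfo validation is omitted.

```python
from pathlib import Path
from typing import Callable


def commit_narinfo(staging_dir: Path, hash_part: str, narinfo: bytes,
                   copy_to_target: Callable[[], None]) -> bool:
    """Stage the narinfo, run the copy, and roll the narinfo back on failure."""
    narinfo_path = staging_dir / f"{hash_part}.narinfo"
    narinfo_path.write_bytes(narinfo)
    try:
        copy_to_target()  # synchronous: returns only once the copy is done
    except Exception:
        # Remove the staged narinfo so later GET/HEAD checks report "missing".
        narinfo_path.unlink(missing_ok=True)
        return False
    return True
```

Treating the narinfo as the last thing written means its presence in the
staging directory always implies the path reached the destination store.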