Monorepo for Tangled tangled.org

# spindle microVM engine

This document describes the architecture of the microVM engine for spindle. In
short, it allows the spindle to spin up microVM guests, and implements a guest
[agent protocol](../../agentproto) for communicating with those guests (via the
[shuttle](../../../shuttle) implementation of that proto). It implements some
fairly simple resource budgeting, optionally sets up cgroups to better enforce
resource limits, and hardens the VM's network access. It also has Nix cache
integration: any paths built in the VM get pushed to a Nix cache by the spindle
(if one is configured). The runner is abstracted behind an interface; right now
only the QEMU microVM implementation is supported, but others (e.g.
Firecracker) can slot in later.

Currently two kinds of images are supported:

- NixOS images: these allow configuration such as `dependencies`, `services`,
  `virtualisation`, `registry`, `caches` in the workflow file itself. The guest
  agent will build that configuration (or, if it's cached, spindle will send the
  store path for realization) and activate it before any workflow steps are run.
- Non-NixOS: this is mainly just Alpine for now, but can be anything else.
  NixOS-style workflow-level configuration isn't supported while using these. If
  Nix exists inside the image (like in our Alpine image) it will still be able
  to make use of the spindle cache.

(For testing, you can run `bash spindle/engines/microvm/test-spindle-microvm.sh`
from the repo root. This tests the Alpine and NixOS images, and checks things
like whether Docker works, whether the public internet is reachable, and so on.)

## Image builds

Image builds right now are done via Nix:

- For NixOS, we use [microvm.nix](https://github.com/microvm-nix/microvm.nix),
  and layer our own configs on top, see [here](../../../nix/microvm).
- For Alpine we have a small-ish Nix definition that includes fetching the
  kernel, initrd, kernel modules; setting up the init script that configures the
  VM proper; copying dependencies (like `nix` or `git`) into a rootfs and
  creating a squashfs from it.

This does not mean it *has* to be done via Nix; as long as your images are what
spindle expects, they should work. That is:

- a guest agent is present inside the image and gets started when the image
  boots,
- a `spindle-workflow` user exists,
- and the work directory is configured (`/workspace`).

## Image discovery

Each built image ships with a `spec.json` next to its artifacts. This spec
describes everything needed to run the image: the kernel, initrd and read-only
store disk paths, boot args, memory/vCPU sizing, the shell used for workflow
steps, writable volumes, network interfaces, and runner-specific config (machine
type, CPU, extra args for QEMU). NixOS images also carry a `baseConfigHash`
identifying the base configuration baked into the image.

An image lives in the configured image directory either as a directory
containing a `spec.json` (alongside the kernel/initrd/store-disk artifacts) or,
for a self-contained spec, as a flat `<name>.json` file. An operator keeping
multiple arches side by side can name them `<name>-<arch>` (e.g. `nixos-x86_64`,
`alpine-aarch64`); that arch suffix is just part of the name, not something
resolution infers.

A workflow names an image with the `image` key at top-level (falling back to
`SPINDLE_MICROVM_PIPELINES_DEFAULT_IMAGE` if unset). The name is matched
literally: we look for `<name>` (a directory with a `spec.json`), then for
`<name>.json`. Resolution depends only on the name and what is on disk, never on
the host, so the same workflow resolves identically on every spindle. If, for
example, an operator wants `nixos` to work, they can symlink `nixos` to
`nixos-x86_64`.
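
The lookup order described above can be illustrated with a small Python sketch.
This is not spindle's actual code: `resolve_image` and `demo` are hypothetical
names, and the sketch only assumes what the text states (a directory containing
`spec.json` is tried first, then a flat `<name>.json`, with no inference from
the host or arch).

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def resolve_image(image_dir: Path, name: str) -> Optional[Path]:
    """Return the spec path for `name`, or None if nothing on disk matches."""
    # 1. A directory (or a symlink to one) that contains spec.json.
    dir_spec = image_dir / name / "spec.json"
    if dir_spec.is_file():
        return dir_spec
    # 2. A flat, self-contained <name>.json file.
    flat_spec = image_dir / f"{name}.json"
    if flat_spec.is_file():
        return flat_spec
    # The name is matched literally: no arch suffix is appended or guessed.
    return None


def demo() -> Dict[str, Optional[str]]:
    """Build a throwaway image directory and resolve a few names against it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "nixos-x86_64").mkdir()
        (root / "nixos-x86_64" / "spec.json").write_text(json.dumps({}))
        (root / "alpine.json").write_text(json.dumps({}))
        # Operator-created alias so that plain `nixos` resolves too.
        (root / "nixos").symlink_to(root / "nixos-x86_64", target_is_directory=True)

        results: Dict[str, Optional[str]] = {}
        for name in ("nixos", "nixos-x86_64", "alpine", "debian"):
            spec = resolve_image(root, name)
            results[name] = spec.relative_to(root).as_posix() if spec else None
        return results
```

Because the symlink is followed by the ordinary file check, the alias needs no
special handling in the resolver; a name with no matching entry on disk simply
fails to resolve.
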

The spec is validated at resolve time (required fields, positive sizes etc.),
and right before launch we also check that the referenced files actually exist
on disk and that the host has the commands we need: `mkfs.ext4` for volume
formatting, plus whatever the selected runner requires. For QEMU that's the QEMU
binary for the spec's arch, `/dev/vhost-vsock`, `/dev/kvm` (if KVM is enabled),
and the `ip`, `mount`, `slirp4netns`, `unshare` toolchain when the image has
network interfaces.

## microVM lifecycle

```mermaid
flowchart LR
    Init["InitWorkflow<br/><small>parse manifest, resolve image, build steps</small>"]
    Acquire["AcquireWorkflowSlot<br/><small>queue until resources fit budget</small>"]
    Setup["SetupWorkflow<br/><small>proxies, VM, agent handshake</small>"]
    Run["RunStep ×N<br/><small>exec via agent</small>"]
    Destroy["DestroyWorkflow<br/><small>drain cache, poweroff, cleanup</small>"]

    Init --> Acquire --> Setup --> Run --> Destroy
```

While a workflow is running, things look like this (everything inside the cgroup
box is what gets resource-limited):

```mermaid
flowchart LR
    subgraph Host["spindle host"]
        Hub["agent hub"]
        ReadProxy["read cache proxy"]
        UploadProxy["upload cache proxy"]
        subgraph Cgroup["per-workflow cgroup"]
            QEMU["qemu"]
            Slirp["slirp4netns"]
        end
    end

    subgraph Guest["guest"]
        Agent["guest agent"]
    end

    Agent -->|"vsock"| Hub
    Agent -->|substitutions| ReadProxy
    Agent -->|built paths| UploadProxy
    QEMU --- Guest
    Slirp -->|outbound only| Internet["the internet"]
    ReadProxy --> Substituters["upstream caches"]
    UploadProxy --> NixCache["spindle nix cache"]
```

`InitWorkflow` parses the workflow manifest, resolves the image, and assembles
the step list: the clone step first, then (for NixOS images with a workflow
config) a "NixOS config activation" system step, then the user steps.
Before any of this actually runs, the workflow has to acquire a slot from the
resource scheduler: each image declares its memory/vCPUs/disk, and workflows
queue until their request fits within the configured budget. The scheduler is
work-conserving with aging and per-user fairness, so one user submitting a pile
of jobs won't starve everyone else, and slots don't sit idle while there's
queued work that fits in the budget.

### Configuration

Setup allocates a random vsock CID for the guest and registers it with the agent
hub, which listens on a single host vsock port. Incoming agent connections are
matched to workflows by CID; anything with an unknown CID is dropped. It then
creates a per-workflow work directory and starts three host-side proxies the
guest reaches over vsock: a read cache proxy (fronting the configured Nix
substituters plus any workflow-level `caches`), an upload cache proxy (for
pushing paths built in the guest to the spindle's cache), and a DNS proxy that
resolves through the host's resolver and filters out answers pointing at
private/special-purpose addresses.

Then the VM itself. Writable volumes from the spec are created as sparse files
and formatted as ext4; the store disk is attached read-only. QEMU runs with
`-sandbox on`, `-nodefaults`, no display/monitor, etc., serial output to a log
file, and a QMP socket for control.

For network hardening: if the image has network interfaces, QEMU doesn't run in
the host network namespace at all. We `unshare` into fresh user/net/mount
namespaces, and a small wrapper script inside the namespace bind-mounts a
resolv.conf that disables QEMU's slirp DNS and adds blackhole routes for every
special-use IPv4/IPv6 range (RFC 6890, so private networks, link-local,
loopback, CGNAT, multicast, ULAs and so on) before exec'ing QEMU.
`slirp4netns` (with `--disable-host-loopback`, sandbox and seccomp enabled) then
provides outbound connectivity for the namespace. The guest's
`/etc/resolv.conf` points at shuttle on localhost; shuttle forwards DNS packets
over vsock to the host-side DNS proxy. The guest sits behind a second layer of
QEMU user-mode networking inside that namespace, so guest traffic can only ever
reach the outside world, never the host or anything on its local networks.

Optionally the whole thing (QEMU and slirp4netns) is placed in a per-workflow
cgroup with memory, swap and pids limits, so the budget above is actually
enforced and not just bookkeeping. That also lets us, for example, detect when
the cgroup OOM-kills the VM and report it as such instead of as a generic crash.
The spindle supervisor itself also gets a cgroup with a protected `memory.min`,
so under host memory pressure it's the workflows that get OOM-killed first, not
spindle.

### Boot - run - death

Once QEMU is up we poll the QMP socket until it accepts a connection and reports
the guest as running, then wait for the guest agent to send a handshake message
over vsock from the expected CID. The agent reports its protocol and versions,
and spindle sends it the job id, trusted cache public keys, and the cache/DNS
proxy ports.

First the activation step is run (if on a NixOS image and the workflow
configures anything): spindle sends the user config (or a cached toplevel store
path, if we've built this exact base + config combo before) and the agent builds
and activates it before the user steps run. Afterwards, each step is sent as an
exec request (`$shell -lc <command>` as an unprivileged workflow user in
`/workspace/repo`, with workflow/step environment and unlocked secrets), and
stdout/stderr stream back as messages until an exit message arrives.
Timeouts are cooperative: we derive a deadline from the workflow timeout and
ship it to the guest, with a little grace on the host side so the guest gets to
report the timeout itself. While a step runs we also watch for the VM crashing;
if it does, we tail the serial (and QEMU) logs into the step's stderr so you get
something more useful than "guest agent connection lost: EOF".

Teardown is the same whether the workflow succeeded, failed or timed out: drain
the guest's pending Nix cache uploads, ask the agent to power off and wait for
QEMU to exit (falling back to QMP `system_powerdown` and finally a kill if it
doesn't), then close the proxies and remove the work directory. For non-HTTP
upload targets the host-side import already happened synchronously when the
guest committed each narinfo, so there is no second host-side cache drain step
at teardown.

### Nix cache

The two host-side cache proxies are how the guest talks to spindle's Nix cache
without ever needing credentials or direct network access; like the agent
connection, this traffic reaches the host over vsock.

The read proxy fronts the configured substituters plus any workflow-level
`caches`. When the guest needs to realize a store path it asks the proxy, which
queries the read caches concurrently and returns the first successful response,
with a 404 only winning if every upstream returns 404.

The upload proxy goes the other way: paths built inside the guest are pushed to
spindle's configured upload cache (if any) so the next workflow that needs them
doesn't rebuild. Paths already present on any configured read cache are skipped.

For `http://` and `https://` upload targets the proxy just reverse-proxies the
guest's binary-cache upload traffic to the configured remote cache, while still
answering narinfo existence checks across the upload target plus the read
caches.
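
To make the read proxy's rule concrete ("first success wins, 404 only if every
upstream says 404"), here is a minimal Python sketch. It is an illustration
rather than spindle's implementation: `fetch_first_success` and the `Upstream`
callable type are hypothetical, upstreams are modelled as plain functions
instead of real HTTP clients, and answering 502 when failures are mixed (some
404s, some other errors) is an assumption the text does not specify.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Tuple

Response = Tuple[int, bytes]          # (HTTP status, body)
Upstream = Callable[[str], Response]  # hypothetical: one function per read cache


def fetch_first_success(upstreams: List[Upstream], path: str) -> Response:
    """Query every upstream concurrently and return the first 2xx response."""
    if not upstreams:
        return 404, b""
    saw_other_failure = False
    pool = ThreadPoolExecutor(max_workers=len(upstreams))
    try:
        futures = [pool.submit(upstream, path) for upstream in upstreams]
        for future in as_completed(futures):
            try:
                status, body = future.result()
            except Exception:
                # A broken upstream must not turn the result into a clean 404.
                saw_other_failure = True
                continue
            if 200 <= status < 300:
                return status, body  # first success wins
            if status != 404:
                saw_other_failure = True
    finally:
        # Do not block on slower upstreams once we have an answer (Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)
    # Assumption: mixed failures surface as 502; only unanimous 404s yield 404.
    return (502, b"") if saw_other_failure else (404, b"")


def hit(path: str) -> Response:
    return 200, b"narinfo for " + path.encode()


def miss(path: str) -> Response:
    return 404, b""


def broken(path: str) -> Response:
    raise ConnectionError("upstream unreachable")
```

For instance, `fetch_first_success([miss, hit, miss], "abc.narinfo")` returns
the 200 response from `hit`, while `fetch_first_success([miss, miss],
"abc.narinfo")` returns a 404 because every upstream agreed the path is absent.
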

For `ssh://`, `ssh-ng://`, `daemon`, and `local` targets spindle implements the
small HTTP binary-cache upload surface itself. It stages uploaded `nar/` objects
and narinfos under the workflow workdir, validates the narinfo, then treats the
narinfo upload as the commit point: once `<hash>.narinfo` is written, spindle
runs:

```bash
nix copy \
  --from file://<staging-dir> \
  --to <target-store> \
  --no-check-sigs \
  --substitute-on-destination \
  <store-path>
```

That copy is synchronous. If it fails, spindle removes the staged narinfo again
so future `GET`/`HEAD <hash>.narinfo` requests do not falsely dedupe a path that
never made it to the destination store. The guest still only ever sees the same
HTTP binary-cache upload protocol over vsock; it never gets direct access to
SSH credentials or the destination store itself.
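
The commit-and-rollback behaviour can be sketched in Python as follows. This is
a simplified illustration under stated assumptions, not spindle's code:
`commit_narinfo`, `nix_copy_argv` and `run_nix_copy` are hypothetical names,
narinfo validation is omitted, and the copy step is injectable so the rollback
path can be exercised without a Nix installation.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def nix_copy_argv(staging_dir: Path, target_store: str, store_path: str) -> List[str]:
    """Build the same `nix copy` invocation shown in the command above."""
    return [
        "nix", "copy",
        "--from", f"file://{staging_dir}",
        "--to", target_store,
        "--no-check-sigs",
        "--substitute-on-destination",
        store_path,
    ]


def run_nix_copy(argv: List[str]) -> None:
    # Synchronous: raises CalledProcessError on a non-zero exit status.
    subprocess.run(argv, check=True)


def commit_narinfo(staging_dir: Path, narinfo_name: str, body: bytes,
                   target_store: str, store_path: str,
                   copy: Callable[[List[str]], None] = run_nix_copy) -> bool:
    """Write the narinfo (the commit point), copy, and roll back on failure."""
    narinfo = staging_dir / narinfo_name
    narinfo.write_bytes(body)
    try:
        copy(nix_copy_argv(staging_dir, target_store, store_path))
    except (subprocess.CalledProcessError, OSError):
        # Remove the staged narinfo so later GET/HEAD checks do not report a
        # path that never reached the destination store.
        narinfo.unlink(missing_ok=True)
        return False
    return True


def demo() -> Tuple[bool, bool, bool, bool]:
    def ok_copy(argv: List[str]) -> None:
        return None

    def failing_copy(argv: List[str]) -> None:
        raise subprocess.CalledProcessError(1, argv)

    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp)
        # Store paths and the target URL below are placeholders.
        ok = commit_narinfo(staging, "aaaa.narinfo", b"StorePath: /nix/store/aaaa-x\n",
                            "ssh-ng://cache.example", "/nix/store/aaaa-x", ok_copy)
        ok_kept = (staging / "aaaa.narinfo").exists()
        bad = commit_narinfo(staging, "bbbb.narinfo", b"StorePath: /nix/store/bbbb-y\n",
                             "ssh-ng://cache.example", "/nix/store/bbbb-y", failing_copy)
        bad_kept = (staging / "bbbb.narinfo").exists()
        return ok, ok_kept, bad, bad_kept
```

Treating the narinfo write as the commit point keeps the dedupe check simple:
a staged `<hash>.narinfo` exists only for paths whose copy either succeeded or
is still in progress within the same synchronous request.
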