# spindle microVM engine

This document describes the architecture of the microVM engine for spindle. In
short, it lets spindle spin up microVM guests and implements a guest
[agent protocol](../../agentproto) for communicating with them (via the
[shuttle](../../../shuttle) implementation of that protocol). It does some
fairly simple resource budgeting, optionally sets up cgroups to better enforce
resource limits, and hardens the VM's network access. It also has Nix cache
integration: any paths built in the VM get pushed to a Nix cache by spindle (if
one is configured). The runner is abstracted behind an interface; right now only
the QEMU microVM implementation is supported, but others (e.g. Firecracker) can
slot in later.

Currently two kinds of images are supported:

- NixOS images: these allow configuration such as `dependencies`, `services`,
  `virtualisation`, `registry`, `caches` in the workflow file itself. The guest
  agent will build this configuration (or, if it's cached, spindle will send the
  store path for realization) and activate it before any workflow steps are run.
- Non-NixOS: this is mainly just Alpine for now, but can be anything else.
  Workflow-level configuration like that of NixOS images isn't supported with
  these. If Nix exists inside the image (like in our Alpine image) it will still
  be able to make use of the spindle cache.

(For testing, you can run `bash spindle/engines/microvm/test-spindle-microvm.sh`
from the repo root. It tests the Alpine and NixOS images, plus features such as
whether Docker works, whether the public internet is reachable, and so on.)

## Image builds

Image builds right now are done via Nix:

- For NixOS, we use [microvm.nix](https://github.com/microvm-nix/microvm.nix)
  and layer our own configs on top; see [here](../../../nix/microvm).
- For Alpine we have a small-ish Nix definition that fetches the kernel, initrd
  and kernel modules; sets up the init script that configures the VM proper;
  copies dependencies (like `nix` or `git`) into a rootfs; and creates a
  squashfs from it.

This does not mean images *have* to be built via Nix; as long as they are what
spindle expects, they should work. That is:
- a guest agent is present inside the image and gets started when the image
  boots,
- a `spindle-workflow` user exists,
- and the work directory is configured (`/workspace`).

## Image discovery

Each built image ships with a `spec.json` next to its artifacts. This spec
describes everything needed to run the image: the kernel, initrd and read-only
store disk paths, boot args, memory/vCPU sizing, the shell used for workflow
steps, writable volumes, network interfaces, and runner-specific config (machine
type, CPU, extra args for QEMU). NixOS images also carry a `baseConfigHash`
identifying the base configuration baked into the image.

An image lives in the configured image directory either as a directory
containing a `spec.json` (alongside the kernel/initrd/store-disk artifacts) or,
for a self-contained spec, as a flat `<name>.json` file. An operator keeping
multiple arches side by side can name them `<name>-<arch>` (e.g. `nixos-x86_64`,
`alpine-aarch64`); that arch suffix is just part of the name, not something
resolution infers.

A workflow names an image with the top-level `image` key (falling back to
`SPINDLE_MICROVM_PIPELINES_DEFAULT_IMAGE` if unset). The name is matched
literally: we look for `<name>` (a directory with a `spec.json`), then for
`<name>.json`. Resolution depends only on the name and what is on disk, never on
the host, so the same workflow resolves identically on every spindle. If, for
example, an operator wants plain `nixos` to work, they can symlink `nixos` to
`nixos-x86_64`.

The spec is validated at resolve time (required fields, positive sizes, etc.).
Right before launch we also check that the referenced files actually exist on
disk and that the host has the commands we need: `mkfs.ext4` for volume
formatting, plus whatever the selected runner requires. For QEMU that's the QEMU
binary for the spec's arch, `/dev/vhost-vsock`, `/dev/kvm` (if KVM is enabled),
and the `ip`, `mount`, `slirp4netns`, `unshare` toolchain when the image has
network interfaces.
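To make the lookup order and the resolve-time checks in this section concrete, here is a minimal sketch. It is not spindle's implementation; the field names in `REQUIRED_FIELDS` (`kernel`, `initrd`, `memoryMiB`, `vcpus`) are hypothetical stand-ins for whatever the real `spec.json` schema uses.

```python
from pathlib import Path

# Hypothetical subset of required spec fields; the real schema may differ.
REQUIRED_FIELDS = ("kernel", "initrd", "memoryMiB", "vcpus")


def resolve_image(image_dir: Path, name: str) -> Path:
    """Return the spec path for `name`: `<name>/spec.json` first, then `<name>.json`."""
    for candidate in (image_dir / name / "spec.json", image_dir / f"{name}.json"):
        if candidate.is_file():  # follows symlinks, so `nixos -> nixos-x86_64` works
            return candidate
    raise FileNotFoundError(f"no image named {name!r} in {image_dir}")


def validate_spec(spec: dict) -> list[str]:
    """Collect resolve-time problems: missing fields and non-positive sizes."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in spec]
    for key in ("memoryMiB", "vcpus"):
        value = spec.get(key)
        if key in spec and (not isinstance(value, int) or value <= 0):
            errors.append(f"{key} must be a positive integer")
    return errors
```

Note that nothing here looks at the host architecture: the result depends only on the name and the directory contents, which is what keeps resolution identical across spindles.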

## microVM lifecycle

```mermaid
flowchart LR
    Init["InitWorkflow<br/><small>parse manifest, resolve image, build steps</small>"]
    Acquire["AcquireWorkflowSlot<br/><small>queue until resources fit budget</small>"]
    Setup["SetupWorkflow<br/><small>proxies, VM, agent handshake</small>"]
    Run["RunStep ×N<br/><small>exec via agent</small>"]
    Destroy["DestroyWorkflow<br/><small>drain cache, poweroff, cleanup</small>"]

    Init --> Acquire --> Setup --> Run --> Destroy
```

While a workflow is running, things look like this (everything inside the cgroup
box is what gets resource-limited):

```mermaid
flowchart LR
    subgraph Host["spindle host"]
        Hub["agent hub"]
        ReadProxy["read cache proxy"]
        UploadProxy["upload cache proxy"]
        subgraph Cgroup["per-workflow cgroup"]
            QEMU["qemu"]
            Slirp["slirp4netns"]
        end
    end

    subgraph Guest["guest"]
        Agent["guest agent"]
    end

    Agent -->|"vsock"| Hub
    Agent -->|substitutions| ReadProxy
    Agent -->|built paths| UploadProxy
    QEMU --- Guest
    Slirp -->|outbound only| Internet["the internet"]
    ReadProxy --> Substituters["upstream caches"]
    UploadProxy --> NixCache["spindle nix cache"]
```

`InitWorkflow` parses the workflow manifest, resolves the image, and assembles
the step list: the clone step first, then (for NixOS images with a workflow
config) a "NixOS config activation" system step, then the user steps. Before any
of this actually runs, the workflow has to acquire a slot from the resource
scheduler: each image declares its memory/vCPUs/disk, and workflows queue until
their request fits within the configured budget. The scheduler is
work-conserving with aging and per-user fairness, so one user submitting a pile
of jobs won't starve everyone else, and slots don't sit idle while there's
queued work that fits in the budget.
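As an illustration of what "work-conserving with aging and per-user fairness" can mean, here is a toy admission loop. The scoring formula, the `aging_per_second` weight and all class and field names are assumptions made for this sketch, not spindle's actual scheduler; disk is left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    user: str
    memory_mib: int
    vcpus: int
    enqueued_at: float


@dataclass
class Scheduler:
    memory_mib: int  # total memory budget
    vcpus: int       # total vCPU budget
    aging_per_second: float = 0.01  # hypothetical weight for time spent waiting
    queue: list = field(default_factory=list)
    running: list = field(default_factory=list)

    def _fits(self, req):
        used_mem = sum(r.memory_mib for r in self.running)
        used_cpu = sum(r.vcpus for r in self.running)
        return (used_mem + req.memory_mib <= self.memory_mib
                and used_cpu + req.vcpus <= self.vcpus)

    def _score(self, req, now):
        # Lower is better: users with fewer running workflows go first
        # (fairness), and every second in the queue improves the score (aging).
        load = sum(1 for r in self.running if r.user == req.user)
        return load - self.aging_per_second * (now - req.enqueued_at)

    def admit(self, now):
        """Start queued requests, best score first, until nothing else fits."""
        started = []
        while True:
            fitting = [r for r in self.queue if self._fits(r)]
            if not fitting:
                return started
            best = min(fitting, key=lambda r: self._score(r, now))
            self.queue.remove(best)
            self.running.append(best)
            started.append(best)
```

Because `admit` only ever considers requests that fit, a large request at the head of the queue never blocks smaller ones behind it (work conservation), while aging ensures that the large request's score keeps improving until it wins as soon as it fits.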

### Configuration

Setup allocates a random vsock CID for the guest and registers it with the agent
hub, which listens on a single host vsock port. Incoming agent connections are
matched to workflows by CID; anything with an unknown CID is dropped. Setup then
creates a per-workflow work directory and starts three host-side proxies that
the guest reaches over vsock: a read cache proxy (fronting the configured Nix
substituters plus any workflow-level `caches`), an upload cache proxy (for
pushing paths built in the guest to the spindle's cache), and a DNS proxy that
resolves through the host's resolver and filters out answers in
private/special-purpose address ranges.
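The answer filtering done by the DNS proxy can be approximated with Python's standard `ipaddress` module. This is only a sketch of the idea: the function names are made up, and the exact set of ranges spindle filters may differ from what `is_global` considers non-public.

```python
import ipaddress


def is_public(address: str) -> bool:
    """True only for globally routable unicast addresses."""
    ip = ipaddress.ip_address(address)
    # is_global excludes private, loopback, link-local, CGNAT and similar
    # ranges; multicast is checked separately.
    return ip.is_global and not ip.is_multicast


def filter_answers(addresses):
    """Drop resolved addresses that point at private or special-purpose ranges."""
    return [a for a in addresses if is_public(a)]
```

Filtering answers on the host side means a hostname that resolves to, say, `10.0.0.5` or `169.254.169.254` never reaches the guest as a usable address, independently of the route-level blocking applied to the VM's network namespace.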

Then the VM itself. Writable volumes from the spec are created as sparse files
and formatted as ext4; the store disk is attached read-only. QEMU runs with
`-sandbox on`, `-nodefaults`, no display/monitor and so on, with serial output
going to a log file and a QMP socket for control.

For network hardening: if the image has network interfaces, QEMU doesn't run in
the host network namespace at all. We `unshare` into fresh user/net/mount
namespaces, and a small wrapper script inside the namespace bind-mounts a
resolv.conf that disables QEMU's slirp DNS, then adds blackhole routes for every
special-use IPv4/IPv6 range (RFC 6890, so private networks, link-local,
loopback, CGNAT, multicast, ULAs and so on) before exec'ing QEMU. `slirp4netns`
(with `--disable-host-loopback`, sandbox and seccomp enabled) then provides
outbound connectivity for the namespace. The guest's `/etc/resolv.conf` points
at shuttle on localhost; shuttle forwards DNS packets over vsock to the
host-side DNS proxy. The guest sits behind a second layer of QEMU user-mode
networking inside that namespace, so guest traffic can only ever reach the
outside world, never the host or anything on its local networks.

Optionally the whole thing (QEMU and slirp4netns) is placed in a per-workflow
cgroup with memory, swap and pids limits, so the budget above is actually
enforced and not just bookkeeping. It also lets us detect, for example, when the
cgroup OOM-kills the VM and report it as such instead of as a generic crash. The
spindle supervisor itself also gets a cgroup with a protected `memory.min`, so
under host memory pressure it's the workflows that get OOM-killed first, not
spindle.
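One way to tell an OOM kill apart from a generic crash is the `oom_kill` counter in the cgroup v2 `memory.events` file. The sketch below assumes cgroup v2 (which `memory.min` implies) and is not necessarily how spindle does the detection; the helper names are made up.

```python
def parse_memory_events(text: str) -> dict:
    """Parse the `key value` lines of a cgroup v2 `memory.events` file."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            events[key] = int(value)
    return events


def was_oom_killed(before: str, after: str) -> bool:
    """True if the `oom_kill` counter grew between two reads of the file."""
    old = parse_memory_events(before).get("oom_kill", 0)
    new = parse_memory_events(after).get("oom_kill", 0)
    return new > old
```

In practice `before` would be read from the per-workflow cgroup's `memory.events` when the VM starts and `after` when QEMU exits unexpectedly.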

### Boot - run - death

Once QEMU is up we poll the QMP socket until it accepts a connection and reports
the guest as running, then wait for the guest agent to send a handshake message
over vsock from the expected CID. The agent reports its protocol and versions,
and spindle sends it the job id, trusted cache public keys, and the cache/DNS
proxy ports.

The activation step runs first (if this is a NixOS image and the workflow
configures anything): spindle sends the user config (or a cached toplevel store
path, if we've built this exact base + config combo before) and the agent builds
and activates it before the user steps run. Afterwards, each step is sent as an
exec request (`$shell -lc <command>` as an unprivileged workflow user in
`/workspace/repo`, with workflow/step environment and unlocked secrets), and
stdout/stderr stream back as messages until an exit message arrives. Timeouts
are cooperative: we derive a deadline from the workflow timeout and ship it to
the guest, with a little grace on the host side so the guest gets to report the
timeout itself. While a step runs we also watch for the VM crashing; if it does,
we tail the serial (and QEMU) logs into the step's stderr so you get something
more useful than "guest agent connection lost: EOF".
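The cooperative timeout boils down to two deadlines, one shipped to the guest and a slightly later one kept on the host. A minimal sketch, where the names and the 5 second grace period are assumptions rather than spindle's actual values:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deadlines:
    guest: float  # shipped to the guest agent, which reports the timeout itself
    host: float   # host-side backstop, only hit if the guest never reports


def derive_deadlines(started_at: float, workflow_timeout: float,
                     grace: float = 5.0) -> Deadlines:
    """Derive both deadlines (in seconds) from the workflow timeout."""
    guest = started_at + workflow_timeout
    return Deadlines(guest=guest, host=guest + grace)


def timed_out_by(now: float, deadlines: Deadlines):
    """Return who may declare the timeout at `now`: "guest", "host" or None."""
    if now >= deadlines.host:
        return "host"
    if now >= deadlines.guest:
        return "guest"
    return None
```

The gap between the two deadlines is what gives the guest agent time to kill the step and send a proper timeout message before the host gives up on it.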

Teardown is the same whether the workflow succeeded, failed or timed out: drain
the guest's pending Nix cache uploads, ask the agent to power off and wait for
QEMU to exit (falling back to QMP `system_powerdown` and finally a kill if it
doesn't), then close the proxies and remove the work directory. For non-HTTP
upload targets the host-side import already happened synchronously when the
guest committed each narinfo, so there is no second host-side cache drain step
at teardown.
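The poweroff fallback chain is a simple escalation loop: try a stage, wait a bounded time for QEMU to exit, move on to the next stage if it didn't. The sketch below takes the stages as plain callables; the function name, the timeout and the polling interval are illustrative assumptions.

```python
import time


def shut_down(stages, exited, timeout=10.0, poll=0.1):
    """Run (name, action) stages in order until `exited()` is true.

    Returns the name of the stage that got the VM to exit.
    """
    for name, action in stages:
        action()
        deadline = time.monotonic() + timeout
        while True:
            if exited():
                return name
            if time.monotonic() >= deadline:
                break  # this stage did not work in time, escalate
            time.sleep(poll)
    raise RuntimeError("VM still running after every shutdown stage")
```

With stages for the agent poweroff request, QMP `system_powerdown` and a process kill, passed in that order, this reproduces the sequence described above.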

### Nix cache

The two host-side cache proxies are how the guest talks to spindle's Nix cache
without ever needing credentials or direct network access; like the agent, they
are reached from the guest over vsock.

The read proxy fronts the configured substituters plus any workflow-level
`caches`. When the guest needs to realize a store path it asks the proxy, which
queries the read caches concurrently and returns the first successful response.
A 404 only wins if every upstream returns 404.
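The "first success wins, 404 only if everyone says 404" rule can be sketched with a thread pool. Here `fetch(upstream, path)` is a hypothetical callable returning `(status, body)`, and answering 502 when upstreams fail in some other way is an assumption of this sketch, not documented spindle behaviour.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_first(upstreams, path, fetch):
    """Query all upstreams concurrently and return the first 200 response."""
    statuses = []
    pool = ThreadPoolExecutor(max_workers=max(1, len(upstreams)))
    try:
        futures = [pool.submit(fetch, u, path) for u in upstreams]
        for future in as_completed(futures):
            try:
                status, body = future.result()
            except Exception:
                status, body = 502, b""  # treat a failed upstream as a bad gateway
            if status == 200:
                return status, body
            statuses.append(status)
    finally:
        # Don't wait for slower upstreams once we have an answer.
        pool.shutdown(wait=False, cancel_futures=True)
    if all(s == 404 for s in statuses):
        return 404, b""
    return 502, b""
```

Returning 404 only when every upstream agreed avoids telling Nix that a path is missing just because one cache was briefly unreachable.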

The upload proxy goes the other way: paths built inside the guest are pushed to
spindle's configured upload cache (if any) so the next workflow that needs them
doesn't rebuild. Paths already present on any configured read cache are skipped.

For `http://` and `https://` upload targets the proxy just reverse-proxies the
guest's binary-cache upload traffic to the configured remote cache, while still
answering narinfo existence checks across the upload target plus the read
caches.

For `ssh://`, `ssh-ng://`, `daemon`, and `local` targets spindle implements the
small HTTP binary-cache upload surface itself. It stages uploaded `nar/` objects
and narinfos under the workflow workdir, validates the narinfo, then treats the
narinfo upload as the commit point: once `<hash>.narinfo` is written, spindle
runs:

```bash
nix copy \
  --from file://<staging-dir> \
  --to <target-store> \
  --no-check-sigs \
  --substitute-on-destination \
  <store-path>
```

That copy is synchronous. If it fails, spindle removes the staged narinfo again
so future `GET`/`HEAD <hash>.narinfo` requests do not falsely dedupe a path that
never made it to the destination store. The guest still only ever sees the same
HTTP binary-cache upload protocol over vsock; it never gets direct access to
SSH credentials or the destination store itself.
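The commit-then-rollback behaviour around the narinfo can be sketched as follows. The `nix copy` arguments mirror the invocation shown above; the function names and the injectable `copy` parameter are assumptions of this sketch, and narinfo validation is left out.

```python
import subprocess
from pathlib import Path


def nix_copy(staging_dir: Path, target_store: str, store_path: str) -> None:
    """Synchronously copy one store path; raises CalledProcessError on failure."""
    subprocess.run(
        ["nix", "copy", "--from", f"file://{staging_dir}", "--to", target_store,
         "--no-check-sigs", "--substitute-on-destination", store_path],
        check=True,
    )


def commit_narinfo(staging_dir: Path, name: str, content: bytes,
                   target_store: str, store_path: str, copy=nix_copy) -> Path:
    """Write `<hash>.narinfo`, import the path, and undo the write if that fails."""
    narinfo = staging_dir / name
    narinfo.write_bytes(content)  # the commit point
    try:
        copy(staging_dir, target_store, store_path)
    except Exception:
        # Roll back so later GET/HEAD checks don't report the path as present.
        narinfo.unlink(missing_ok=True)
        raise
    return narinfo
```

Passing a different `copy` callable makes the rollback path easy to exercise without a real destination store.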