blog: add microvm post · tangled.org/core@4c808c1

+391

1 changed file

blog/posts/spindle-microvm.md

---
atroot: true
template:
slug: spindle-microvm
title: Spindle's new microVM engine
subtitle: How we built the new QEMU-based microVM engine
date: 2026-06-16
image: https://assets.tangled.network/blog/microvm.png
authors:
  - name: dawn
    email: dawn@tangled.org
    handle: ptr.pet
---

Spindle gains a second engine: `microvm`. Each workflow gets its own little
virtual machine, a whole real environment you can do anything inside. It's an
upgrade from the Nixery engine while staying fully compatible with it, so if you
already have a working Nixery workflow, just change `nixery` to `microvm` and it
will work!

The interesting part is NixOS images: you configure the machine directly from
the workflow file. A few things you can do:

You can bring services up:

```yaml
services:
  postgresql:
    enable: true
    ensureDatabases: ["spindle-workflow"]
    ensureUsers:
      - name: spindle-workflow
        ensureDBOwnership: true
```

You can build Docker containers:

```yaml
virtualisation:
  docker: true
steps:
  - name: "do the thing!"
    command: docker build ...
```

And you can use non-NixOS images too:

```yaml
image: alpine
steps:
  - name: install golang
    command: apk add go
```

It's quick on the second run, too, because it caches aggressively: your
dependencies, your services, and any other Nix derivation built inside the
microVM get pushed to spindle's Nix cache, so the next workflow that needs them
doesn't rebuild them. More on that [below](#nix-cache-both-ways).

And like everything else in tangled, the whole thing is self-hostable, so you
can run your own spindle with the microVM engine on your own hardware (see the
[self-hosting
guide](https://docs.tangled.org/spindles.html#self-hosting-guide)). If you want
fuller examples, there are [recipes in the
docs](https://docs.tangled.org/spindles.html#recipes) too.

## What's in a microVM

A microVM is just a VM with most of the boring parts removed. There's no BIOS,
no PCI bus to probe, no emulated graphics card, none of the slow legacy stuff a
normal QEMU machine drags along. You get virtio devices and not much
else, which means it boots very quickly and uses very little memory. Right now
QEMU is the only runner we support, but the engine is written so that other
runners (firecracker for example) can slot in later.

Inside the guest there's a small piece of software we call the agent. Spindle
never SSHes in or runs commands "from the outside"; instead the agent dials back
to spindle over vsock the moment it boots, says hello, and from then on every
step of your workflow is sent to it as a message. The agent runs the command as
an unprivileged user, streams stdout and stderr back, and reports the exit code.
The host side of this lives in
[`spindle`](https://tangled.org/tangled.org/core/tree/master/spindle/engines/microvm/agent.go)
and the guest side is a little Rust binary called
[`shuttle`](https://tangled.org/tangled.org/core/tree/master/shuttle).
(`shuttle` implements
[`agentproto`](https://tangled.org/tangled.org/core/tree/master/spindle/), which
is the protocol used by `spindle`. Technically speaking anyone could implement
this and, assuming the expected side effects hold, you could have your own
agent!)

![A microVM workflow overview: spindle starts QEMU, which boots the guest VM. The
guest runs shuttle, the guest agent. Shuttle connects back to spindle over
virtio-vsock, then spindle sends workflow steps over agentproto and receives
logs and exit codes back. QEMU reaches the internet through slirp4netns, cache
proxies, and the DNS
proxy.](https://assets.tangled.network/blog/microvm-overview.png)

## Two kinds of images

There are two "flavours" of image you can boot, and they're aimed at fairly
different people.

The first is **NixOS images**. These are the interesting ones: because the whole
guest is built with Nix, you can configure it from your workflow file directly.
Things like `dependencies`, `services`, `virtualisation` (e.g. Docker),
`registry` and `caches` are all written right there in the YAML, and the guest
agent builds and activates that config before any of your steps run. If we've
built that exact base plus config before, spindle can just hand the guest a
store path to realize (fetching from whatever cache `spindle` has configured)
instead of rebuilding it, so the second run is quick.

The second is **non-NixOS images**, which today just means Alpine, but can be
anything. You don't get the workflow-level NixOS config here (there's no NixOS
to configure), but if Nix happens to exist inside the image, like it does in our
Alpine one, it can still talk to the spindle Nix cache just fine.

## An example NixOS workflow

If you've used spindle before, this will look familiar: it's the same manifest
you already know, just with a few extra keys that the NixOS image understands.
Here's a workflow that needs Postgres to test against and Docker to build an
image:

```yaml
# .tangled/workflows/test.yaml
engine: microvm

when:
  - event: ["push", "pull_request"]
    branch: ["master"]

image: nixos

dependencies:
  - go
  - github:nixos/nixpkgs#hello

registry:
  nixpkgs: github:nixos/nixpkgs/nixos-unstable

caches:
  https://nix-community.cachix.org: "nix-community.cachix.org-1:mB9FSh9qf2dCimDSUo8Zy7bkq5CX+/rkCWyvRCYg3Fs="

services:
  postgresql:
    enable: true
    ensureDatabases: ["spindle-workflow"]
    ensureUsers:
      - name: spindle-workflow
        ensureDBOwnership: true

virtualisation:
  docker: true

steps:
  - name: run tests
    environment:
      PGHOST: /run/postgresql
    command: |
      docker build -t app .
      psql -c "select 1"
      go test ./...
```

The new keys each do one job:

- **`dependencies`** are the packages your steps get to use. They go into a
  `mkShellNoCC` devshell that every step sources before it runs, so you get the
  whole stdenv environment (setup hooks like `pkg-config` wiring up
  `PKG_CONFIG_PATH`, etc.) and not just the bare binaries. That means you can
  use a dependency like `openssl` and compile the `openssl-sys` Rust crate
  without pain! A bare name like `go` is looked up in nixpkgs (same as Nixery),
  but you can also point at any flake with the `flakeref#attr` syntax, so
  `github:nixos/nixpkgs#hello` pulls `hello` straight out of that flake.
- **`registry`** is how you remap the global refs. Here we pin `nixpkgs` to
  `nixos-unstable`, so now the bare `go` above resolves from unstable. You can
  alias your own flakes the same way (`myflake: github:me/x`, then
  `myflake#tool` in `dependencies`).
- **`caches`** is a map of binary cache URL to its trusted public key. They get
  wired into the read proxy (more on that below), so the guest can substitute
  prebuilt paths from them instead of building everything from scratch.

`services` and `virtualisation` are the interesting parts: they're passed
straight through to NixOS, so anything you could write in a NixOS config you can
write here. `services.postgresql.enable` brings Postgres up before any of your
steps run.

Since steps run as the `spindle-workflow` user, naming a database after that
user with `ensureDBOwnership` is the easy path to a working DB -- Postgres peer
auth maps the unix user straight to the matching role, so `psql` connects over
the socket with no password and no extra setup. (This name-matching is a NixOS
requirement for `ensureDBOwnership`; if you want a differently named DB, you'd
grant access yourself.)
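The lookup rules for `dependencies` and `registry` described above can be pictured as a small function. This is only a mental model in Python, not spindle's code (in practice Nix's own flake registry does the remapping), and `resolve_dependency` is a hypothetical name:

```python
def resolve_dependency(dep: str, registry: dict[str, str]) -> tuple[str, str]:
    """Split a `dependencies` entry into (flake reference, attribute).

    Illustrative model only: a bare name is treated as an attribute of
    `nixpkgs`, and `registry` entries rewrite the flake reference.
    """
    if "#" in dep:
        # `flakeref#attr` syntax: everything before the first '#' is the flake.
        ref, attr = dep.split("#", 1)
    else:
        # Bare name such as `go`: looked up in nixpkgs.
        ref, attr = "nixpkgs", dep
    # A registry entry remaps the reference; unknown refs pass through as-is.
    return registry.get(ref, ref), attr


registry = {
    "nixpkgs": "github:nixos/nixpkgs/nixos-unstable",
    "myflake": "github:me/x",
}
for dep in ["go", "github:nixos/nixpkgs#hello", "myflake#tool"]:
    print(dep, "->", resolve_dependency(dep, registry))
```

With the registry from the example workflow, the bare `go` resolves against `nixos-unstable`, while the fully spelled-out `github:nixos/nixpkgs#hello` is left untouched because it does not go through an alias.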

`virtualisation.docker: true` is shorthand for `virtualisation.docker.enable =
true`, which gets you a real Docker daemon inside the VM. By the time your first
step runs, Postgres is listening and the Docker socket is there: no sidecar
dance, it's just part of the machine.

(`true` works as shorthand for `.enable = true` anywhere an `enable` option
exists, so most "just turn this on" services are a one-liner!)

## The architecture

![An architecture overview: knot sends pipeline events to spindle, which starts
QEMU and boots the guest VM. The guest runs shuttle and workflow steps. Shuttle
talks to spindle over virtio-vsock. Inside spindle, the scheduler, image store,
Nix cache proxies, and DNS proxy support the microVM engine. Nix cache proxies
use virtio-vsock for reads and uploads. QEMU reaches the internet through
slirp4netns.](https://assets.tangled.network/blog/microvm-arch.png)

### Nix cache, both ways

Spindle talks to its Nix cache through two proxies that run on the host, so the
guest never needs credentials or direct network access to reach it. Like the
agent, they use vsock to talk to spindle.

The read proxy fans out to the configured substituters plus any caches you
listed in your workflow, so when the guest needs to realize a store path it asks
the proxy and the proxy fetches it. The request is sent concurrently to all the
read caches, and whichever answers first wins.

The upload proxy goes the other way: any path built inside the guest gets pushed
back out to spindle's Nix cache (if one is configured), so the next workflow
that needs it doesn't have to build it again. Paths that already exist on any of
the configured read caches aren't uploaded. As the agent reports built paths,
they're queued and uploaded in the background while the rest of the workflow
keeps running, so uploads overlap with work instead of blocking it. If any are
still in flight when we reach VM teardown, the workflow waits until everything
has drained.

Spindle can be configured to use `http`, `ssh-ng` or `ssh` URLs as a binary
cache to upload to, so for example, `ssh-ng://localhost` would just upload to
the local Nix store on the machine that the spindle runs on! `ssh-ng` and `ssh`
require Nix to be present in PATH so that the spindle can use `nix copy` to
upload to them, but if you are using a binary cache that supports `http` (for
example, [ncps](https://github.com/kalbasit/ncps)) Nix does not need to be
present.

### Building the images

Image builds are done with Nix. For NixOS we lean on
[microvm.nix](https://github.com/microvm-nix/microvm.nix) and layer our own bits
on top (stripping down kernel modules, configuring users, etc.). For Alpine
there's a smallish Nix definition that fetches the kernel, the initrd and the
kernel modules, sets up an init script that configures the machine on boot,
copies in the dependencies we want (`nix`, `git`, etc.) and compresses the whole
rootfs into a squashfs.

None of this *has* to be Nix, though. As far as spindle is concerned an image is
valid as long as a few things hold: a guest agent (that implements `agentproto`)
is present and gets started on boot, a `spindle-workflow` user exists, and the
work directory is set up at `/workspace`. That can be built however you like.

### Finding an image

Every built image ships a `spec.json` next to its artifacts. The spec is the
whole contract: where the kernel and initrd and read-only store disk live, the
boot args, how much memory and how many vCPUs to give it, the shell to run steps
in, the writable volumes, the network interfaces, and the runner-specific knobs
(machine type, CPU, extra QEMU args). NixOS images also carry a `baseConfigHash`
identifying the base config baked in (this is the hash of
`nixosSystem.config.system.build.toplevel.outPath`).

A workflow picks an image with the `image` key at the top level. The name is
matched literally against what's on disk: we look for a directory called
`<name>` with a `spec.json` in it, then fall back to a flat `<name>.json`. The
nice property here is that resolution depends *only* on the name and what's on
disk, never on the host doing the resolving, so the same workflow resolves to
the same image on every spindle. If an operator keeps multiple arches side by
side they can name them `nixos-x86_64`, `alpine-aarch64` and so on (that suffix
is just part of the name, it's not handled specially). If you want, for example,
`nixos` to work, you can just symlink `nixos` to `nixos-x86_64`.

Right before launch we double-check the referenced files actually exist
and that the host has the tools we need: `mkfs.ext4` for the volumes, the
QEMU binary for the spec's arch, `/dev/kvm` and `/dev/vhost-vsock`, plus
the `ip` / `mount` / `slirp4netns` / `unshare` toolchain if the image
wants networking.

### The life of a workflow

A workflow moves through a handful of stages: it gets parsed and its
image resolved, it waits for a slot, it gets set up, its steps run, and
then everything is torn down.

The waiting bit matters a lot. Each image declares how much memory, how many
vCPUs and how much disk it needs, and a workflow has to acquire a slot from a
resource scheduler before anything boots. The scheduler is work-conserving with
aging and per-user fairness, so one person submitting a hundred jobs won't
starve everyone else, and slots don't sit idle if there's work that fits in the
budget.

Once a slot is acquired, we do the setup. Spindle allocates a random vsock CID
for the guest and registers it with the agent hub. It creates the per-workflow
work directory, starts the two cache proxies (described earlier) and a DNS proxy
that resolves through the host and filters out private/special-use addresses,
then creates the VM: writable volumes become sparse files formatted ext4, the
store disk is attached read-only, and QEMU is started with `-sandbox on`,
`-nodefaults`, no display, no monitor, etc. with serial (on boot) /
`virtio_console` output to a log file and a QMP socket for control.

Then we wait for the machine. We poll QMP until QEMU says the guest is running,
then wait for the agent's handshake to arrive over vsock from the CID we expect.
The agent tells us its protocol and versions, and spindle sends back the job id,
the trusted cache public keys, and the cache and DNS proxy ports. From there
steps run one at a time as `$shell -lc <command>`, as the unprivileged workflow
user in `/workspace/repo`, with the right environment and any unlocked secrets.
If the workflow activates a NixOS config and we've already built that exact base
plus config, the activation step can realize a cached toplevel store path instead
of rebuilding. Either way, whether it's building the config fresh or pulling a
cached toplevel down, that output streams straight into the activation step's log
as it happens, so you can watch the closure come in instead of staring at a blank
screen wondering if anything's happening.

Timeouts are cooperative: we work out a deadline from the workflow timeout
and send it to the guest, with a little grace on our side so the guest
gets a chance to report the timeout itself rather than us just yanking the
machine out from under it. And if the VM crashes mid-step we tail the
serial and QEMU logs into the step's stderr, because "guest agent
connection lost: EOF" is a genuinely useless thing to read at 2am...
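The cooperative timeout described above boils down to two deadlines: the one handed to the guest, and a slightly later one the host enforces itself. Here is a rough Python sketch of that idea; the 5-second grace period and every name in it are assumptions made for illustration, not values or identifiers taken from spindle:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deadlines:
    guest_s: float  # budget sent to the guest agent, seconds from workflow start
    host_s: float   # point at which the host stops waiting, seconds from start


def plan_deadlines(workflow_timeout_s: float, grace_s: float = 5.0) -> Deadlines:
    """Derive both deadlines from the workflow timeout (grace_s is hypothetical)."""
    if workflow_timeout_s <= 0 or grace_s < 0:
        raise ValueError("timeout must be positive and grace non-negative")
    return Deadlines(guest_s=workflow_timeout_s,
                     host_s=workflow_timeout_s + grace_s)


def outcome(elapsed_s: float, guest_reported_timeout: bool, d: Deadlines) -> str:
    """Decide what the host should conclude at a given moment."""
    if guest_reported_timeout:
        # Preferred path: the guest noticed the deadline and said so itself.
        return "timeout (reported by guest)"
    if elapsed_s >= d.host_s:
        # Fallback: the grace period ran out, so the host enforces the timeout.
        return "timeout (enforced by host)"
    return "running"


d = plan_deadlines(600)
print(outcome(602, False, d))  # still inside the host's grace window
```

The point of the gap between `guest_s` and `host_s` is that the guest almost always gets to report the timeout first, so the host-side enforcement only kicks in when the guest is unresponsive.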

Teardown is the same whether the workflow passed, failed or timed out:
drain any pending Nix cache uploads, ask the agent to power off, wait for
QEMU to exit (falling back to a QMP `system_powerdown`, and finally a
kill if it's being stubborn), then close the proxies and remove the work
directory.

### Locking down the network

A VM that can reach the host's local network is a VM that can reach things it
has no business reaching. So QEMU doesn't run in the host's network namespace at
all. We `unshare` into fresh user, net and mount namespaces first. Inside those
namespaces a small wrapper bind-mounts a resolv.conf pointing at `127.0.0.1` so
that QEMU's built-in slirp DNS isn't used, then installs blackhole routes for
every special-use IP range (RFC 6890, so private networks, link-local, loopback,
etc.) before it execs QEMU. `slirp4netns` then provides the namespace's outbound
internet connection, with `--disable-host-loopback`, sandbox and seccomp all on.
QEMU runs *inside* that namespace, and the guest's network card is attached to
QEMU's own built-in user-mode networking. So every packet from the guest takes
two hops: guest → QEMU's slirp → the namespace's `slirp4netns` → the internet.
The guest never sees the host's network and the host's network never sees the
guest. All of this is done without needing any privileges!

Guest DNS doesn't use either slirp layer. The guest's `/etc/resolv.conf` points
at shuttle on `127.0.0.1:53`, and shuttle forwards DNS packets over vsock to
the host-side DNS proxy. That proxy resolves through the host's real resolver
and strips any answers that point at private or special-use addresses, so guest
traffic can only ever reach the outside world, never the host or anything on its
local networks.
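The answer filtering that the DNS proxy performs can be approximated in a few lines. This is a Python sketch of the idea only, not the proxy's actual logic: `filter_answers` is a hypothetical name, and it relies on the standard library's `ipaddress` module, whose `is_global` property follows the IANA special-purpose registries rather than spindle's own list of ranges.

```python
import ipaddress


def filter_answers(addresses: list[str]) -> list[str]:
    """Keep only resolved addresses that are publicly routable.

    Sketch under assumptions: `is_global` rejects private, loopback,
    link-local and other special-purpose ranges; multicast is rejected
    separately because `is_global` does not exclude all of it.
    """
    kept = []
    for raw in addresses:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            continue  # not an IP address at all, drop it
        if ip.is_global and not ip.is_multicast:
            kept.append(raw)
    return kept


answers = ["93.184.216.34", "10.0.0.5", "127.0.0.1",
           "169.254.169.254", "::1", "2606:4700:4700::1111"]
print(filter_answers(answers))
```

Dropping the answer at the DNS layer complements the blackhole routes: even if a public hostname is pointed at an internal address, the guest never learns that address in the first place.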

### Budgets and cgroups

On its own, the scheduler's budget is just bookkeeping: it tracks what it has
handed out, and the runner (QEMU) ensures that a workflow only gets that much.
Optionally, though, the whole thing (QEMU and slirp4netns both) gets placed in a
per-workflow cgroup with limits on memory, swap, etc., which is an extra
enforcement layer on top, since QEMU and slirp4netns themselves also use
resources. A nice side effect is that when the cgroup OOM-kills the VM we can
see that it was an OOM and report it as such, instead of surfacing it as a
generic crash and leaving you guessing.

The spindle itself also gets a cgroup with `memory.min` set, which means that in
a host OOM situation, it should be the workflows that die first, not the spindle
itself.

## On the roadmap

A few things that are coming next:

- [firecracker](https://github.com/firecracker-microvm/firecracker) runner
  support. QEMU microVMs are good and all, but firecracker VMs are more
  efficient to run concurrently and are leaner overall.
- ssh-on-fail: when a workflow fails, you should be able to ssh in to debug why.
  This can be really useful in situations where you need just *a little* bit
  more info when something unexpected fails, so you don't sit around running
  the workflow 10 times over.

Feel free to come and ask any questions you might have on
<https://chat.tangled.sh>!
