blog: remove microvm post · tangled.org/core@8318031

-277

1 changed file

blog/posts/spindle-microvm.md

---
atroot: true
template:
slug: spindle-microvm
title: How the microVM engine comes together
subtitle: spindle has a microVM engine now!
date: 2026-06-16
image: https://assets.tangled.network/blog/seed.png
authors:
  - name: dawn
    email: dawn@tangled.org
    handle: ptr.pet
---

Since launching, [spindle](/ci) has run your CI inside Docker containers created
with nixery. That's been mostly okay if you are doing simple things, but if you
wanted to do anything more outside the box (maybe you wanted some services, or
to build & test containers inside), or if you wanted to use Nix inside it (which
is rough :P), it wouldn't meet your needs. That changes today!

spindle gains a microVM engine. Each workflow gets its own little virtual
machine. You get a full environment inside your workflows that you can do
whatever you want with, without any of the roughness of nixery containers.
Alongside this, you also get the ability to configure *services* that a workflow
will have (on the NixOS image), so that means you can easily have postgres,
Docker, and so on that will be alive through the workflow.

## what's in a microVM

A microVM is just a VM with most of the boring parts removed. There's no BIOS,
no PCI bus to probe, no emulated graphics card, none of the slow legacy stuff a
normal QEMU machine drags along. You get virtio devices and not much
else, which means it boots very quickly and uses very little memory. Right now
QEMU is the only runner we support, but the engine is written so that other
runners (firecracker for example) can slot in later.

Inside the guest there's a small piece of software we call the agent.
Spindle never SSHes in or runs commands "from the outside"; instead the agent
dials back to spindle over vsock the moment it boots, says hello, and from then
on every step of your workflow is sent to it as a message. The agent runs the
command as an unprivileged user, streams stdout and stderr back, and reports the
exit code.
The host side of this lives in
[`spindle`](https://tangled.org/tangled.org/core/tree/master/spindle/engines/microvm/agent.go)
and the guest side is a little Rust binary called
[`shuttle`](https://tangled.org/tangled.org/core/tree/master/shuttle).
(`shuttle` implements
[`agentproto`](https://tangled.org/tangled.org/core/tree/master/spindle/) which
is the protocol used by `spindle`. Technically speaking anyone could implement
this and, assuming side effects hold, you could have your own agent!)

## two kinds of images

There are two "flavours" of image you can boot, and they're aimed at fairly
different people.

The first is **NixOS images**. These are the interesting ones: because the whole
guest is built with Nix, you can configure it from your workflow file directly.
Things like `dependencies`, `services`, `virtualisation` (e.g. Docker),
`registry` and `caches` are all written right there in the YAML, and the guest
agent builds and activates that config before any of your steps run. If we've
built that exact base plus config before, spindle can just hand the guest a
store path to realize (fetching from whatever cache `spindle` has configured)
instead of rebuilding it, so the second run is quick.

The second is **non-NixOS images**, which today just means Alpine, but can be
anything. You don't get the workflow-level NixOS config here (there's no NixOS
to configure), but if Nix happens to exist inside the image, like it does in our
Alpine one, it can still talk to the spindle Nix cache just fine.
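
To make the "exact base plus config" idea for NixOS images a bit more concrete, here's a minimal Python sketch of that kind of lookup. Everything in it is an assumption for illustration: `config_key`, `ConfigCache`, the choice of SHA-256 over canonical JSON and the example store path are all made up, and the real spindle may key things quite differently.

```python
import hashlib
import json
from typing import Dict, Optional


def config_key(base_id: str, workflow_config: dict) -> str:
    """Derive one key from the base image identifier plus the workflow config."""
    # Canonical JSON (sorted keys, fixed separators) so that two configs that
    # differ only in key order produce the same key.
    canonical = json.dumps(workflow_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{base_id}\n{canonical}".encode("utf-8")).hexdigest()


class ConfigCache:
    """Hypothetical in-memory map from (base, config) to a built store path."""

    def __init__(self) -> None:
        self._store_paths: Dict[str, str] = {}

    def remember(self, base_id: str, workflow_config: dict, store_path: str) -> None:
        self._store_paths[config_key(base_id, workflow_config)] = store_path

    def lookup(self, base_id: str, workflow_config: dict) -> Optional[str]:
        # A hit means the guest can realize this path instead of rebuilding.
        return self._store_paths.get(config_key(base_id, workflow_config))
```

In this sketch a miss on either part (a different base image, or a changed `services` block) simply returns `None`, which is the case where the config would have to be built again.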

### example nixos workflow

If you've used spindle before this will look familiar, it's the same manifest you
already know, just with a few extra keys that the NixOS image understands. Here's
a workflow that needs postgres to test against and Docker to build an image:

```yaml
# .tangled/workflows/test.yaml
engine: microvm

when:
  - event: ["push", "pull_request"]
    branch: ["master"]

image: nixos

dependencies:
  - go
  - github:nixos/nixpkgs#hello

registry:
  nixpkgs: github:nixos/nixpkgs/nixos-unstable

caches:
  https://nix-community.cachix.org: "nix-community.cachix.org-1:mB9FSh9qf2dCimDSUo8Zy7bkq5CX+/rkCWyvRCYg3Fs="

services:
  postgresql:
    enable: true
    ensureDatabases: ["spindle-workflow"]
    ensureUsers:
      - name: spindle-workflow
        ensureDBOwnership: true

virtualisation:
  docker: true

steps:
  - name: run tests
    environment:
      PGHOST: /run/postgresql
    command: |
      docker build -t app .
      psql -c "select 1"
      go test ./...
```

`dependencies` are packages that are added to `environment.systemPackages` (so,
`PATH`). A bare name like `go` is looked up in nixpkgs (same as regular
spindle), but you can also point at any flake with the `flakeref#attr` syntax,
so `github:nixos/nixpkgs#hello` pulls `hello` straight out of that flake.
`registry` is how you remap the global refs: here we pin `nixpkgs` to
`nixos-unstable`, so now the bare `go` above resolves from unstable. You can
alias your own flakes the same way (`myflake: github:me/x`, then `myflake#tool`
in `dependencies`).
`caches` is a map of binary cache URL to its trusted public
key, and they get wired into the read proxy (more on that later), so the guest
can substitute prebuilt paths from them instead of building everything from
scratch.

`services` and `virtualisation` are the interesting parts: they're passed
straight through to NixOS, so anything you could write in a NixOS config you can
write here. `services.postgresql.enable` brings postgres up before any of your
steps run. Since steps run as the `spindle-workflow` user, naming a database
after that user with `ensureDBOwnership` is the easy path to a working db -
postgres peer auth maps the unix user straight to the matching role, so `psql`
connects over the socket with no password and no extra setup (this name-matching
is a NixOS requirement for `ensureDBOwnership`, if you want a differently named
db you'd grant access yourself). `virtualisation.docker: true` is shorthand for
`virtualisation.docker.enable = true`, which gets you a real Docker daemon
inside the VM. By the time your first step runs, postgres is listening and the
Docker socket is there, no sidecar dance, it's just part of the machine.

(`true` works as shorthand for `.enable = true` anywhere an `enable` option
exists, so most "just turn this on" services are a one-liner!)

## building the images

Image builds are done with Nix. For NixOS we lean on
[microvm.nix](https://github.com/microvm-nix/microvm.nix) and layer our own bits
on top (stripping down kernel modules, configuring users, etc.). For Alpine
there's a smallish Nix definition that fetches the kernel, the initrd and the
kernel modules, sets up an init script that configures the machine on boot,
copies in the dependencies we want (`nix`, `git`, etc.) and compresses the whole
rootfs into a squashfs.

None of this *has* to be Nix, though. As far as spindle is concerned an image is
valid as long as a few things hold: a guest agent (that implements `agentproto`)
is present and gets started on boot, a `spindle-workflow` user exists, and the
work directory is set up at `/workspace`. That can be built however you like.

## finding an image

Every built image ships a `spec.json` next to its artifacts. The spec is the
whole contract: where the kernel and initrd and read-only store disk live, the
boot args, how much memory and how many vCPUs to give it, the shell to run steps
in, the writable volumes, the network interfaces, and the runner-specific knobs
(machine type, CPU, extra QEMU args). NixOS images also carry a `baseConfigHash`
identifying the base config baked in (this is the hash of
`nixosSystem.config.system.build.toplevel.outPath`).

A workflow picks an image with the `image` key at the top level. The name is
matched literally against what's on disk, we look for a directory called
`<name>` with a `spec.json` in it, then fall back to a flat `<name>.json`. The
nice property here is that resolution depends *only* on the name and what's on
disk, never on the host doing the resolving, so the same workflow resolves to
the same image on every spindle. If an operator keeps multiple arches side by
side they can name them `nixos-x86_64`, `alpine-aarch64` and so on (that suffix
is just part of the name, it's not handled specially). If you want, for example,
`nixos` to work, you can just symlink `nixos` to `nixos-x86_64`.
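
The lookup order above is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as described, not spindle's actual code: `resolve_image` and its `images_dir` argument are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def resolve_image(images_dir: Path, name: str) -> Optional[Path]:
    """Return the spec file for `name`, or None if nothing on disk matches."""
    # First choice: a directory called <name> with a spec.json inside it.
    nested = images_dir / name / "spec.json"
    if nested.is_file():
        return nested
    # Fallback: a flat <name>.json sitting directly in the images directory.
    flat = images_dir / f"{name}.json"
    if flat.is_file():
        return flat
    return None
```

Because `Path.is_file()` follows symlinks, a `nixos` symlink pointing at `nixos-x86_64` resolves here without any special casing, and nothing about the host's architecture enters into the decision.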

Right before launch we double-check the referenced files actually exist
and that the host has the tools we need: `mkfs.ext4` for the volumes, the
QEMU binary for the spec's arch, `/dev/kvm` and `/dev/vhost-vsock`, plus
the `ip` / `mount` / `slirp4netns` / `unshare` toolchain if the image
wants networking.

## the life of a workflow

A workflow moves through a handful of stages: it gets parsed and its
image resolved, it waits for a slot, it gets set up, its steps run, and
then everything is torn down.

The waiting bit matters a lot. Each image declares how much memory, how many
vCPUs and how much disk it needs, and a workflow has to acquire a slot from a
resource scheduler before anything boots. The scheduler is work-conserving with
aging and per-user fairness, so one person submitting a hundred jobs won't
starve everyone else, and slots don't sit idle if there's work that fits in the
budget.

Once a slot is acquired, we do the setup. Spindle allocates a random vsock CID
for the guest and registers it with the agent hub. It creates the per-workflow
work directory, starts the two cache proxies (more on those later), then creates
the VM: writable volumes become sparse files formatted ext4, the store disk is
attached read-only, and QEMU is started with `-sandbox on`, `-nodefaults`, no
display, no monitor, etc. with serial / `virtio_console` output to a log file
and a QMP socket for control.

Then we wait for the machine. We poll QMP until QEMU says the guest is running,
then wait for the agent's handshake to arrive over vsock from the CID we expect.
The agent tells us its protocol and versions, and spindle sends back the job id,
the trusted cache public keys and the cache proxy ports, NixOS config if already
cached...
From there steps run one at a time as `$shell -lc <command>`, as the
unprivileged workflow user in `/workspace/repo`, with the right environment and
any unlocked secrets.

Timeouts are cooperative: we work out a deadline from the workflow timeout
and ship it to the guest, with a little grace on our side so the guest
gets a chance to report the timeout itself rather than us just yanking the
machine out from under it. And if the VM crashes mid-step we tail the
serial and QEMU logs into the step's stderr, because "guest agent
connection lost: EOF" is a genuinely useless thing to read at 2am.

Teardown is the same whether the workflow passed, failed or timed out:
drain any pending Nix cache uploads, ask the agent to power off, wait for
QEMU to exit (falling back to a QMP `system_powerdown`, and finally a
kill if it's being stubborn), then close the proxies and remove the work
directory.

## locking down the network

A VM that can reach the host's local network is a VM that can reach things it
has no business reaching. So QEMU doesn't run in the host's network namespace at
all. We `unshare` into fresh user, net and mount namespaces first. Inside that
namespace a small wrapper bind-mounts a resolv.conf pointing at the slirp DNS
and installs blackhole routes for every special-use IP range (RFC 6890, so
private networks, link-local, loopback, etc.) before it execs QEMU.
`slirp4netns` then provides outbound connectivity for that namespace, with
`--disable-host-loopback`, sandbox and seccomp all on. The guest itself sits
behind a *second* layer of QEMU user-mode networking inside that namespace. All
of this is done without needing any privileges!
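
For a feel of what "blackhole routes for every special-use range" covers, here's a small Python sketch. It is an illustration under assumptions, not the wrapper's code: the range list is only a subset of the IPv4 entries in RFC 6890, and `blackhole_commands` and `in_special_use_range` are hypothetical helpers. The `ip route add blackhole <range>` form is standard iproute2 syntax, but whether the wrapper shells out to `ip` in exactly this way is a guess.

```python
import ipaddress
from typing import List

# Illustrative subset of the IPv4 special-purpose ranges listed in RFC 6890.
SPECIAL_USE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",      # private use
        "100.64.0.0/10",   # shared address space
        "127.0.0.0/8",     # loopback
        "169.254.0.0/16",  # link-local
        "172.16.0.0/12",   # private use
        "192.168.0.0/16",  # private use
    )
]


def blackhole_commands() -> List[List[str]]:
    """Build one `ip route add blackhole <range>` argv per listed range."""
    return [["ip", "route", "add", "blackhole", str(net)] for net in SPECIAL_USE_RANGES]


def in_special_use_range(address: str) -> bool:
    """True if the address falls inside one of the listed ranges.

    This is only a membership check, not a routing decision: a more specific
    route inside the namespace would still take precedence over a blackhole.
    """
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in SPECIAL_USE_RANGES)
```

With these ranges, a cloud metadata address like `169.254.169.254` or a LAN address like `192.168.1.10` is covered, while a public address such as `1.1.1.1` is not.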

## budgets and cgroups

The scheduler's budget is just bookkeeping on its own: it tracks what it's
handed out, and the runner (QEMU) will ensure that a workflow only gets that
much. But optionally the whole thing (QEMU and slirp4netns both) gets placed in
a per-workflow cgroup with memory, swap etc. limits, which is an extra
enforcement layer on top. A nice side effect is when the cgroup OOM-kills the VM
we can see that it was an OOM and report it as such, instead of surfacing it as
a generic crash and leaving you guessing.

The spindle itself also gets a cgroup, which means that in a host OOM situation,
it should be the workflows that die first, not the spindle itself.

## the nix cache, both ways

The two proxies I mentioned during setup are how the guest talks to spindle's
Nix cache, and they run on the host so the guest never needs credentials or
direct network access to do it. Like the agent, they also use vsock to
communicate with the spindle.

The read proxy fans out to the configured substituters plus any caches you
listed in your workflow, so when the guest needs to realize a store path it asks
the proxy and the proxy fetches it. The request is sent concurrently to the read
caches, so the one that answers it first wins.

The upload proxy goes the other way: any path built inside the guest gets pushed
back out to spindle's Nix cache (if one is configured), so the next workflow
that needs it doesn't have to build it again. Any paths that already exist on
any of the configured read caches won't be uploaded. Built paths are queued by
the agent and are immediately uploaded. If any paths are still left when we
reach VM teardown, the workflow will wait until everything is uploaded.
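
The read proxy's "ask everyone, first answer wins" behaviour can be sketched roughly like this in Python. It's a toy model under assumptions: a "cache" is any callable returning bytes or `None`, "first" is taken to mean the first cache that actually has the path, and `fetch_first` plus the three stand-in caches are hypothetical names, not anything from spindle.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence

# A "cache" here is any callable that returns the path's bytes, or None on a miss.
Cache = Callable[[str], Optional[bytes]]


def fetch_first(caches: Sequence[Cache], store_path: str) -> Optional[bytes]:
    """Ask every cache at once and return the first successful answer."""
    if not caches:
        return None
    with ThreadPoolExecutor(max_workers=len(caches)) as pool:
        futures = [pool.submit(cache, store_path) for cache in caches]
        for future in as_completed(futures):
            try:
                answer = future.result()
            except Exception:
                continue  # treat a failing cache like a miss
            if answer is not None:
                # Leaving the `with` block still waits for the slower requests
                # to finish; a real proxy would want to cancel them instead.
                return answer
    return None


# Stand-in caches for the demo below.
def slow_cache(path: str) -> Optional[bytes]:
    time.sleep(0.2)
    return b"from the slow cache"


def fast_cache(path: str) -> Optional[bytes]:
    return b"from the fast cache"


def empty_cache(path: str) -> Optional[bytes]:
    return None


if __name__ == "__main__":
    print(fetch_first([slow_cache, empty_cache, fast_cache], "/nix/store/example"))
```

The order of the list doesn't matter in this sketch, only who answers first, so adding a nearby cache to `caches` in a workflow can only make a fetch faster, never slower (ignoring the straggler wait noted in the comment).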

## in the future

todo

Feel free to come and ask any questions you might have on https://chat.tangled.sh!
