Add cloud-hypervisor VMM backend for Linux hosts #782
apple/containerization currently runs containers in per-container VMs on macOS hosts via Virtualization.framework. This adds a second VMM backend so the same Swift orchestration layer (LinuxContainer / LinuxPod / Vminitd gRPC contract) runs on Linux hosts via cloud-hypervisor + KVM.

**CloudHypervisor Swift package** (`Sources/CloudHypervisor/`): a thin client for cloud-hypervisor's REST-over-UDS API, layered on AsyncHTTPClient. Endpoints cover VMM / VM lifecycle / hotplug (disk, fs, net, vsock, remove-device). Cross-platform (compiles on macOS for unit tests; consumed at runtime only by the Linux side of Containerization).

**CH backend in Containerization**: one cloud-hypervisor subprocess per VM, gated behind `#if os(Linux)`. CHVirtualMachineManager / CHVirtualMachineInstance mirror the VZ shape behind the existing VirtualMachineManager / VirtualMachineInstance protocol. CHProcess and VirtiofsdProcess manage the binaries; CHHotplugProvider handles virtio-blk and virtio-fs runtime hotplug (with one virtiofsd per unique source-hash tag, refcounted across containers).

**Linux host networking**: BridgeManager brings up a Linux bridge with an IPv4 subnet and (opt-in via `--enable-nat`) iptables MASQUERADE + scoped FORWARD rules. LinuxBridgedNetwork enslaves a fresh TAP per container to the bridge. State is recorded under `/run/containerization` so `cctl bridge delete` reverses exactly what create did. Bridge teardown verifies the link kind via sysfs to refuse deleting non-bridge interfaces.

**cctl run / bridge**: end-to-end Linux container run path (image pull, ext4 rootfs assembly, VM boot, container exec) plus `cctl bridge create|delete` for the host network plumbing.

**Build & dist**: `make linux-build` / `make linux-integration` build and exercise the host side inside an apple/container `--virtualization` dev container. `make dist-x86_64` produces a deployment tarball (cctl + cloud-hypervisor + virtiofsd + initfs + kernel) cross-compiled from the aarch64 dev container; pipeline documented in `docs/x86_64-build.md`. Static-musl C deps and the Zig cross compiler are pinned by SHA256.

The host orchestrator runs as root. Per-VM runtime state lives under `/run/containerization/ch/<UUID>` with mode 0700; UDS sockets inside are bound with mode 0600. Vminitd's gRPC channel inherits that trust boundary; socket-file perms are the auth.

Sandbox flags are upstream-secure by default. Two per-component opt-outs exist for the apple/container dev-container case (where the host seccomp profile SIGSYS-kills CH and virtiofsd):

- `CONTAINERIZATION_NO_CH_SECCOMP=1`: `cloud-hypervisor --seccomp false`.
- `CONTAINERIZATION_NO_VIRTIOFSD_SANDBOX=1`: `virtiofsd --sandbox none`.

Each emits a one-shot `logger.warning` at process start. Legacy alias `CONTAINERIZATION_RELAXED_SANDBOX=1` flips both. cctl spawns both binaries with `setsid` and a minimal env allowlist (PATH / HOME / RUST_LOG / RUST_BACKTRACE) so the parent's secrets don't leak to children.

`make linux-integration` runs the cross-platform integration suite against a real cloud-hypervisor VM inside the dev container. Linux runs the cross-platform subset (`process true`/`false`/`echo hi`, virtiofs round-trip, hotplug); the macOS suite is unchanged.

Signed-off-by: michael_crosby <michael_crosby@apple.com>
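
The sandbox opt-outs and the child environment allowlist described above could be resolved roughly as follows. This is a minimal sketch, not the PR's code: `SandboxOptions`, `sandboxOptions(from:)`, and `childEnvironment(from:)` are hypothetical names, and only the environment variable names and the allowlisted keys come from the description.

```swift
// Hypothetical sketch; type and function names are illustrative, not the PR's API.
struct SandboxOptions: Equatable {
    var chSeccomp = true          // false would map to `cloud-hypervisor --seccomp false`
    var virtiofsdSandbox = true   // false would map to `virtiofsd --sandbox none`
}

/// Resolves the per-component opt-outs; the legacy alias relaxes both.
func sandboxOptions(from env: [String: String]) -> SandboxOptions {
    let relaxed = env["CONTAINERIZATION_RELAXED_SANDBOX"] == "1"
    return SandboxOptions(
        chSeccomp: !(relaxed || env["CONTAINERIZATION_NO_CH_SECCOMP"] == "1"),
        virtiofsdSandbox: !(relaxed || env["CONTAINERIZATION_NO_VIRTIOFSD_SANDBOX"] == "1")
    )
}

/// Keeps only the allowlisted keys so parent secrets are not passed to children.
func childEnvironment(from parent: [String: String]) -> [String: String] {
    let allowlist: Set<String> = ["PATH", "HOME", "RUST_LOG", "RUST_BACKTRACE"]
    return parent.filter { allowlist.contains($0.key) }
}

let parentEnv = [
    "PATH": "/usr/bin",
    "EXAMPLE_SECRET": "not-forwarded",            // hypothetical secret in the parent
    "CONTAINERIZATION_NO_CH_SECCOMP": "1",
]
let options = sandboxOptions(from: parentEnv)
let childEnv = childEnvironment(from: parentEnv)
```

Keeping both functions pure (dictionary in, value out) makes the secure-by-default behaviour easy to unit test without spawning either binary.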
jglogan
left a comment
62 files reviewed, 21 to go...

    #if arch(arm64)
    let kernelPlatform = SystemPlatform.linuxArm
    #else
    let kernelPlatform = SystemPlatform.linuxAmd

I think arch can be x86_64, arm64, i386, arm. It would be safer to explicitly test x86_64 and error out in the #else branch.
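
One way to apply this suggestion is sketched below, assuming only arm64 and x86_64 hosts are meant to be supported. The `SystemPlatform` enum here is a hypothetical stand-in so the snippet compiles on its own; the real type lives in the project.

```swift
// Hypothetical stand-in for the project's SystemPlatform type.
enum SystemPlatform {
    case linuxArm
    case linuxAmd
}

#if arch(arm64)
let kernelPlatform = SystemPlatform.linuxArm
#elseif arch(x86_64)
let kernelPlatform = SystemPlatform.linuxAmd
#else
// Fail the build instead of silently picking the amd64 kernel.
#error("Unsupported host architecture: expected arm64 or x86_64")
#endif
```

With `#error`, an unsupported architecture fails at compile time rather than selecting a mismatched kernel at runtime.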

    var rootfsMount: Containerization.Mount
    do {
        let unpacker = EXT4Unpacker(blockSizeInBytes: fsSizeInMB.mib())

Not the fault of this PR, and we can do this in a follow-on since I think it would be a breaking change, but this parameter looks misnamed.
This gets routed to minDiskSize in EXT4Formatter. We should call this minDiskSizeInBytes, capacityBytes, or something similar.
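
If the rename happens, a deprecated forwarding initializer could soften the break. The sketch below uses a hypothetical `EXT4UnpackerSketch` type and assumes the size is a `UInt64`; the real `EXT4Unpacker` signature may differ.

```swift
// Hypothetical stand-in type; not the real EXT4Unpacker.
struct EXT4UnpackerSketch {
    let minDiskSizeInBytes: UInt64

    /// New, accurately named initializer.
    init(minDiskSizeInBytes: UInt64) {
        self.minDiskSizeInBytes = minDiskSizeInBytes
    }

    /// Old label kept temporarily so existing callers still compile, with a warning.
    @available(*, deprecated, renamed: "init(minDiskSizeInBytes:)")
    init(blockSizeInBytes: UInt64) {
        self.init(minDiskSizeInBytes: blockSizeInBytes)
    }
}

let renamed = EXT4UnpackerSketch(minDiskSizeInBytes: 512 * 1024 * 1024)
```

Callers using the old label would get a deprecation warning with a fix-it, and the old initializer can be removed in a later release.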

    var cpus: Int = 2

    @Option(name: [.customLong("memory"), .customShort("m")], help: "Amount of memory in MiB")
    var memory: UInt64 = 1024

rename to memoryInMB for clarity
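
A sketch of what a unit-carrying name plus a checked conversion might look like. The help text says MiB and the code multiplies by 1024 * 1024, so the sketch uses `memoryInMiB`; that name, the helper, and the error type are all hypothetical.

```swift
// Hypothetical helper and error type; not part of cctl.
struct MemorySizeOverflow: Error {
    let mebibytes: UInt64
}

/// Converts a size in MiB to bytes, throwing instead of trapping on overflow.
func bytes(fromMiB mebibytes: UInt64) throws -> UInt64 {
    let (result, overflow) = mebibytes.multipliedReportingOverflow(by: 1024 * 1024)
    guard !overflow else { throw MemorySizeOverflow(mebibytes: mebibytes) }
    return result
}

let memoryInMiB: UInt64 = 1024                        // mirrors the option's default
let memoryBytes = try bytes(fromMiB: memoryInMiB)     // 1_073_741_824
```

A plain `memory * 1024 * 1024` on `UInt64` traps at runtime if a very large value is passed on the command line; the checked variant turns that into a reportable error.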

    }

    let cpusCount = cpus
    let memoryBytes = memory * 1024 * 1024


    import Foundation
    import NIOHTTP1

    extension CloudHypervisor {

Perhaps name the file `CloudHypervisor+Error.swift`?

    uri: String,
    body: Data?,
    headers: HTTPHeaders = [:]
    ) async throws -> HTTPResponse {

Just a note: should we ever use swift-openapi-generator to create the client from the CH OpenAPI spec, this logic can move into swift-openapi-runtime middleware.

    }

    // 3. Restore ip_forward only if create()-with-NAT set it from 0.
    if state?.natEnabled == true, state?.prevIpForward == "0" {

Does this properly track all cases where multiple bridge networks exist with varying nat-enabled states? Is it possible to inadvertently set ip_forward=0 even if there still exist nat-enabled bridges?
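
To make the concern concrete, here is a minimal sketch of host-wide bookkeeping that restores `ip_forward` only when the last NAT-enabled bridge goes away. `ForwardingLedger` and its methods are hypothetical and say nothing about how the PR currently stores state; a real version would need to persist this record (for example alongside the per-bridge state) and guard it against concurrent `cctl` invocations.

```swift
// Hypothetical host-wide record of NAT bridges; not the PR's implementation.
struct ForwardingLedger {
    var originalIpForward: String? = nil   // value seen before the first NAT bridge
    var natBridges: Set<String> = []

    /// Records a NAT bridge; remembers the pre-existing value only for the first one.
    mutating func register(bridge: String, currentIpForward: String) {
        if natBridges.isEmpty { originalIpForward = currentIpForward }
        natBridges.insert(bridge)
    }

    /// Removes a bridge and returns the value to write back, or nil to leave it alone.
    mutating func unregister(bridge: String) -> String? {
        guard natBridges.remove(bridge) != nil, natBridges.isEmpty else { return nil }
        let original = originalIpForward
        originalIpForward = nil
        return original == "0" ? "0" : nil
    }
}

var ledger = ForwardingLedger()
ledger.register(bridge: "br0", currentIpForward: "0")   // first bridge flips 0 -> 1
ledger.register(bridge: "br1", currentIpForward: "1")   // second bridge sees 1
let afterFirstDelete = ledger.unregister(bridge: "br0") // br1 still needs forwarding
let afterLastDelete = ledger.unregister(bridge: "br1")  // last NAT bridge gone
```

In this scenario a purely per-bridge `prevIpForward` check would restore `0` when `br0` is deleted, while `br1` still relies on forwarding; and deleting `br0` first and `br1` last would leave forwarding enabled because `br1` recorded `1`. The ledger avoids both by tying the restore to the last remaining NAT bridge.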

    @@ -0,0 +1,91 @@
    # CLAUDE.md

CLAUDE.md or AGENTS.md? I don't see anything Claude-specific.

    Every Linux container runs inside its own VM. The boundary between host (macOS) and guest (Linux) is the central architectural fact:

    - **Host side** (`Sources/`, `macOS` platform): orchestrates VMs through `Virtualization.framework` (`VZVirtualMachineInstance.swift`, `VZVirtualMachine+Helpers.swift`). The user-facing entry points are `LinuxContainer` (one container per VM) and `LinuxPod` (multiple containers in one VM, experimental). These build a `VMConfiguration`, boot the VM with the chosen `Kernel` and a rootfs containing `vminitd`, then drive the guest via gRPC.
    - **Guest side** (`vminitd/`, Linux platform): `vminitd` is PID 1 inside the VM. It exposes a gRPC service over **vsock** (default port `1024`) defined by `Sources/Containerization/SandboxContext/SandboxContext.proto`. `VminitdCore` implements that service: managing containers via runc, handling stdio over vsock, signal/event delivery, cgroups, mounts, and process lifecycle. `vmexec` is a small helper used to launch container processes from inside the guest.


      extension LinuxProcess {
          func setupIO(listeners: [VsockListener?]) async throws -> [FileHandle?] {
    -         let handles = try await Timeout.run(seconds: 3) {
    +         let handles = try await Timeout.run(seconds: 30) {

Was this meant to be a temporary change? What prompted the 10x increase?