# Benchmarking
Capsem includes `capsem-bench`, a Python benchmarking tool that runs inside the VM. It prints Rich tables to stderr for humans and saves structured JSON to `/tmp/capsem-benchmark.json` for machine consumption.
## Running benchmarks

```sh
just bench                          # All benchmarks in VM (~2 min)
just run "capsem-bench disk"        # Disk I/O only
just run "capsem-bench rootfs"      # Rootfs reads only
just run "capsem-bench startup"     # CLI cold-start only
just run "capsem-bench http"        # HTTP through proxy
just run "capsem-bench throughput"  # 100MB download
just run "capsem-bench snapshot"    # Snapshot operations only
just full-test                      # Full validation including benchmarks
```

## Boot timing

Boot timing is measured independently of capsem-bench. The guest init script (`capsem-init`) records the wall-clock duration of each boot stage using `/proc/uptime`. The PTY agent sends these measurements to the host over the vsock control channel, where they are displayed as an inline table with a proportional bar chart.
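The per-stage measurement boils down to reading `/proc/uptime` before and after each step. A minimal Python sketch of that technique (the real `capsem-init` is an init script, and the stage names and result format here are illustrative):

```python
import time

def uptime() -> float:
    """Seconds since boot from /proc/uptime (centisecond resolution);
    falls back to a monotonic clock on non-Linux hosts."""
    try:
        with open("/proc/uptime") as f:
            return float(f.read().split()[0])
    except OSError:
        return time.monotonic()

def timed_stage(name, fn, results):
    """Record the wall-clock duration of one boot stage in milliseconds."""
    start = uptime()
    fn()
    results[name] = round((uptime() - start) * 1000, 1)

results = {}
timed_stage("squashfs", lambda: None, results)  # stand-in for the real mount step
```

Using `/proc/uptime` rather than a wall clock keeps the stage durations consistent even if the guest's clock is adjusted during boot.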
### Measured stages

| Stage | What happens |
|---|---|
| `squashfs` | Mount the compressed read-only rootfs from the virtio block device |
| `virtiofs` | Mount the VirtioFS shared directory from the host |
| `overlayfs` | Create the overlay filesystem (ext4 loopback upper + squashfs lower) |
| `workspace` | Bind-mount /root from the VirtioFS workspace |
| `network` | Configure the dummy0 interface, dnsmasq, and iptables redirect rules |
| `net_proxy` | Start the TCP-to-vsock proxy for HTTPS interception |
| `deploy` | Copy the MCP server, capsem-doctor, capsem-bench, and diagnostics from the initrd |
| `venv` | Create the Python virtualenv (uses uv for speed) |
| `agent_start` | Launch the PTY agent and connect the vsock ports |
### Invariant

The diagnostic suite enforces that total boot time stays under 1 second (`test_environment.py::test_boot_time_under_1s`). Stages exceeding 500ms are flagged as slow. The most common regression is `venv`: if uv is missing from the rootfs, Python falls back to `python3 -m venv`, which is roughly 10x slower.
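The invariant amounts to a simple check over the per-stage timings. A sketch of that logic (the `timings` dict, threshold names, and return shape are assumptions for illustration; the real assertion lives in `test_environment.py`):

```python
SLOW_STAGE_MS = 500    # stages above this are flagged as slow
TOTAL_BUDGET_MS = 1000  # hard ceiling enforced by the diagnostic suite

def check_boot(timings: dict) -> list:
    """Raise if the total budget is blown; return the stages flagged as slow."""
    total = sum(timings.values())
    assert total < TOTAL_BUDGET_MS, f"boot took {total:.0f}ms (budget {TOTAL_BUDGET_MS}ms)"
    return [stage for stage, ms in timings.items() if ms > SLOW_STAGE_MS]

slow = check_boot({"squashfs": 40, "venv": 620, "agent_start": 90})
# only venv exceeds the 500ms threshold
```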
## Benchmark categories

### Disk I/O (disk)

Measures scratch disk performance in /root (the VirtioFS-backed workspace).
| Test | Method | Metric |
|---|---|---|
| Sequential write | Write 256MB in 1MB blocks, fdatasync at end | Throughput (MB/s) |
| Sequential read | Read 256MB in 1MB blocks after drop_caches | Throughput (MB/s) |
| Random 4K write | 10,000 random pwrite calls on 64MB file, fdatasync per write | IOPS, throughput |
| Random 4K read | 10,000 random pread calls on 64MB file after drop_caches | IOPS, throughput |
The write test size is configurable via `CAPSEM_BENCH_SIZE_MB` (default: 256).
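The sequential-write test can be approximated in a few lines. A sketch assuming a `seq_write_bench` helper of this shape (function name and signature are illustrative, not the tool's internal API):

```python
import os
import time

def seq_write_bench(path: str, size_mb: int = 256, block_kb: int = 1024) -> float:
    """Write size_mb of data in block_kb blocks, sync once at the end;
    return throughput in MB/s."""
    block = b"\0" * (block_kb * 1024)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter()
        for _ in range(size_mb * 1024 // block_kb):
            os.write(fd, block)
        # Flush to disk so the timing includes persistence, not just page cache.
        getattr(os, "fdatasync", os.fsync)(fd)  # fdatasync is absent on macOS
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return size_mb / elapsed
```

The random 4K write test differs mainly in issuing `pwrite` at random offsets and syncing after every write, which is why it reports IOPS rather than raw bandwidth.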
### Rootfs reads (rootfs)

Measures read performance on the compressed squashfs rootfs, where binaries and libraries live.
| Test | Method | Metric |
|---|---|---|
| Sequential read | Read the largest file in /usr/bin, /usr/lib, /opt/ai-clis in 1MB blocks | Throughput (MB/s) |
| Random 4K read | 5,000 random pread calls across all rootfs files (>4KB) | IOPS, throughput |
### CLI cold-start (startup)

Measures the wall-clock time to run `<cli> --version` with the page cache dropped between runs. Each command is timed 3 times.
| Command | What it tests |
|---|---|
| `python3 --version` | CPython interpreter startup |
| `node --version` | Node.js runtime startup |
| `claude --version` | Claude Code CLI (Node-based) |
| `gemini --version` | Gemini CLI (Node-based) |
| `codex --version` | Codex CLI (native binary + Node) |
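A sketch of the timing loop (the helper name is illustrative; it uses `sys.executable` instead of a hard-coded `python3` for portability, and it omits the page-cache drop, which requires root access to `/proc/sys/vm/drop_caches`):

```python
import statistics
import subprocess
import sys
import time

def cold_start_ms(cmd, runs=3):
    """Median wall-clock milliseconds to run `cmd` to completion.
    The real benchmark drops the page cache between runs; this sketch
    does not, so later runs benefit from warm caches."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

interp_ms = cold_start_ms([sys.executable, "--version"])
```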
### HTTP (http)

Measures HTTP throughput through the MITM proxy using concurrent GET requests.

- Default: 50 requests to https://www.google.com/ with concurrency 5
- Custom: `capsem-bench http <URL> <N> <C>`
- Reports: successful/failed counts, requests/sec, and latency percentiles (p50, p95, p99, min, max)
Each worker thread uses a persistent `requests.Session`. Latency includes the full round trip: guest -> net-proxy -> vsock -> host MITM proxy -> internet -> response back.
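A stdlib sketch of the request loop and percentile reporting (the real benchmark keeps a persistent `requests.Session` per worker, which this urllib version does not; helper names are illustrative):

```python
import concurrent.futures
import time
import urllib.request

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of latencies."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

def http_bench(url, n=50, concurrency=5):
    """Issue n GETs from a thread pool; summarize latency in milliseconds."""
    def one(_):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                resp.read()
            ok = True
        except OSError:
            ok = False
        return ok, (time.perf_counter() - start) * 1000

    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        results = list(pool.map(one, range(n)))
    lat = [ms for ok, ms in results if ok]
    summary = {"ok": len(lat), "failed": n - len(lat)}
    if lat:
        summary.update(p50=percentile(lat, 50),
                       p95=percentile(lat, 95),
                       p99=percentile(lat, 99))
    return summary
```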
### Proxy throughput (throughput)

Downloads a 100MB file through the MITM proxy and reports end-to-end throughput.
Uses curl to download https://ash-speed.hetzner.com/100MB.bin. This measures the maximum sustained bandwidth the proxy pipeline can deliver, including TLS termination, body inspection, and re-encryption.
### Snapshot operations (snapshot)

End-to-end latency for snapshot operations via the MCP gateway. Tests run at 3 workspace sizes (10, 100, and 500 files of 4KB each):
| Operation | What it does |
|---|---|
| `create` | Populate the workspace, then create a named snapshot via `snapshots create` |
| `list` | List all snapshots with change diffs |
| `changes` | List files changed since the last checkpoint |
| `revert` | Revert a single modified file from the snapshot |
| `delete` | Delete the snapshot |
Each operation is measured as the full round-trip: guest CLI -> MCP server (NDJSON over vsock) -> host gateway -> APFS filesystem operation -> response back to guest.
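The NDJSON framing on the guest side is one JSON object per line in each direction, with the round-trip timed around the exchange. A sketch of that pattern (the request fields are illustrative, and the real transport is a vsock connection rather than the plain socket used here):

```python
import json
import socket
import time

def ndjson_roundtrip(sock: socket.socket, request: dict):
    """Send one newline-delimited JSON request, wait for the one-line
    reply, and return (response_dict, elapsed_ms)."""
    start = time.perf_counter()
    sock.sendall(json.dumps(request).encode() + b"\n")
    buf = b""
    while not buf.endswith(b"\n"):  # a reply is complete at the newline
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before replying")
        buf += chunk
    elapsed_ms = (time.perf_counter() - start) * 1000
    return json.loads(buf), elapsed_ms
```

Timing around the whole exchange, rather than just the filesystem call, is what makes the reported numbers reflect the guest's experienced latency.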
## JSON output

All benchmarks save structured JSON to `/tmp/capsem-benchmark.json` inside the VM:

```json
{
  "version": "0.3.0",
  "timestamp": 1711561234.5,
  "hostname": "capsem",
  "disk": { "seq_write": { "throughput_mbps": 1180, ... }, ... },
  "rootfs": { ... },
  "startup": { "commands": { "python3": { "mean_ms": 9.0 }, ... } },
  "http": { "requests_per_sec": 58, "latency_ms": { "p50": 67, ... } },
  "throughput": { "throughput_mbps": 34.3, ... },
  "snapshot": { "10_files": { "create_ms": 879, ... }, ... }
}
```

## Adding a new benchmark
1. Create a new module in `guest/artifacts/capsem_bench/` (e.g., `mytest.py`) with a `mytest_bench()` function that returns a dict and prints a Rich table to stderr
2. Add the mode name to `VALID_MODES` in `capsem_bench/__main__.py`
3. Wire it into `main()` with the `if mode in ("name", "all"):` pattern (lazy import)
4. Update the `dev-benchmark` skill and this page
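A hypothetical skeleton for such a module, following the conventions described on this page (the metric name and table layout are made up, and the sketch falls back to a plain print so it runs without Rich installed):

```python
import sys

def mytest_bench() -> dict:
    """Run the benchmark, print a human-readable table to stderr,
    and return a JSON-serializable dict of results."""
    results = {"widget_ops_per_sec": 12345.0}  # placeholder measurement
    try:
        from rich.console import Console
        from rich.table import Table
        table = Table(title="mytest")
        table.add_column("Metric")
        table.add_column("Value")
        for key, value in results.items():
            table.add_row(key, f"{value:,.0f}")
        Console(file=sys.stderr).print(table)
    except ImportError:  # keep the sketch runnable without Rich
        print(results, file=sys.stderr)
    return results
```

Keeping the Rich output on stderr and the returned dict separate is what lets the same run serve both the human-readable table and the `/tmp/capsem-benchmark.json` payload.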