
🔥 Firecracker

A senior engineer's design doc for AWS's microVM monitor: what it is, how it works, and the parts the official docs gloss over.

TL;DR

  • Firecracker is a minimalist KVM-based virtual machine monitor written in Rust. It boots a Linux guest in roughly 125 ms with around 5 MiB of memory overhead per microVM by emulating a small set of VirtIO devices and a couple of legacy ones, versus QEMU's forty-plus.
  • It exists to fill the gap between containers and traditional VMs: hardware-grade isolation at container-grade density. It powers AWS Lambda and Fargate, where AWS packs many tenants onto shared hosts.
  • The design is opinionated to the point of austerity: no PCI bus by default, no graphical console, no SCSI, no USB, no live migration, no nested virtualization. Everything not on the multi-tenant serverless critical path is excluded so the auditable attack surface stays small.

Why Firecracker exists

The serverless promise ("you give us a function, we run it in milliseconds, we bill you for the milliseconds") runs into a workload-isolation problem. If two customers' functions share a Linux kernel, then any kernel CVE is a cross-tenant CVE. If they share a network namespace, any iptables bug is a tenant boundary bug. Linux's namespace-and-cgroup container model is a workable isolation story for one organization running its own workloads. It is a stretched story for a public cloud running arbitrary code from arbitrary customers.

The pre-2018 alternative was to run each function in its own KVM/QEMU guest. That gave hardware-grade isolation (the guest sees only what nested page tables expose, and the guest kernel cannot syscall into the host kernel) but at unacceptable cost. A cold QEMU VM takes seconds to boot and carries a hundred-plus megabytes of overhead. At Lambda's density (millions of microVMs across a fleet), neither number is acceptable.

Firecracker is the synthesis. It uses the same hardware mechanism as QEMU, namely KVM, but strips the userspace VMM down to the minimum required for a serverless workload. The result is roughly 125 ms boot, ~5 MiB overhead, and a codebase small enough that AWS can credibly say they have audited it. It is open-source, written in Rust, and has been in production at Lambda since 2018 and Fargate since 2020.

Context

Firecracker started from Google's crosvm (the VMM behind Chrome OS), then helped spin up the rust-vmm community as a place to share common crates between the two projects. The shared lineage is visible in the crate layout: both pull from rust-vmm for KVM bindings, VirtIO queues, etc.

Mental model in 60 seconds

From the bottom up, a running Firecracker microVM looks like this:

  • Host CPU: VT-x / AMD-SVM + EPT / NPT (nested paging)
  • Host Linux kernel, with the KVM module
  • Jailer: chroot · pivot_root · cgroups · uid/gid drop · FD scrub
  • Firecracker process (one per microVM)
      • API thread: REST on Unix socket
      • VMM thread: epoll over device FDs
      • vCPU threads × N: each runs a KVM_RUN ioctl loop
  • VirtIO devices (host-side emulation): block · net · vsock · balloon · rng · pmem · mem, plus legacy serial / i8042
  • Guest Linux kernel: sees only virtio devices + serial + APIC (+ PL031 RTC on aarch64)
  • Guest workload (Lambda fn, Fargate task, …)

A running microVM. Everything below the guest workload is part of Firecracker's responsibility.

A few things worth noting up front:

  • One Firecracker process per microVM. There is no "host daemon" managing many VMs. Each microVM is a separate Linux process. This is the unit of failure isolation.
  • The jailer is a separate binary, not a library. It runs first, sets up the sandbox, then execves Firecracker. Once running, the Firecracker process is what's exposed to the guest; the jailer has already exited.
  • The guest sees a deliberately small machine. MMIO transport by default; an opt-in PCIe transport shipped in v1.13 (--enable-pci). No SCSI. No USB. No GPU. The guest boots via the PVH direct-boot path (when the kernel supports it) or the Linux 64-bit boot protocol; there is no SeaBIOS or OVMF.
  • The control plane is a REST API over a Unix socket. No libvirt, no XML, no command-line flags for VM config. You PUT JSON to /machine-config, /drives/<id>, etc., then PUT an InstanceStart action to /actions.

Process & threading model

A running Firecracker microVM has 2 + N threads, where N is the number of guest vCPUs:

  • API thread. Accepts HTTP/1.1 over the Unix socket at /run/firecracker.socket. Parses JSON, validates, and forwards configuration requests to the VMM thread via an in-process channel. Has its own seccomp filter (it never needs to call KVM).
  • VMM thread. Runs an epoll event loop over file descriptors for emulated devices (TAP for net, file FDs for block, Unix socket for vsock), the API channel, signal-handling FDs, and a metrics timer. This is where device emulation actually happens.
  • vCPU thread × N. Each guest vCPU has a dedicated host thread that calls ioctl(KVM_RUN) in a loop. When the guest exits to the host (MMIO, port I/O, halt, hypercall), the vCPU thread handles the exit inline: reads and writes are dispatched to the device bus on the same thread, with no cross-thread message to the VMM thread on the hot path.

  • jailer: sets up the sandbox, then execs firecracker (forking first only with --new-pid-ns, in which case firecracker is PID 1 in the new namespace)
  • firecracker (jailed, inside the chroot)
      • API thread: HTTP/1.1 on /run/.../socket; seccomp: parse + IPC only
      • VMM thread: epoll over device FDs; seccomp: I/O + signaling
      • vCPU threads × N: loop on ioctl(KVM_RUN, …); seccomp: KVM ioctls only
  → in-process channels (mpsc); shared Arc<Mutex<Vmm>> for state mutation
  → each thread has its own seccomp BPF filter installed before guest code runs
  → kernel module KVM is invoked only via the vCPU threads' ioctl loop
  → process inherits jailer's restricted view: chrooted FS, dropped uid, cgroup-bound CPU/RAM

Process structure of a running microVM. Two permanent threads plus one per vCPU.
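
The inline exit handling described above amounts to an address-range lookup followed by a direct call into the device. A toy sketch of that dispatch, under loud assumptions: `Bus`, `RecordingDevice`, and the address used in the usage note are hypothetical and are not Firecracker's actual types or memory layout.

```python
import bisect


class Bus:
    """Toy address-range bus: maps [base, base + size) ranges to devices."""

    def __init__(self):
        self._bases = []    # sorted list of range base addresses
        self._entries = {}  # base -> (size, device)

    def register(self, base, size, device):
        idx = bisect.bisect_left(self._bases, base)
        # Reject any overlap with the neighbouring ranges.
        if idx > 0:
            prev = self._bases[idx - 1]
            if prev + self._entries[prev][0] > base:
                raise ValueError("overlapping range")
        if idx < len(self._bases) and base + size > self._bases[idx]:
            raise ValueError("overlapping range")
        self._bases.insert(idx, base)
        self._entries[base] = (size, device)

    def _find(self, addr):
        # The candidate range is the one with the largest base <= addr.
        idx = bisect.bisect_right(self._bases, addr) - 1
        if idx < 0:
            return None
        base = self._bases[idx]
        size, device = self._entries[base]
        return (device, addr - base) if addr < base + size else None

    def write(self, addr, data):
        """Dispatch a guest write on the calling thread; False if unclaimed."""
        hit = self._find(addr)
        if hit is None:
            return False
        device, offset = hit
        device.write(offset, data)
        return True


class RecordingDevice:
    """Stand-in device that records the writes it receives."""

    def __init__(self):
        self.writes = []

    def write(self, offset, data):
        self.writes.append((offset, bytes(data)))
```

Registering a device at, say, `0xD000_0000` with size `0x1000` and calling `bus.write(0xD000_0010, b"\x01")` hands the device offset `0x10`. No queue or second thread is involved, which is the point of the hot-path design.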

This is deliberately small. There is no thread pool, no async runtime, no background worker for device queues. Each VirtIO queue is serviced on the VMM thread when the guest kicks the eventfd. Block I/O is synchronous unless you opt into the async (io_uring) backend, which Firecracker added relatively recently for throughput-sensitive workloads.

Official docs say

"The process runs the following threads: API, VMM and vCPU(s)." That's it: three sentences for what's actually one of the more important architectural choices. The implication, which the design doc never makes explicit, is that built-in device emulation is single-threaded: one slow vsock packet handler can stall every other virtio queue on the VMM thread. (The opt-in vhost-user backend moves a device's data plane to a separate backend process, escaping this constraint.) This rarely matters in Lambda-shaped workloads, but it matters if you try to push high I/O.

The minimalist device surface

Almost everything the guest can talk to is a VirtIO device. VirtIO is paravirtualization: the guest knows it's virtualized and talks to the hypervisor via shared-memory ring queues instead of pretending the hypervisor is a 1990s SCSI controller. That choice eliminates a lot of complicated, bug-prone device emulation.

Here is the full guest-visible device list as of late 2025:

Device | Transport | What it does
virtio-block | MMIO (default) / PCIe | Block storage; one device per attached drive. Sync or io_uring-backed.
virtio-net | MMIO (default) / PCIe | Networking; one device per host TAP. Token-bucket rate limiting in process.
virtio-vsock | MMIO (default) / PCIe | Bidirectional byte stream between guest and host; bridges AF_VSOCK to AF_UNIX.
virtio-balloon | MMIO (default) / PCIe | Memory reclaim: host asks guest to inflate, guest gives pages back.
virtio-rng | MMIO (default) / PCIe | Entropy source backed by host /dev/urandom.
virtio-pmem | MMIO (default) / PCIe | Persistent memory device; useful for read-mostly rootfs images.
virtio-mem | MMIO (default) / PCIe | Memory hotplug: grow guest RAM after boot by plugging in additional regions.
16550A serial | port I/O (x86_64) / MMIO (aarch64) | Console output. On aarch64 it's the only non-virtio device besides the RTC.
i8042 (partial) | port I/O | x86_64 only. Implements just enough to deliver a CPU reset.
RTC (pl031) | MMIO | aarch64 only. Real-time clock; alarms not supported.
VMGenID | ACPI / MMIO | 128-bit ID exposed via ACPI; the guest CSPRNG reseeds when it changes (e.g. on snapshot restore).
ACPI tables | firmware | Minimal tables for power management and CPU/device enumeration.

What's not here is the more interesting list. There's no graphical console: no VGA, no QXL, no virtio-gpu. There's no USB. There's no SCSI controller. The PCIe transport is opt-in (--enable-pci, shipped in v1.13); without that flag every virtio device sits on the MMIO bus. There's no NVMe emulation. There's no audio. There's no IDE / ATA. There's no floppy or CD-ROM. There's no OVMF or SeaBIOS; Firecracker uses the PVH direct-boot protocol (or the Linux 64-bit boot protocol as a fallback) to jump straight into the Linux kernel.

Each device you don't emulate is a device whose 30,000 lines of C in QEMU you don't have to audit. VENOM (the 2015 floppy controller CVE in QEMU) is the canonical reminder that legacy device code is where VM escapes are found.

How devices attach to a microVM

Devices are configured one at a time via the API before InstanceStart:

# Set up a block device
PUT /drives/rootfs HTTP/1.1
{
  "drive_id": "rootfs",
  "path_on_host": "/var/lib/fc/rootfs.ext4",
  "is_root_device": true,
  "is_read_only": false,
  "rate_limiter": { "bandwidth": { "size": 10485760, "refill_time": 1000 } }
}

# Set up a network interface
PUT /network-interfaces/eth0 HTTP/1.1
{
  "iface_id": "eth0",
  "host_dev_name": "fc-tap0",
  "guest_mac": "06:00:AC:10:00:02"
}

# Boot
PUT /actions HTTP/1.1
{ "action_type": "InstanceStart" }

Most configuration is immutable after boot. You cannot hot-add a vCPU, change the kernel, or replace the rootfs of a running microVM. The exceptions are narrow: rate limiters and the balloon size can be tuned at runtime, block devices can be hot-plugged when PCIe is enabled (--enable-pci, v1.13+), virtio-mem can grow guest RAM, and you can write to the MMDS data store.
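
For orchestrators written in Python, the request sequence above can be driven with the standard library alone. This is a minimal sketch, not an official client: `UnixHTTPConnection`, `boot_requests`, and `send_all` are hypothetical helper names, the JSON bodies simply mirror the example above (without the rate limiter), and a real boot would also need a /boot-source call, which that example omits.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP/1.1 client that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5.0):
        # The host name is only used for the Host header; no TCP is involved.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def boot_requests(rootfs_path, tap_name, guest_mac):
    """Return the ordered (method, path, body) calls from the example above."""
    return [
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": False,
        }),
        ("PUT", "/network-interfaces/eth0", {
            "iface_id": "eth0",
            "host_dev_name": tap_name,
            "guest_mac": guest_mac,
        }),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


def send_all(socket_path, requests):
    """Send each request in order; raise on the first non-2xx response."""
    conn = UnixHTTPConnection(socket_path)
    try:
        for method, path, body in requests:
            conn.request(method, path, body=json.dumps(body),
                         headers={"Content-Type": "application/json"})
            resp = conn.getresponse()
            payload = resp.read()  # drain so the connection can be reused
            if not 200 <= resp.status < 300:
                raise RuntimeError(f"{method} {path} -> {resp.status}: {payload!r}")
    finally:
        conn.close()
```

A caller would run something like `send_all("/run/firecracker.socket", boot_requests("/var/lib/fc/rootfs.ext4", "fc-tap0", "06:00:AC:10:00:02"))`. Keeping the request list as plain data makes the ordering (devices first, InstanceStart last) easy to inspect and test without a running VMM.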

The boot path

The canonical Firecracker number, quoted in SPECIFICATION.md, is ≤125 ms cold boot, measured from InstanceStart to the start of guest /sbin/init with the serial console disabled and a minimal kernel / rootfs. Snapshot restore is much faster again, in the low tens of milliseconds. The path:

Timeline, from t = 0 to guest userspace: jailer setup (~5 ms) → API thread up (socket bind) → PUT /machine-config → PUT /boot-source, /drives, /network-interfaces → PUT /actions {InstanceStart} → kernel loaded → vCPU run (PVH entry point)

  1. Jailer sets up cgroups, unshares namespaces, chroots, drops uid/gid, then execves firecracker.
  2. Firecracker installs the API-thread seccomp filter, binds the Unix socket, and blocks waiting for config.
  3. Orchestrator sends 4–6 sequential PUTs to configure the machine. Each is validated against vmm_config types.
  4. InstanceStart triggers vmm::builder::build_microvm_for_boot: create KVM VM, mmap guest memory, load kernel image, set up vCPUs and registers, attach devices, install vCPU + VMM seccomp filters, then resume vCPUs.
  5. vCPU thread(s) call ioctl(KVM_RUN). Guest enters at the PVH (or Linux-64) entry point, mounts rootfs, and execs /sbin/init.

Cold boot from execve to guest userspace. Snapshot restore replaces steps 3–5 with a single deserialize+resume call.

A few details worth knowing:

  • Kernel loading is direct. Firecracker reads an uncompressed kernel image (an ELF vmlinux on x86_64, a PE image on aarch64), copies it into guest physical memory, and sets the vCPU's instruction pointer to the entry point. There is no firmware to boot through. On x86_64 the loader prefers the PVH entry note when the kernel has one, falling back to the Linux 64-bit boot protocol otherwise.
  • Guest memory is anonymous-mapped on the host and faulted in on first touch by the guest. The same lazy-faulting trick is what makes snapshot restore fast, except that there the backing is a file-mapped region instead of anonymous memory.
  • Seccomp filters are installed at multiple points. The API thread gets its filter when it starts. The vCPU threads get their filter after the KVM file descriptor is opened but before KVM_RUN, so an exploited vCPU thread cannot open arbitrary files. The VMM thread's filter is installed last, right before build_microvm_for_boot returns and the vCPUs start running; by that point all device file descriptors are already open, and the filter narrows what the VMM thread can do during the run.

Isolation: defense in depth

Firecracker stacks five mechanisms between the guest and the host. None of them is sufficient alone; together they make a guest-to-host escape require breaking every layer.

  • Host kernel (last line of defense)
  • Jailer: chroot + pivot_root + cgroups + namespaces + uid drop
  • Seccomp BPF: ~24–50 allowed syscalls per thread
  • KVM: vmexits → ioctl handlers; narrow kernel surface
  • Hardware: VT-x/SVM, EPT/NPT; guest cannot see host memory
  • Guest workload (treated as arbitrary, possibly hostile code)

Defense in depth: each layer is independent. An attacker has to defeat all five.

Layer 1: Hardware (VT-x / SVM + EPT / NPT)

The CPU's virtualization extensions and nested page tables enforce that the guest sees only the physical pages KVM has mapped for it. Privileged instructions in the guest trap to the host's KVM module. This is the foundation; without it nothing else matters. Notably, it does not defend against microarchitectural side channels (Spectre, Meltdown family, MDS, etc.). Those are mitigated by separate means: kernel mitigations, SSBD, optional CPU template scrubbing.

Layer 2: KVM

The Linux kernel module that exposes virtualization to userspace via ioctls. It is the most security-critical piece of the host kernel from Firecracker's perspective. KVM itself is much smaller than the rest of the kernel (tens of thousands of LOC vs millions), but it is not trivial: there is a steady stream of KVM CVEs, and Firecracker's threat model explicitly accepts host-kernel exposure as a residual risk.

Layer 3: Seccomp BPF

Each Firecracker thread runs under its own seccomp filter that allowlists the small set of syscalls it actually needs. The default filters under resources/seccomp/ permit ~24 syscalls for vCPU threads (the narrowest filter: just KVM ioctls and a handful of memory-management calls), ~31 for the API thread, and ~50 for the VMM thread. The filters are compiled at build time from JSON specifications by the seccompiler tool in the same repo. They are installed before guest code runs, and because seccomp filters cannot be relaxed once installed, a compromised vCPU thread cannot expand its own privileges.

Gotcha

Custom seccomp filters via --seccomp-filter are easy to get wrong. The official docs explicitly warn that misconfigured filters can crash Firecracker, and they offer --no-seccomp only for prototyping. Don't ship that flag.

Layer 4: Jailer

The jailer is a separate binary that runs before Firecracker. Its job is to set up a sandbox and then execve Firecracker into it. Specifically, the jailer:

  • Creates a cgroup (v1 or v2) and writes the requested CPU, memory, and I/O limits.
  • Calls unshare(CLONE_NEWNS) to get its own mount namespace, then pivot_roots into a per-microVM chroot directory.
  • Optionally creates a PID namespace (off by default, via --new-pid-ns). Optionally joins a pre-existing network namespace (via --netns <path> + setns); the jailer does not create one.
  • Closes every file descriptor except stdin/stdout/stderr and a few that Firecracker needs (/dev/kvm, the API socket, the TAP, log/metrics pipes).
  • Drops to a non-root uid/gid. Once running, Firecracker has no host root privileges.
  • execves Firecracker. In the default path this is a plain exec: the jailer process simply becomes Firecracker, with no parent left behind. With --new-pid-ns it forks a child to exec Firecracker as PID 1 in the new namespace, and the original jailer process exits.
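
From the orchestrator's side, the steps above are triggered by a single jailer invocation. Below is a hedged sketch of assembling that command line. The helper names `jailer_argv` and `chroot_dir` are hypothetical; only --new-pid-ns and --netns are named in the list above. The remaining flags, the /srv/jailer default, and the chroot path layout reflect my reading of the jailer documentation and should be checked against `jailer --help` for your version.

```python
import os


def jailer_argv(vm_id, exec_file, uid, gid, *, chroot_base="/srv/jailer",
                netns=None, new_pid_ns=False, cgroup_version=2,
                jailer_bin="jailer"):
    """Build an argv list for the jailer (flag names are assumptions, see above)."""
    argv = [
        jailer_bin,
        "--id", vm_id,
        "--exec-file", exec_file,
        "--uid", str(uid),
        "--gid", str(gid),
        "--chroot-base-dir", chroot_base,
        "--cgroup-version", str(cgroup_version),
    ]
    if netns:
        # Join a pre-existing network namespace; the jailer does not create one.
        argv += ["--netns", netns]
    if new_pid_ns:
        # Off by default; makes Firecracker PID 1 in a fresh PID namespace.
        argv.append("--new-pid-ns")
    return argv


def chroot_dir(vm_id, exec_file, chroot_base="/srv/jailer"):
    """Per-microVM chroot path, assuming <base>/<exec name>/<id>/root."""
    return os.path.join(chroot_base, os.path.basename(exec_file), vm_id, "root")
```

The orchestrator is expected to place the kernel, rootfs, and any other files the microVM needs under the directory returned by `chroot_dir` before launch, since nothing outside it is visible after pivot_root.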

Note what the jailer is not: it's not a container runtime. There's no OCI bundle, no image layering, no built-in network plumbing. The jailer is roughly 3,000 lines of Rust that does the smallest possible useful set of process-isolation work and then gets out of the way.

Layer 5: Host kernel

If everything above fails, the only thing standing between a guest and total host compromise is the host kernel itself. Firecracker's design acknowledges this: the threat model explicitly assumes a hardened host kernel and treats host kernel CVEs as out of scope for Firecracker to mitigate. Hosts running Firecracker in production should be on a recent LTS kernel with the standard hardening: kpti, KASLR, SMEP/SMAP, all relevant Spectre mitigations, and a small attack surface (no unneeded modules).

Threat model, formalized

This is the section the official docs gesture at but never write down. The CHARTER.md says customer workloads are "simultaneously considered sacred (shall not be touched) and malicious (shall be defended against)", which is a great framing but not a threat model. Here is what one looks like for Firecracker:

Trust zones

Zone | Trust level | Boundary enforced by
Guest userspace | Untrusted | Guest kernel
Guest kernel | Untrusted | KVM + EPT (hardware) + VirtIO surface
Firecracker process | Semi-trusted | Seccomp + jailer + cgroups
Host kernel | Trusted | (out of Firecracker's scope to defend)
Orchestrator / control plane | Trusted | OS-level access controls on the API socket

Attacker capabilities

  • The attacker controls arbitrary code in guest userspace. They can issue any syscall to the guest kernel.
  • The attacker is assumed to be able to compromise the guest kernel. A guest-kernel compromise is in scope; Firecracker must still contain the damage.
  • The attacker cannot directly control the orchestrator or the Firecracker API socket. (If they can, the model breaks; nothing in Firecracker defends against a malicious operator.)
  • The attacker can co-locate workloads with other tenants on the same host. They can attempt side-channel attacks and resource-exhaustion attacks against neighbors.

What Firecracker defends

  • Host kernel integrity from guest code paths, via KVM + seccomp narrowing the syscall surface from "everything Linux exposes" to "~30 syscalls per thread."
  • Host filesystem from guest code paths, via the jailer's chroot + pivot_root + FD scrub.
  • Other tenants' memory and CPU time, via per-microVM EPT mappings, per-process address spaces, and cgroup-enforced CPU/RAM shares.
  • Host network plane from guest network plane, via per-VM TAP devices and netfilter rules on the host (not provided by Firecracker, but assumed by its model).
  • Sensitive guest data at rest, in the limited sense that Firecracker never reads the contents of guest memory or disk except to serialize a snapshot.

What Firecracker explicitly does not defend

  • Microarchitectural side channels in the general case. Spectre v1/v2/v4, MDS, L1TF, Retbleed, and successors are mitigated as a layered concern (kernel mitigations, optional CPU isolation, optional CPU templates that mask sibling-thread-leaking features), but Firecracker does not claim immunity. Multi-tenant operators are expected to handle sibling-thread leakage out-of-band, for example by pinning each microVM to its own physical core or core complex on SMT-vulnerable CPUs (the approach Lambda is widely reported to take, though AWS has not published the exact policy).
  • Host kernel CVEs. If the attacker finds a Linux KVM bug, Firecracker offers no additional protection beyond the seccomp narrowing.
  • Supply-chain attacks on Firecracker or its dependencies. A compromised rust-vmm crate is game over.
  • Malicious operator scenarios. Anyone with write access to the API socket can do anything to the microVM. Privilege separation between an orchestrator and the workload is the orchestrator's job, not Firecracker's.
  • Denial of service via legitimate-looking resource use, beyond what cgroups enforce. A pathological guest workload that maxes out its cgroup share is not an attack from Firecracker's perspective.

Why this matters

When you read "Firecracker provides VM-grade isolation," what's actually being claimed is the first list, not the second. The second list is where multi-tenant cloud operators add their own controls: dedicated cores, careful host kernel selection, network segmentation, signed binaries. Firecracker is one piece of a defense system, not the whole system.

Snapshots & the cold-start trick

This is arguably the feature that makes Firecracker work as the Lambda substrate. A microVM that takes 200 ms to boot is fine for a long-running Fargate task; it is unacceptable for a Lambda function that is supposed to feel synchronous to the user. Snapshots let you pre-boot a microVM up to the point where it's ready to handle a request, freeze it, and restore copies of that frozen state in milliseconds.

Anatomy of a snapshot

A Firecracker snapshot is 2 files (+ external disk images):

  • memory file: full guest RAM, sparse on disk; restored via mmap(MAP_PRIVATE), with pages faulted in on guest access (or via a userfaultfd handler)
  • state file: vCPU registers, KVM internal state, device state machines, virtio queues, interrupt controller, plus a CRC64 (corruption check)
  • disk images: NOT in the snapshot; the caller manages them externally. Snapshots can be reused across microVMs by also cloning the disk via CoW.

What's in a snapshot, and what isn't.
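
On the API side, producing and consuming those two files takes a few requests. The sketch below only builds the request sequences. The helper names are hypothetical, and the endpoints and field names (/vm, /snapshot/create, /snapshot/load, mem_backend, resume_vm) are my reading of the public snapshot documentation rather than anything stated above, so verify them against the API specification for your Firecracker version.

```python
def snapshot_create_requests(state_path, mem_path):
    """Pause the microVM, then write the state file and the memory file."""
    return [
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": state_path,   # the small state file
            "mem_file_path": mem_path,     # the guest-RAM-sized memory file
        }),
    ]


def snapshot_load_requests(state_path, mem_path, resume=True):
    """Restore into a fresh, unconfigured Firecracker process."""
    return [
        ("PUT", "/snapshot/load", {
            "snapshot_path": state_path,
            # "File" selects the plain mmap path, not a userfaultfd handler.
            "mem_backend": {"backend_type": "File", "backend_path": mem_path},
            "resume_vm": resume,
        }),
    ]
```

Disk images do not appear in either request, which matches the description above: the caller has to put a suitable copy of each backing file at the path recorded in the snapshot before loading.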

The memory file is the big one: it's the same size as the guest's RAM allocation. But because it's mmaped with MAP_PRIVATE on restore, pages are only physically materialized when the guest actually touches them. A freshly restored 512 MiB microVM might cost the host only a few MiB of RSS until the workload starts touching memory.
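
The MAP_PRIVATE behaviour is easy to observe outside Firecracker. The sketch below is a Linux-only illustration with a hypothetical helper name: a temporary sparse file stands in for the snapshot memory file. It shows that a write through the private mapping lands in a copied page while the backing file stays untouched, which is why one memory file can back many restored microVMs.

```python
import mmap
import os
import tempfile


def private_mapping_demo(size=1 << 20):
    """Map a sparse file MAP_PRIVATE, write through the mapping, compare."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # sparse file: length set, no data written
        mem = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
                        prot=mmap.PROT_READ | mmap.PROT_WRITE)
        mem[0:4] = b"\xde\xad\xbe\xef"  # first write: page is copied privately
        in_mapping = bytes(mem[0:4])
        mem.close()
        in_file = os.pread(fd, 4, 0)    # read the same offset from the file
        return in_mapping, in_file
    finally:
        os.close(fd)
        os.unlink(path)
```

The function returns the bytes seen through the mapping and the bytes still in the file; the first reflects the write and the second remains zeroed.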

What's not in the snapshot

  • Block-device contents. The host backing file is referenced by path, not by content. To clone a snapshot across N microVMs you also need to clone the rootfs, usually via copy-on-write (overlayfs, dm-snapshot, or a filesystem with reflink support).
  • Open network connections. The guest's TCP state is in guest memory, so it survives the snapshot. But the host's TAP and any NAT state are gone after restore; existing TCP sessions break. Long-lived connections do not survive snapshots.
  • MMDS data store contents. The MMDS network config (MAC, IP, ports) is persisted, but the key-value data store itself is not, so guest-specific values like instance IDs don't leak across microVMs that share a snapshot.
  • vsock connections. At snapshot time the virtio-vsock device sends a VIRTIO_VSOCK_EVENT_TRANSPORT_RESET to the guest; the guest driver tears down active connections on resume. Only listening sockets survive (with their CID rewritten to the restored guest_cid).

The reuse problem

If you snapshot a microVM and then restore N copies of it, every copy starts with identical RNG state, identical kernel boot-time entropy, identical TLS session caches, identical machine IDs. This is bad. Two microVMs deriving the same TLS server's keypair from the same seed will produce the same keys.

The mitigation is VMGenID, a 128-bit ID exposed through ACPI that changes when a snapshot is restored. A guest kernel that knows about VMGenID (Linux 5.18+ has it) will reseed its CSPRNG on change. The userspace reuse problem isn't solved by VMGenID: applications that cached secrets in memory before the snapshot still need to know to re-derive them. The Firecracker docs flag this explicitly and the responsibility is firmly on the workload author.
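
The failure mode and the fix can be imitated in a few lines. This is a toy model only: Python's `random.Random` is not a CSPRNG, the helper names are hypothetical, and reseeding here replaces the generator state outright, whereas a kernel would mix the new ID into its entropy pool. It does show the shape of the problem: cloned state yields identical output until something unique per restore is folded in.

```python
import random


def clone_rng(rng):
    """Copy generator state, as restoring one snapshot twice would."""
    clone = random.Random()
    clone.setstate(rng.getstate())
    return clone


def reseed_on_generation_change(rng, generation_id):
    """Reseed from a per-restore ID (stand-in for the 128-bit VMGenID)."""
    rng.seed(generation_id)
```

Two clones of the same generator produce the same sequence; after each is reseeded with a different 16-byte ID, their outputs diverge. Any secret derived before the reseed is still shared, which is the userspace half of the problem.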

In practice

Lambda's snapshot-based "SnapStart" feature solves this by running an explicit re-init hook after each restore: your Java runtime gets a callback that says "you were just restored, regenerate per-instance secrets now." Don't assume the kernel-level VMGenID is enough for your workload.

Restore performance

Cold restore is fast because almost no work happens up front: the state file is small (kilobytes to a few MB), memory is mmaped lazily, and KVM resumes the vCPUs from their saved register state. Reported restore-to-first-instruction latencies are in the low tens of milliseconds. Throughput-sensitive workloads will then pay the page-fault cost as memory warms up; for cold-start-sensitive workloads like Lambda, this trade is exactly right.

For workloads that want more control over the restore path (e.g., to fetch pages from a remote store, or to track which pages are still untouched), Firecracker supports a userfaultfd backend instead of plain mmap. An external handler process receives a notification on every guest page fault and decides what to return.

MMDS, vsock, networking

MMDS: the in-guest metadata service

The microVM Metadata Service is Firecracker's equivalent of AWS EC2's IMDS. The orchestrator PUTs a JSON document to the Firecracker API; the guest GETs it from 169.254.169.254 by default (the IMDS magic address, configurable per VM via the ipv4_address field in /mmds/config) over one of its configured network interfaces. Two versions:

  • v1: simple unauthenticated GET. Deprecated for the same reason EC2 deprecated IMDSv1: SSRF attacks against guest workloads.
  • v2: session-token-required, modelled on EC2 IMDSv2. The native headers are X-metadata-token{,-ttl-seconds}, and the EC2 names (X-aws-ec2-metadata-token…) are accepted as aliases. The data store response defaults to a JSON payload; setting imds_compat: true switches to EC2's IMDS text format so unmodified AWS SDK code works against it.

MMDS is implemented inside Firecracker's network device emulation: packets to the magic address are intercepted at the virtio-net layer before they hit the host TAP and answered by the in-process HTTP responder. The data store is not persisted across snapshots, by design.
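
From inside the guest, the v2 flow is two requests: obtain a session token, then present it on each read. A minimal sketch with hypothetical helper names follows. The header names are the native ones given above; the /latest/api/token path follows the EC2 IMDSv2 convention that MMDS v2 is modelled on and is an assumption here, as is any data path you pass in, since the tree layout is whatever the orchestrator stored.

```python
import http.client

MMDS_ADDR = "169.254.169.254"  # default address from the text above


def token_request(ttl_seconds=60):
    """Request that asks MMDS for a session token valid for ttl_seconds."""
    return ("PUT", "/latest/api/token",
            {"X-metadata-token-ttl-seconds": str(ttl_seconds)})


def data_request(path, token):
    """Request that reads `path` from the data store using a session token."""
    return ("GET", path, {"X-metadata-token": token})


def fetch_metadata(path, ttl_seconds=60, addr=MMDS_ADDR):
    """Run both steps over one connection; only meaningful inside a guest."""
    conn = http.client.HTTPConnection(addr, 80, timeout=2)
    try:
        method, url, headers = token_request(ttl_seconds)
        conn.request(method, url, headers=headers)
        token = conn.getresponse().read().decode()
        method, url, headers = data_request(path, token)
        conn.request(method, url, headers=headers)
        return conn.getresponse().read().decode()
    finally:
        conn.close()
```

Separating request construction from I/O keeps the header logic testable on a machine without MMDS. The token round trip is what blunts simple SSRF, because a forged GET alone carries no valid token.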

vsock: host-guest IPC

Standard AF_VSOCK in the guest, but Firecracker translates each end to an AF_UNIX socket on the host:

  • Host-initiated: orchestrator connects to a Unix socket, writes CONNECT <port>\n, Firecracker opens an AF_VSOCK channel to the guest on that port and bridges bytes.
  • Guest-initiated: guest opens an AF_VSOCK connection to host CID 2, Firecracker connects to a Unix socket at <uds_path>_<port> on the host.
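
The host-initiated handshake is small enough to write by hand. In this sketch the helper names are hypothetical, and the "OK <port>" acknowledgement line is my understanding of Firecracker's vsock documentation rather than something stated above, so treat the reply parsing as an assumption.

```python
import socket


def vsock_handshake(sock, guest_port):
    """Send CONNECT <port> and wait for the one-line acknowledgement."""
    sock.sendall(f"CONNECT {guest_port}\n".encode("ascii"))
    line = b""
    while not line.endswith(b"\n"):
        chunk = sock.recv(1)  # byte at a time so no payload is consumed
        if not chunk:
            raise ConnectionError("socket closed before handshake completed")
        line += chunk
    fields = line.decode("ascii").split()
    if len(fields) != 2 or fields[0] != "OK":
        raise ConnectionError(f"unexpected reply: {line!r}")
    return int(fields[1])  # host-side port assigned to this connection


def connect_to_guest(uds_path, guest_port):
    """Open the vsock Unix socket and bridge to a port the guest listens on."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(uds_path)
        vsock_handshake(sock, guest_port)
    except Exception:
        sock.close()
        raise
    return sock
```

After `connect_to_guest` returns, the socket is an ordinary byte stream to whatever the guest has listening on that AF_VSOCK port.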

Useful when you want a control channel between an orchestrator and the guest that doesn't go over the network. Lambda uses it for the runtime's communication with the Lambda invoke service.

Networking

Each virtio-net device is backed by a host TAP device. Firecracker does no network plumbing: it doesn't create the TAP, doesn't set up routes, doesn't manage NAT or bridges. That's the orchestrator's job: create a TAP, attach it to a bridge (or a netns with NAT rules), then tell Firecracker the TAP name.

Within Firecracker, the only network feature is token-bucket rate limiting per device, configurable in bandwidth (bytes/second) and ops (packets/second), with both rates having independent burst capacity. The rate limiter sits between the guest's virtio queue and the host TAP write; packets that exceed the budget are queued, not dropped.
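
A token bucket of this kind can be modelled in a few lines. The sketch below is a simplified illustration, not Firecracker's rate limiter: the class name is hypothetical, time is passed in explicitly to keep it deterministic, and the separate burst allowance mentioned above is not modelled. The `size` and `refill_time` parameters follow the rate_limiter JSON shown earlier (refill_time in milliseconds).

```python
class TokenBucket:
    """Bucket holding up to `size` tokens, refilled fully every refill_time_ms."""

    def __init__(self, size, refill_time_ms, now_ms=0):
        self.size = size
        self.refill_time_ms = refill_time_ms
        self.tokens = float(size)  # start full
        self.last_ms = now_ms

    def _refill(self, now_ms):
        elapsed = now_ms - self.last_ms
        if elapsed > 0:
            gained = self.size * elapsed / self.refill_time_ms
            self.tokens = min(float(self.size), self.tokens + gained)
            self.last_ms = now_ms

    def try_consume(self, amount, now_ms):
        """Take `amount` tokens if available; otherwise leave the bucket as is."""
        self._refill(now_ms)
        if amount > self.tokens:
            return False  # caller should hold the packet and retry later
        self.tokens -= amount
        return True
```

With the earlier example values (`size=10485760`, `refill_time=1000`), the bucket admits 10 MiB immediately, refuses further bytes at the same instant, and has 5 MiB available again after 500 ms. A `False` result corresponds to the queued, not dropped behaviour described above.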

Firecracker vs QEMU, gVisor, runc

This comparison gets misframed often. These aren't equivalent technologies; they're points on a spectrum of isolation strength traded against resource cost:

Aspect | Firecracker | QEMU | gVisor | runc / containers
Isolation mechanism | KVM + hardware virt | KVM + hardware virt | User-space syscall interception | Linux namespaces + cgroups
Shared kernel with host? | No (separate guest kernel) | No (separate guest kernel) | No (Sentry intercepts syscalls) | Yes (shared host kernel)
Devices emulated | ~8 VirtIO + serial/i8042 | 40+ (PCI, USB, SCSI, GPU, …) | None (syscall-level) | None (bare kernel access)
Threads per workload | 2 + N vCPUs | Many; varies | Several (Sentry, Gofer) | Shared with parent
Boot / start time | ~125 ms (cold), <50 ms (snapshot) | seconds | ~100 ms | ~10–50 ms
RAM overhead | ~5 MiB | 50–200 MiB | ~15 MiB | ~1 MiB
Designed for multi-tenant? | Yes | Optional | Yes | Single-tenant by default
Codebase | ~115k LOC Rust | millions LOC C | ~150k LOC Go | ~50k LOC Go
Live migration | No | Yes | No | n/a (process)
Open source | Apache 2.0 | GPL v2 | Apache 2.0 | Apache 2.0

vs QEMU

Same fundamental mechanism (KVM), radically different surface area. QEMU is a general-purpose machine emulator: it can boot a 386, a PowerPC, an ARM SoC, a Raspberry Pi. Firecracker can boot exactly one machine: a Linux guest on x86_64 or aarch64, using the PVH entry where available and the Linux 64-bit boot protocol otherwise. The formally supported guest kernels are 5.10 and 6.1 LTS (per docs/kernel-policy.md); newer kernels often work but aren't covered by the policy. If you want to run Windows, run an unmodified existing VM image, or hot-add a USB controller, you want QEMU. If you want to run thousands of identical Linux serverless workloads with minimal overhead, you want Firecracker. (Notably, QEMU itself can be configured with the microvm machine type for similar minimalism; Firecracker's value over that path is the Rust safety story and the operational opinionatedness.)

vs gVisor

gVisor is the opposite philosophy: instead of giving the guest its own kernel, gVisor intercepts the guest's syscalls in userspace ("Sentry") and re-implements them. The result is a much smaller per-workload memory footprint and no hardware-virt requirement, but at the cost of slower syscalls and incomplete Linux compatibility. Firecracker says: "trust the Linux kernel completely, contain it inside hardware virtualization." gVisor says: "don't trust the Linux kernel at all, re-implement it in safer code." Different teams reach different conclusions on the same problem; both are deployed at scale.

vs containers (runc)

Not really comparable. A container shares the host kernel. Every syscall the workload makes is executed by the same kernel that other tenants' workloads also rely on. For trusted workloads, that's fine, and faster than anything else listed here. For untrusted multi-tenant workloads, it's not a defensible boundary on its own. The industry convention is to nest: run a container inside a Firecracker microVM (Fargate's model) or inside a gVisor sandbox.

Where Firecracker fits at AWS

Firecracker is open-source but built and primarily maintained by AWS. The three production workloads it underpins:

  • AWS Lambda. Every function invocation runs in a Firecracker microVM. Lambda's pre-2018 architecture used dedicated EC2 instances per concurrent tenant; the move to Firecracker is what enabled per-function microVMs at the density Lambda needs. Snapshot-based "SnapStart" (launched for Java in late 2022, since broadened to Python and .NET) is built on Firecracker's snapshot/restore.
  • AWS Fargate (ECS). Each ECS Fargate task runs inside a Firecracker microVM; the user's container runs inside that microVM. So Fargate-on-ECS is "containers in microVMs," and the security boundary that matters is the microVM, not the container. (EKS Fargate's data plane is less openly documented and is not confirmed to use Firecracker in the same way.)
  • AWS App Runner and a number of other internal services use Firecracker under the hood, though the public posture varies.

Outside AWS, Firecracker has become the default substrate for a generation of serverless and edge-compute platforms: Fly.io, Koyeb, Northflank, and others. Kata Containers supports Firecracker as a VMM backend for Kubernetes pods that want VM-grade isolation. The OSS adoption is real and ongoing.

Reading the source

If you want to verify any of the above against the actual code, here's where to look in firecracker-microvm/firecracker:

  • src/firecracker/src/main.rs: process entry, CLI parsing, signal handler installation, branch into API vs no-API mode.
  • src/vmm/src/lib.rs: VMM crate root; the top-level module structure mirrors the architecture (arch, devices, device_manager, vstate, vmm_config, builder, snapshot, rpc_interface).
  • src/vmm/src/builder.rs: build_microvm_for_boot (cold path) and build_microvm_from_snapshot (restore) turn a configured set of resources into a running VM. build_and_boot_microvm is the top-level entry that wraps the cold path. Read these to see the boot path in code.
  • src/vmm/src/device_manager/mod.rs: where devices are wired into the bus and into the IRQ chip.
  • src/vmm/src/devices/virtio/: one directory per VirtIO device (block/, net/, vsock/, balloon/, rng/, pmem/, mem/), plus the vhost_user backend. Each contains a state machine for the virtio queue plus the device-specific logic.
  • src/vmm/src/devices/legacy/: serial UART, i8042, RTC. Small and focused.
  • src/vmm/src/vmm_config/: the JSON-serializable types that back the REST API. Reading these gives you the configuration surface in one place.
  • src/jailer/src/main.rs: the jailer binary. Self-contained and short; worth reading end-to-end.
  • src/seccompiler/: the seccomp filter compiler. Filter rules live under resources/seccomp/ as JSON.
  • src/vmm/src/vstate/: KVM setup, vCPU thread management, memory regions.
  • src/vmm/src/snapshot/: snapshot serialization, persistence format, restore logic.

Further reading

  • The Firecracker NSDI 2020 paper: Agache et al., "Firecracker: Lightweight Virtualization for Serverless Applications." The authoritative description of the design goals and how they trade off. The honest place to read about boot-time and density numbers.
  • docs/ in the repo: the official design.md plus dedicated docs for jailer, snapshots, vsock, MMDS, hugepages, networking, metrics, tracing. Inconsistently maintained, but always the most current.
  • CHARTER.md: the "sacred and malicious" framing in its original form. Short and worth reading.
  • AWS re:Invent talks SVS404 / SVS402: multiple years' worth of Lambda-architecture talks have details on how Firecracker is deployed in practice (host packing, snapshot warm-pool, SnapStart's invalidation hook).
  • rust-vmm: the upstream crate ecosystem Firecracker depends on. If you want to write your own KVM-based VMM in Rust, this is where to start.