fc-cri Architecture & Design Document
Executive Summary
fc-cri is a custom container runtime for Kubernetes that runs containers inside Firecracker microVMs. It plugs into the CRI (Container Runtime Interface) path as a containerd shim v2 and is designed as a lightweight alternative to Kata Containers, targeting lower per-VM memory overhead and faster startup while maintaining VM-level isolation.
Goals
- Memory: 64-128MB per VM (vs 160MB+ for Kata)
- Cold Start: <150ms (vs 500ms+ for Kata)
- Warm Start: <50ms using VM pooling
- Simplicity: Direct Firecracker integration, minimal abstraction layers
Non-Goals
- Full Kata compatibility
- Support for multiple hypervisors (Firecracker only)
- Windows containers
Background & Motivation
Why Not Kata Containers?
Kata Containers is the established solution for running containers in VMs on Kubernetes. However, it has significant overhead:
| Resource | Kata Containers | Target for fc-cri |
|---|---|---|
| Memory baseline | 160MB+ | 64-128MB |
| Guest agent | ~50MB (kata-agent) | ~2-3MB (fc-agent) |
| Cold start time | 500-800ms | <150ms |
| Warm start time | N/A | <50ms |
| Supported hypervisors | QEMU, Cloud-Hypervisor, Firecracker | Firecracker only |
Kata's overhead comes from:
- Generic VMM abstraction - Supports multiple hypervisors, adding complexity
- Full-featured guest agent - kata-agent is feature-rich but heavy
- No VM pooling - Every pod starts a fresh VM
- Complex storage - Multiple layers of abstraction for image handling
Why Firecracker?
Firecracker is AWS's microVM hypervisor, used in Lambda and Fargate. Key properties:
- Minimal attack surface - Purpose-built for multi-tenant isolation
- Fast boot - As little as 125ms from VM launch to guest user space
- Low memory - ~5MB VMM overhead
- Simple API - REST/socket API for VM management
- No legacy support - No BIOS, no PCI, no USB = smaller kernel
Use Cases
- Multi-tenant SaaS - Run untrusted customer code with VM isolation
- CI/CD pipelines - Isolated build environments
- Serverless platforms - Fast-starting isolated execution environments
- Compliance workloads - When container isolation isn't sufficient
Architecture Overview
System Context
┌─────────────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Control Plane │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ API Server │ │ Scheduler │ │ Controller Manager │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Schedule Pod with │
│ │ runtimeClassName: firecracker │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Worker Node │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ kubelet │ │ │
│ │ │ │ │ │ │
│ │ │ CRI (gRPC) │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ ┌─────────────────────────────────────────────────┐ │ │ │
│ │ │ │ containerd │ │ │ │
│ │ │ │ │ │ │ │ │
│ │ │ │ Runtime Selection │ │ │ │
│ │ │ │ │ │ │ │ │
│ │ │ │ ┌─────────────┴─────────────┐ │ │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ │ ▼ ▼ │ │ │ │
│ │ │ │ ┌──────────┐ ┌─────────────┐ │ │ │ │
│ │ │ │ │ runc │ │ fc-cri │ │ │ │ │
│ │ │ │ │ (default)│ │ shim │ │ │ │ │
│ │ │ │ └──────────┘ └─────────────┘ │ │ │ │
│ │ │ │ │ │ │ │ │
│ │ │ └────────────────────────────────────┼───────────┘ │ │ │
│ │ │ │ │ │ │
│ │ └────────────────────────────────────────┼───────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ Firecracker VMM │ │ │
│ │ │ ┌──────────────────────────────────────────────────┐ │ │ │
│ │ │ │ microVM │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ ┌─────────────┐ ┌──────────────────────┐ │ │ │ │
│ │ │ │ │ fc-agent │◄────►│ Container (runc) │ │ │ │ │
│ │ │ │ └─────────────┘ └──────────────────────┘ │ │ │ │
│ │ │ │ ▲ │ │ │ │
│ │ │ └──────────┼────────────────────────────────────────┘ │ │ │
│ │ │ │ vsock │ │ │
│ │ │ ▼ │ │ │
│ │ │ ┌─────────────┐ │ │ │
│ │ │ │ VM Pool │ (pre-warmed VMs) │ │ │
│ │ │ └─────────────┘ │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Component Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ fc-cri │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ containerd-shim-fc-v2 │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ Shim Service │ │ VM Manager │ │ Agent Client │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ - Task lifecycle │ │ - Create/Stop VM │ │ - JSON-RPC/vsock │ │ │
│ │ │ - Event publish │ │ - Snapshot mgmt │ │ - Container ops │ │ │
│ │ │ - State tracking │ │ - Resource cfg │ │ - Exec/Attach │ │ │
│ │ └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ │ │
│ │ │ │ │ │ │
│ │ └─────────────────────┼─────────────────────┘ │ │
│ │ │ │ │
│ │ ┌──────────────────────────────┴───────────────────────────────┐ │ │
│ │ │ VM Pool │ │ │
│ │ │ │ │ │
│ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐│ │ │
│ │ │ │ Warm VM │ │ Warm VM │ │ Warm VM │ │ Warm VM │ │ Warm VM ││ │ │
│ │ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘│ │ │
│ │ │ │ │ │
│ │ │ - Acquire() → O(1) VM retrieval │ │ │
│ │ │ - Release() → Return to pool or destroy │ │ │
│ │ │ - Auto-replenish background goroutine │ │ │
│ │ │ - Idle cleanup (configurable max idle time) │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ Network (CNI) │ │ Image Service │ │ Config Manager │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ - TAP devices │ │ - OCI pull │ │ - TOML config │ │ │
│ │ │ - Bridge setup │ │ - Layer flatten │ │ - Defaults │ │ │
│ │ │ - IP assignment │ │ - ext4 creation │ │ - Validation │ │ │
│ │ └──────────────────┘ └──────────────────┘ └──────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ fc-agent (in VM) │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ vsock Server │ │ Container Mgr │ │ Stats/Cgroups │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ - JSON-RPC proto │ │ - runc create │ │ - CPU usage │ │ │
│ │ │ - Request router │ │ - runc start │ │ - Memory usage │ │ │
│ │ │ - Conn handling │ │ - runc exec │ │ - I/O stats │ │ │
│ │ └──────────────────┘ └──────────────────┘ └──────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Domain Model
Following domain-driven design principles, the core domain is modeled around these key concepts:
Aggregate Roots
Sandbox (Pod Sandbox / microVM)
The Sandbox is the aggregate root representing a Firecracker microVM that hosts one or more containers.
type Sandbox struct {
// Identity
ID string
Name string
Namespace string
// VM State
State SandboxState // Pending → Ready → Stopped
VM *firecracker.Machine
VMConfig VMConfig
// Communication
VsockPath string
VsockCID uint32
AgentConn net.Conn
// Networking
NetworkNamespace string
IP net.IP
// Containers within this sandbox
Containers map[string]*Container
}
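The VsockPath field suggests the host reaches the guest through the Unix socket that Firecracker exposes for its vsock device. Firecracker documents a short text handshake for host-initiated connections: the host writes `CONNECT <port>\n` and waits for a line starting with `OK `. The following is a minimal sketch of that handshake, assuming fc-cri uses this mechanism; `vsockHandshake`, `fakeConn`, and port 1024 are hypothetical names and values chosen for illustration, and the fake connection stands in for a real `net.Dial("unix", sandbox.VsockPath)`.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// readLine reads one byte at a time up to '\n' so the handshake never
// consumes bytes that belong to the traffic following it.
func readLine(r io.Reader) (string, error) {
	var sb strings.Builder
	b := make([]byte, 1)
	for {
		if _, err := io.ReadFull(r, b); err != nil {
			return "", err
		}
		if b[0] == '\n' {
			return sb.String(), nil
		}
		sb.WriteByte(b[0])
	}
}

// vsockHandshake sends "CONNECT <port>\n" and expects a reply line that
// starts with "OK ". The function name is hypothetical.
func vsockHandshake(conn io.ReadWriter, port uint32) error {
	if _, err := fmt.Fprintf(conn, "CONNECT %d\n", port); err != nil {
		return fmt.Errorf("send CONNECT: %w", err)
	}
	reply, err := readLine(conn)
	if err != nil {
		return fmt.Errorf("read handshake reply: %w", err)
	}
	if !strings.HasPrefix(reply, "OK ") {
		return fmt.Errorf("unexpected handshake reply %q", reply)
	}
	return nil
}

// fakeConn stands in for the Unix socket: reads come from a canned reply,
// writes are captured so they can be inspected.
type fakeConn struct {
	reply io.Reader
	sent  bytes.Buffer
}

func newFakeConn(reply string) *fakeConn {
	return &fakeConn{reply: strings.NewReader(reply)}
}

func (f *fakeConn) Read(p []byte) (int, error)  { return f.reply.Read(p) }
func (f *fakeConn) Write(p []byte) (int, error) { return f.sent.Write(p) }

func main() {
	conn := newFakeConn("OK 1073741824\n")
	// Port 1024 is an assumed agent port, not taken from the document.
	if err := vsockHandshake(conn, 1024); err != nil {
		fmt.Println("handshake failed:", err)
		return
	}
	fmt.Printf("sent %q\n", conn.sent.String())
}
```

Reading the reply byte by byte keeps the sketch simple and avoids buffering past the newline; a production client would more likely keep a single buffered reader for the lifetime of the connection.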
Container
A Container represents a container running inside a Sandbox (microVM).
type Container struct {
ID string
SandboxID string
Name string
Image string
State ContainerState // Created → Running → Exited
PID int
ExitCode int32
// Configuration
Command []string
Env []string
Mounts []Mount
Resources ResourceConfig
}
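The State comment describes a linear lifecycle, Created → Running → Exited. One way to enforce that ordering is a small transition table, sketched below. The `Transition` method and the `nextState` map are hypothetical; the sketch allows only the two transitions named in the comment, and a real implementation may need more (for example, a container that fails before it starts).

```go
package main

import "fmt"

// ContainerState mirrors the lifecycle named in the struct comment.
type ContainerState int

const (
	Created ContainerState = iota
	Running
	Exited
)

func (s ContainerState) String() string {
	switch s {
	case Created:
		return "Created"
	case Running:
		return "Running"
	case Exited:
		return "Exited"
	}
	return "Unknown"
}

// nextState lists the single successor of each non-terminal state.
// Exited has no entry, so it is terminal.
var nextState = map[ContainerState]ContainerState{
	Created: Running,
	Running: Exited,
}

// Transition returns the new state if the move is allowed, or the
// unchanged state and an error if it is not.
func (s ContainerState) Transition(to ContainerState) (ContainerState, error) {
	if next, ok := nextState[s]; ok && next == to {
		return to, nil
	}
	return s, fmt.Errorf("invalid container state transition %s -> %s", s, to)
}

func main() {
	state := Created
	// The last step (Exited -> Running) is rejected.
	for _, to := range []ContainerState{Running, Exited, Running} {
		next, err := state.Transition(to)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println(state, "->", next)
		state = next
	}
}
```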
Value Objects
VMConfig
Immutable configuration for creating a Firecracker VM.
type VMConfig struct {
VcpuCount int64 // Default: 1
MemoryMB int64 // Default: 128
KernelPath string
KernelArgs string
RootDrive DriveConfig
NetworkMode string // "cni" or "none"
VsockEnabled bool
}
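Because VMConfig is described as immutable and carries documented defaults (1 vCPU, 128 MB), defaulting and validation can be written as value-receiver methods that return a copy instead of mutating the original. The sketch below uses a trimmed copy of the struct. `WithDefaults` and `Validate` are hypothetical method names, treating "cni" as the default NetworkMode is an assumption, and the kernel path in `main` is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

// VMConfig is a trimmed copy of the struct above, kept small for the sketch.
type VMConfig struct {
	VcpuCount   int64
	MemoryMB    int64
	KernelPath  string
	NetworkMode string // "cni" or "none"
}

// WithDefaults returns a copy with zero-valued fields filled in.
// The receiver is a value, so the caller's config is never modified.
func (c VMConfig) WithDefaults() VMConfig {
	if c.VcpuCount == 0 {
		c.VcpuCount = 1
	}
	if c.MemoryMB == 0 {
		c.MemoryMB = 128
	}
	if c.NetworkMode == "" {
		c.NetworkMode = "cni" // assumed default, not stated in the document
	}
	return c
}

// Validate reports every problem it finds, joined into one error.
func (c VMConfig) Validate() error {
	var errs []error
	if c.VcpuCount < 1 {
		errs = append(errs, fmt.Errorf("VcpuCount must be >= 1, got %d", c.VcpuCount))
	}
	if c.MemoryMB < 1 {
		errs = append(errs, fmt.Errorf("MemoryMB must be >= 1, got %d", c.MemoryMB))
	}
	if c.KernelPath == "" {
		errs = append(errs, errors.New("KernelPath is required"))
	}
	if c.NetworkMode != "cni" && c.NetworkMode != "none" {
		errs = append(errs, fmt.Errorf("NetworkMode must be \"cni\" or \"none\", got %q", c.NetworkMode))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	// "/path/to/vmlinux" is a placeholder path.
	cfg := VMConfig{KernelPath: "/path/to/vmlinux"}.WithDefaults()
	fmt.Printf("%+v valid=%v\n", cfg, cfg.Validate() == nil)

	bad := VMConfig{NetworkMode: "bridge"}.WithDefaults()
	fmt.Println(bad.Validate())
}
```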
Domain Services
| Service | Responsibility |
|---|---|
| VMManager | Create, stop, destroy Firecracker VMs |
| VMPool | Pre-warm VMs for fast acquisition |
| AgentClient | Communicate with guest agent via vsock |
| NetworkService | CNI-based network setup and teardown |
| ImageService | OCI image pull and block device conversion |
Key Design Decisions
1. containerd Shim v2 (Not Full CRI Server)
Decision: Implement a containerd shim rather than a standalone CRI server.
Rationale:
- containerd handles CRI protocol, image management, and storage
- Shim only handles runtime-specific logic
- Less code to maintain, better integration
- containerd manages shim lifecycle automatically
Trade-off: Tightly coupled to containerd (can't use with CRI-O directly)
2. VM Pooling for Fast Starts
Decision: Pre-warm VMs and maintain a pool of ready-to-use instances.
Rationale:
- VM creation is the slowest part (~150ms even with Firecracker)
- Pool provides O(1) VM acquisition
- Enables <50ms pod start times
Implementation:
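type Pool struct {
    mu        sync.Mutex          // Guards inUse
    available chan *Sandbox       // Ready VMs
    inUse     map[string]*Sandbox // VMs handed out, keyed by sandbox ID
    manager   VMManager           // Creates VMs when the pool is empty
    config    PoolConfig
}
func (p *Pool) Acquire(ctx context.Context, config VMConfig) (*Sandbox, error) {
    var (
        sandbox *Sandbox
        err     error
    )
    select {
    case warm := <-p.available:
        // Got pre-warmed VM - customize and return
        sandbox, err = p.customizeVM(warm, config)
    default:
        // Pool empty - create fresh
        sandbox, err = p.manager.CreateVM(ctx, config)
    }
    if err != nil {
        return nil, err
    }
    // Record the VM as in use so Release can find it later
    p.mu.Lock()
    p.inUse[sandbox.ID] = sandbox
    p.mu.Unlock()
    return sandbox, nil
}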
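The component diagram describes Release() as "return to pool or destroy". A minimal sketch of that counterpart to Acquire is shown below, using the same non-blocking `select` on a buffered channel. The `vm` and `pool` types are simplified stand-ins, not the Pool and Sandbox types above, and `destroy` only counts calls where a real pool would stop the VM.

```go
package main

import "fmt"

// vm is a stand-in for a pooled sandbox.
type vm struct{ id int }

// pool keeps ready VMs in a buffered channel whose capacity is the pool size.
// The destroyed counter is not goroutine-safe; it exists only for the demo.
type pool struct {
	available chan *vm
	destroyed int
}

func newPool(size int) *pool {
	return &pool{available: make(chan *vm, size)}
}

// Acquire takes a ready VM without blocking; ok is false when the pool is empty.
func (p *pool) Acquire() (v *vm, ok bool) {
	select {
	case v = <-p.available:
		return v, true
	default:
		return nil, false
	}
}

// Release puts the VM back if the pool has room and destroys it otherwise.
// It reports whether the VM was kept.
func (p *pool) Release(v *vm) bool {
	select {
	case p.available <- v:
		return true
	default:
		p.destroy(v)
		return false
	}
}

// destroy stands in for stopping the VM and freeing its resources.
func (p *pool) destroy(v *vm) {
	p.destroyed++
}

func main() {
	p := newPool(2)
	for i := 1; i <= 3; i++ {
		fmt.Printf("release vm %d: kept=%v\n", i, p.Release(&vm{id: i}))
	}
	v, ok := p.Acquire()
	fmt.Println("acquired:", v.id, ok, "destroyed:", p.destroyed)
}
```

Both operations are O(1) channel operations, which matches the acquisition cost claimed in the rationale.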
Configuration:
[pool]
enabled = true
size = 5 # VMs to keep ready
min_size = 2 # Minimum to maintain
max_idle_time = "5m"
3. Minimal Guest Agent
Decision: Build a custom minimal agent instead of using kata-agent.
Rationale:
- kata-agent is ~50MB, ours is ~2-3MB
- Simple JSON-RPC protocol over vsock
- Only implements what we need
- Static binary, no runtime dependencies
Protocol:
// Request
{"id": 1, "method": "create_container", "params": {"id": "abc", "bundle": "/..."}}
// Response
{"id": 1, "result": {"status": "created"}}
Supported Methods:
- ping - Health check
- create_container - Create container via runc
- start_container - Start container, return PID
- stop_container - Stop with timeout, then SIGKILL
- remove_container - Delete container
- exec_sync - Synchronous exec
- get_stats - Cgroup statistics
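The request and response shapes shown under Protocol map directly onto Go structs with encoding/json. The sketch below encodes a create_container request and decodes the matching response. The struct and function names are hypothetical, and the `error` field on the response is an assumption, since the document only shows a successful result.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request and Response follow the wire format shown above.
type Request struct {
	ID     uint64          `json:"id"`
	Method string          `json:"method"`
	Params json.RawMessage `json:"params,omitempty"`
}

type Response struct {
	ID     uint64          `json:"id"`
	Result json.RawMessage `json:"result,omitempty"`
	Error  string          `json:"error,omitempty"` // assumed field, not in the document
}

// createParams and statusResult model the example payloads.
type createParams struct {
	ID     string `json:"id"`
	Bundle string `json:"bundle"`
}

type statusResult struct {
	Status string `json:"status"`
}

// encodeRequest marshals params first so any method can reuse Request.
func encodeRequest(id uint64, method string, params any) ([]byte, error) {
	raw, err := json.Marshal(params)
	if err != nil {
		return nil, fmt.Errorf("marshal params: %w", err)
	}
	return json.Marshal(Request{ID: id, Method: method, Params: raw})
}

// decodeResponse checks that the reply matches the request ID, surfaces an
// agent-side error if present, and unmarshals the result into out.
func decodeResponse(data []byte, wantID uint64, out any) error {
	var resp Response
	if err := json.Unmarshal(data, &resp); err != nil {
		return fmt.Errorf("unmarshal response: %w", err)
	}
	if resp.ID != wantID {
		return fmt.Errorf("response id %d does not match request id %d", resp.ID, wantID)
	}
	if resp.Error != "" {
		return fmt.Errorf("agent error: %s", resp.Error)
	}
	if len(resp.Result) == 0 {
		return nil // nothing to decode
	}
	return json.Unmarshal(resp.Result, out)
}

func main() {
	// The bundle path is kept elided, exactly as in the example above.
	req, err := encodeRequest(1, "create_container", createParams{ID: "abc", Bundle: "/..."})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(string(req))

	var res statusResult
	reply := []byte(`{"id": 1, "result": {"status": "created"}}`)
	if err := decodeResponse(reply, 1, &res); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("status:", res.Status)
}
```

Matching the response ID against the request ID matters once several calls share one vsock connection; with a strict one-request-at-a-time client it is only a sanity check.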
4. Block Device Storage (Not Overlayfs)
Decision: Convert OCI images to ext4 block devices.
Rationale:
- Firecracker doesn't support filesystem sharing (no 9p, no virtiofs)
- Block devices (virtio-blk) are fast and simple
- Sparse files minimize disk usage
Flow: OCI image pull → layer flatten → ext4 image creation → attach to the VM as a virtio-blk device.
Future Optimization: Use device mapper thin provisioning for copy-on-write efficiency.
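The sparse-file point in the rationale can be illustrated with a short sketch: truncating a new file to its final size sets the logical length without writing data blocks, so on filesystems that support sparse files the image only consumes disk space as it is filled. `createSparseImage` and `demoImage` are hypothetical helpers, the 64 MiB size is arbitrary, and formatting the file (for example with mkfs.ext4) is left out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// createSparseImage creates a new file with the given logical size.
// Truncate extends the file without writing data, so no blocks are
// allocated up front on filesystems that support sparse files.
func createSparseImage(path string, sizeBytes int64) error {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return fmt.Errorf("create image: %w", err)
	}
	if err := f.Truncate(sizeBytes); err != nil {
		f.Close()
		os.Remove(path)
		return fmt.Errorf("size image: %w", err)
	}
	return f.Close()
}

// demoImage creates a 64 MiB image in a temporary directory and returns
// its logical size. The directory is removed before returning.
func demoImage() (int64, error) {
	dir, err := os.MkdirTemp("", "fc-image-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "rootfs.ext4")
	if err := createSparseImage(path, 64<<20); err != nil {
		return 0, err
	}
	// A real pipeline would format the file as ext4 and copy the
	// flattened image layers into it at this point.
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	size, err := demoImage()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("logical size in bytes:", size)
}
```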
5. Minimal Kernel Configuration
Decision: Build a custom minimal kernel (~5MB uncompressed).
Rationale:
- Stock kernels are 30-50MB
- We only need: virtio, vsock, ext4, cgroups, namespaces, netfilter
- Faster boot, smaller memory footprint
Trade-offs:
- Reduced Compatibility: Missing drivers for XFS, ZFS, SCTP, and specialized hardware.
- Mitigation: Users can supply their own kernel via config.toml (see Operations Guide).
Key Config:
CONFIG_VIRTIO_MMIO=y # Firecracker uses MMIO, not PCI
CONFIG_VIRTIO_BLK=y # Block devices
CONFIG_VIRTIO_NET=y # Networking
CONFIG_VIRTIO_VSOCKETS=y # Host communication
CONFIG_EXT4_FS=y # Rootfs
CONFIG_OVERLAY_FS=y # Container layers
CONFIG_CGROUPS=y # Resource limits
CONFIG_NAMESPACES=y # Container isolation
CONFIG_PCI=n # Not needed
CONFIG_USB_SUPPORT=n # Not needed
CONFIG_SOUND=n # Not needed
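Since users may supply their own kernel, a small check that a kernel config enables the options listed above can catch a misbuilt kernel before boot. The sketch below parses `KEY=value` lines and reports required options that are not set to `y`. The function names are hypothetical and the required list in `main` is only a subset of the options above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseKernelConfig reads KEY=value lines and ignores anything after '#'.
// Generated .config files mark disabled options as "# CONFIG_X is not set";
// those lines are comments, so such options are simply absent from the map.
func parseKernelConfig(text string) map[string]string {
	opts := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		key, val, ok := strings.Cut(strings.TrimSpace(line), "=")
		if !ok {
			continue
		}
		opts[strings.TrimSpace(key)] = strings.TrimSpace(val)
	}
	return opts
}

// missingOptions returns the required options that are not built in (=y),
// in the order they were requested.
func missingOptions(opts map[string]string, required []string) []string {
	var missing []string
	for _, key := range required {
		if opts[key] != "y" {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	config := `
CONFIG_VIRTIO_MMIO=y # Firecracker uses MMIO, not PCI
CONFIG_VIRTIO_BLK=y  # Block devices
CONFIG_EXT4_FS=y     # Rootfs
CONFIG_PCI=n         # Not needed
`
	required := []string{
		"CONFIG_VIRTIO_MMIO",
		"CONFIG_VIRTIO_BLK",
		"CONFIG_VIRTIO_VSOCKETS",
		"CONFIG_EXT4_FS",
	}
	opts := parseKernelConfig(config)
	fmt.Println("missing:", missingOptions(opts, required))
}
```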
Implementation Status
Completed Components
| Component | Description |
|---|---|
| Domain Model | Core entities, value objects, and service interfaces. |
| VM Manager | Lifecycle management (create, stop, destroy) using firecracker-go-sdk. |
| VM Pool | Pre-warming, acquisition, release, and auto-replenishment logic. |
| Shim Service | Implementation of containerd shim v2 task API. |
| Agent Client | Host-side JSON-RPC client for guest communication. |
| Guest Agent | Minimal static binary handling vsock communication and runc integration. |
| CNI Network | Network namespace management and CNI plugin invocation. |
| Image Service | OCI image pull and conversion to ext4 block devices (fsify). |
| Hot-Attach | Dynamic attachment of workload rootfs to pooled VMs. |
| Snapshot Restore | Fast VM restoration from memory snapshots. |
| Jailer | Production security hardening (chroot, cgroups, seccomp). |
| Metrics | Prometheus metrics for pool stats, latencies, and errors. |
| CLI Tool | fcctl for inspection and debugging. |
Future Work
| Component | Priority | Notes |
|---|---|---|
| Devmapper | High | Alternative storage backend for thin provisioning. |
| ARM64 | Medium | Support for Graviton/Ampere instances. |
| GPU | Low | Passthrough support for ML workloads. |
| PVM | Low | Pagetable-based Virtual Machine (Research). Support for running without nested virtualization. |
File Structure
firecracker-cri/
├── cmd/
│ ├── containerd-shim-fc-v2/
│ │ └── main.go # Shim entry point
│ └── fc-agent/
│ └── main.go # Guest agent (static binary)
│
├── pkg/
│ ├── domain/
│ │ └── types.go # Core domain model
│ ├── vm/
│ │ ├── manager.go # VM lifecycle management
│ │ └── pool.go # Pre-warming pool
│ ├── shim/
│ │ └── service.go # containerd shim v2 service
│ ├── agent/
│ │ └── client.go # Guest agent client
│ ├── network/
│ │ └── cni.go # CNI integration
│ └── image/
│ └── rootfs.go # OCI to block device
│
├── kernel/
│ ├── config-minimal # Kernel configuration
│ └── build.sh # Kernel build script
│
├── config/
│ ├── fc-cri.toml # Runtime configuration
│ └── containerd-fc.toml # containerd integration
│
├── deploy/
│ └── kubernetes/
│ ├── runtime-class.yaml # Kubernetes RuntimeClass
│ └── example-pod.yaml # Usage examples
│
├── scripts/
│ └── create-rootfs.sh # Base rootfs creation
│
├── docs/
│ └── ARCHITECTURE.md # This document
│
├── go.mod
├── Makefile
└── README.md
Getting Started
Prerequisites
# Required
- Linux with KVM support (check: ls /dev/kvm)
- containerd 1.6+
- Go 1.22+
- Root access for installation
# Optional
- Docker (for building rootfs)
- crictl (for testing)
Build & Install
# Clone the repository
git clone https://github.com/pipeops/firecracker-cri.git
cd firecracker-cri
# Build binaries
make build
# Build kernel (one-time, ~10 minutes)
make kernel
# Create base rootfs
make rootfs
# Install
sudo make install
# Restart containerd
sudo systemctl restart containerd
Test with Kubernetes
# Apply RuntimeClass
kubectl apply -f deploy/kubernetes/runtime-class.yaml
# Label node as fc-cri enabled
kubectl label node <node-name> fc-cri.io/enabled=true
# Run a test pod
kubectl apply -f deploy/kubernetes/example-pod.yaml
# Check pod status
kubectl get pod secure-workload
# Verify it's running in a Firecracker VM
kubectl describe pod secure-workload | grep -A5 "Events"
Performance Tuning
Pool Sizing
[pool]
size = 10 # For high-throughput: increase
min_size = 5 # For consistent latency: increase
max_idle_time = "10m" # For cost savings: decrease
Memory Optimization
[vm]
memory_mb = 64 # Minimum for small workloads
# memory_mb = 128 # Default, good for most
# memory_mb = 256 # For memory-heavy apps
Kernel Args
kernel_args = "console=ttyS0 reboot=k panic=1 pci=off quiet loglevel=0"
# 'quiet loglevel=0' (included above) reduces console output for a faster boot
Security Considerations
VM Isolation
- Each pod runs in a separate Firecracker microVM
- Hardware-level isolation via KVM
- Separate kernel, memory space, and filesystem
Jailer (Production)
Enable the jailer for additional security:
The jailer provides:
- Chroot isolation for Firecracker process
- Seccomp filtering
- Cgroup enforcement
- Dropped privileges
Network Isolation
- Each VM gets its own network namespace
- CNI handles network policy enforcement
- No shared network stack with host
Comparison with Alternatives
| Feature | fc-cri | Kata Containers | gVisor |
|---|---|---|---|
| Isolation | VM | VM | User-space kernel (syscall interception) |
| Memory overhead | 64-128MB | 160MB+ | 50-100MB |
| Cold start | <150ms | 500ms+ | <100ms |
| Compatibility | High | High | Medium |
| Hypervisor | Firecracker | Multiple | None |
| Complexity | Low | High | Medium |
| Production ready | In progress | Yes | Yes |
Contributing
- Fork the repository
- Create a feature branch
- Write tests
- Submit a pull request
Code Style
- Follow Go conventions
- Domain-driven design principles
- Clear separation of concerns
- Comprehensive error handling
License
Apache 2.0