Table of Contents
- What is it?
- Background
- Out of scope
- Design
- API
- Networking
- Storage
- Devices
- Developers
- Persistent storage plugin support
- Experimental features
What is it?
virtcontainers is a Go library that can be used to build hardware-virtualized container
runtimes.
Background
The few existing VM-based container runtimes (Clear Containers, runV, rkt's
KVM stage 1) all share the same hardware virtualization semantics but use different
code bases to implement them. virtcontainers's goal is to factorize this code into
a common Go library.
Ideally, VM-based container runtime implementations would become translation
layers from the runtime specification they implement (e.g. the OCI runtime-spec
or the Kubernetes CRI) to the virtcontainers API.
virtcontainers was used as a foundational package for the Clear Containers runtime implementation.
Out of scope
Implementing a container runtime is out of scope for this project. Any tools or executables in this repository are only provided for demonstration or testing purposes.
virtcontainers and Kubernetes CRI
virtcontainers's API is loosely inspired by the Kubernetes CRI because
we believe it provides the right level of abstractions for containerized sandboxes.
However, despite the API similarities between the two projects, the goal of
virtcontainers is not to build a CRI implementation, but instead to provide a
generic, runtime-specification agnostic, hardware-virtualized containers
library that other projects could leverage to implement CRI themselves.
Design
Sandboxes
The virtcontainers execution unit is a sandbox, i.e. virtcontainers users start sandboxes where
containers will be running.
virtcontainers creates a sandbox by starting a virtual machine and setting the sandbox
up within that environment. Starting a sandbox means launching all containers within
the VM sandbox runtime environment.
Hypervisors
The virtcontainers package relies on hypervisors to start and stop the virtual machines where
sandboxes will be running. A hypervisor is defined by a Hypervisor interface implementation,
and the default implementation is the QEMU one.
Update cloud-hypervisor client code
See docs
Agents
During the lifecycle of a container, the runtime running on the host needs to interact with
the virtual machine guest OS in order to start new commands to be executed as part of a given
container workload, set new networking routes or interfaces, fetch a container standard or
error output, and so on.
There are many existing and potential solutions to resolve that problem and virtcontainers abstracts
this through the Agent interface.
Shim
In some cases the runtime will need a translation shim between the higher level container stack (e.g. Docker) and the virtual machine holding the container workload. This is needed for container stacks that make strong assumptions on the nature of the container they're monitoring. In cases where they assume containers are simply regular host processes, a shim layer is needed to translate host specific semantics into e.g. agent controlled virtual machine ones.
Proxy
When hardware virtualized containers have limited I/O multiplexing capabilities, runtimes may decide to rely on an external host proxy to support cases where several runtime instances are talking to the same container.
API
The high level virtcontainers API is the following:
Sandbox API
- CreateSandbox(sandboxConfig SandboxConfig) creates a Sandbox. The virtual machine is started and the Sandbox is prepared.
- DeleteSandbox(sandboxID string) deletes a Sandbox. The virtual machine is shut down and all information related to the Sandbox is removed. The function will fail if the Sandbox is running. In that case StopSandbox() has to be called first.
- StartSandbox(sandboxID string) starts an already created Sandbox. The Sandbox and all its containers are started.
- RunSandbox(sandboxConfig SandboxConfig) creates and starts a Sandbox. This performs CreateSandbox() + StartSandbox().
- StopSandbox(sandboxID string) stops an already running Sandbox. The Sandbox and all its containers are stopped.
- PauseSandbox(sandboxID string) pauses an existing Sandbox.
- ResumeSandbox(sandboxID string) resumes a paused Sandbox.
- StatusSandbox(sandboxID string) returns a detailed Sandbox status.
- ListSandbox() lists all Sandboxes on the host. It returns a detailed status for every Sandbox.
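The ordering constraint described above (a running Sandbox must be stopped before it can be deleted) can be sketched as a small state machine. The types and names below are illustrative stubs written for this sketch, not the actual virtcontainers implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// sandboxState models the lifecycle implied by the API above:
// CreateSandbox -> StartSandbox -> StopSandbox -> DeleteSandbox.
type sandboxState int

const (
	created sandboxState = iota
	running
	stopped
)

type sandbox struct {
	id    string
	state sandboxState
}

// deleteSandbox mirrors the documented rule: deleting a running
// Sandbox fails, and StopSandbox() has to be called first.
func (s *sandbox) deleteSandbox() error {
	if s.state == running {
		return errors.New("sandbox is running, call StopSandbox() first")
	}
	return nil
}

func main() {
	s := &sandbox{id: "sandbox-0", state: running}
	fmt.Println(s.deleteSandbox())        // error: still running
	s.state = stopped                     // StopSandbox()
	fmt.Println(s.deleteSandbox() == nil) // true
}
```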
Container API
- CreateContainer(sandboxID string, containerConfig ContainerConfig) creates a Container on an existing Sandbox.
- DeleteContainer(sandboxID, containerID string) deletes a Container from a Sandbox. If the Container is running it has to be stopped first.
- StartContainer(sandboxID, containerID string) starts an already created Container. The Sandbox has to be running.
- StopContainer(sandboxID, containerID string) stops an already running Container.
- EnterContainer(sandboxID, containerID string, cmd Cmd) enters an already running Container and runs a given command.
- StatusContainer(sandboxID, containerID string) returns a detailed Container status.
- KillContainer(sandboxID, containerID string, signal syscall.Signal, all bool) sends a signal to all or one container inside a Sandbox.
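A typical caller chains the two APIs: prepare and start the Sandbox, then create and start Containers inside it. The functions below are stand-in stubs for this sketch (the real calls live in the virtcontainers package and return richer types), only the call order reflects the API described above:

```go
package main

import "fmt"

// Illustrative stubs standing in for the virtcontainers calls above.
func createSandbox(id string) error         { fmt.Println("CreateSandbox:", id); return nil }
func startSandbox(id string) error          { fmt.Println("StartSandbox:", id); return nil }
func createContainer(sid, cid string) error { fmt.Println("CreateContainer:", sid, cid); return nil }
func startContainer(sid, cid string) error  { fmt.Println("StartContainer:", sid, cid); return nil }

func main() {
	sid, cid := "sandbox-0", "container-0"
	// The Sandbox has to be running before StartContainer() is called,
	// hence startSandbox comes before the container steps.
	for _, err := range []error{
		createSandbox(sid),
		startSandbox(sid),
		createContainer(sid, cid),
		startContainer(sid, cid),
	} {
		if err != nil {
			panic(err)
		}
	}
}
```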
An example tool using the virtcontainers API is provided in the hack/virtc package.
For further details, see the API documentation.
Networking
virtcontainers supports the two major container networking models: the Container Network Model (CNM) and the Container Network Interface (CNI).
Typically the former is the Docker default networking model, while the latter is used on Kubernetes deployments.
CNM
CNM lifecycle
- RequestPool
- CreateNetwork
- RequestAddress
- CreateEndPoint
- CreateContainer
- Create config.json
- Create PID and network namespace
- ProcessExternalKey
- JoinEndPoint
- LaunchContainer
- Launch
- Run container
Runtime network setup with CNM
- Read config.json
- Create the network namespace (code)
- Call the prestart hook (from inside the netns) (code)
- Scan network interfaces inside netns and get the name of the interface created by prestart hook (code)
- Create bridge, TAP, and link all together with network interface previously created (code)
- Start VM inside the netns and start the container (code)
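The interface scan step above can be sketched with the Go standard library. This minimal version simply lists non-loopback interfaces in the current namespace; the real virtcontainers code performs this scan from inside the sandbox netns, after the prestart hook has run:

```go
package main

import (
	"fmt"
	"net"
)

// findHookInterfaces lists candidate interfaces the prestart hook may
// have created: every interface except the loopback device.
func findHookInterfaces() ([]string, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	var names []string
	for _, iface := range ifaces {
		if iface.Flags&net.FlagLoopback != 0 {
			continue // skip lo
		}
		names = append(names, iface.Name)
	}
	return names, nil
}

func main() {
	names, err := findHookInterfaces()
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

Scanning like this (rather than being told the interface name) is exactly the latency cost called out in the drawbacks below.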
Drawbacks of CNM
There are three drawbacks to using CNM instead of CNI:
- The way we call into it is not very explicit: we have to re-exec the dockerd binary so that it can accept parameters and execute the prestart hook related to network setup.
- Implicit way to designate the network namespace: instead of explicitly giving the netns to dockerd, we give it the PID of our runtime so that it can find the netns from this PID. This means we have to make sure we are in the right netns while calling the hook, otherwise the VETH pair will be created in the wrong netns.
- No results come back from the hook: we have to scan the network interfaces to discover which one has been created inside the netns. This introduces more latency because it forces us to scan the network in the CreateSandbox path, which is critical for starting the VM as quickly as possible.
Storage
Container workloads are shared with the virtualized environment through 9pfs. The devicemapper storage driver is a special case. The driver uses dedicated block devices rather than formatted filesystems, and operates at the block level rather than the file level. This knowledge has been used to directly use the underlying block device instead of the overlay file system for the container root file system. The block device maps to the top read-write layer for the overlay. This approach gives much better I/O performance compared to using 9pfs to share the container file system.
The approach above does introduce a limitation in terms of dynamic file copy in/out of the container via docker cp operations.
The copy operation from host to container accesses the mounted file system on the host side. This is not expected to work and may lead to inconsistencies as the block device will be simultaneously written to, from two different mounts.
The copy operation from container to host will work, provided the user calls sync(1) from within the container prior to the copy to make sure any outstanding cached data is written to the block device.
docker cp [OPTIONS] CONTAINER:SRC_PATH HOST:DEST_PATH
docker cp [OPTIONS] HOST:SRC_PATH CONTAINER:DEST_PATH
The ability to hotplug block devices has been added, which makes it possible to use block devices for containers started after the VM has been launched.
How to check if a container uses a devicemapper block device as its rootfs
Start a container and call mount(8) within it. You should see / mounted on the /dev/vda device.
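This check can also be scripted by parsing the mount table and looking at the device backing /. The sketch below runs against a sample /proc/mounts-style string rather than the live file, so the expected result is fixed:

```go
package main

import (
	"fmt"
	"strings"
)

// rootDevice returns the device mounted on "/" from /proc/mounts-style
// content, where each line is "<device> <mountpoint> <fstype> ...".
func rootDevice(mounts string) string {
	for _, line := range strings.Split(mounts, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[1] == "/" {
			return fields[0]
		}
	}
	return ""
}

func main() {
	// Sample mount table as seen from inside a devicemapper-backed container.
	sample := "/dev/vda / ext4 rw,relatime 0 0\nproc /proc proc rw 0 0"
	fmt.Println(rootDevice(sample)) // /dev/vda
}
```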
Devices
Support has been added to pass VFIO assigned devices on the docker command line with --device. Support for passing other devices, including block devices, with --device has not been added yet.
How to pass a device using VFIO-passthrough
- Requirements
An IOMMU group represents the smallest set of devices for which the IOMMU has visibility and which is isolated from other groups. VFIO uses this information to enforce safe ownership of devices for userspace.
You will need Intel VT-d capable hardware. Check if IOMMU is enabled in your host
kernel by verifying CONFIG_VFIO_NOIOMMU is not in the kernel configuration. If it is set,
you will need to rebuild your kernel.
The following kernel configuration options need to be enabled:
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO=m
CONFIG_VFIO_PCI=m
In addition, you need to pass intel_iommu=on on the kernel command line.
- Identify the BDF (Bus-Device-Function) of the PCI device to be assigned.
$ lspci -D | grep -e Ethernet -e Network
0000:01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
$ BDF=0000:01:00.0
- Find vendor and device id.
$ lspci -n -s $BDF
01:00.0 0200: 8086:1528 (rev 01)
- Find IOMMU group.
$ readlink /sys/bus/pci/devices/$BDF/iommu_group
../../../../kernel/iommu_groups/16
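The group number is simply the last path component of that symlink target. A small helper can extract it; the sample link below matches the readlink output shown above:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// iommuGroup extracts the group number from the symlink target of
// /sys/bus/pci/devices/$BDF/iommu_group.
func iommuGroup(linkTarget string) string {
	return filepath.Base(linkTarget)
}

func main() {
	fmt.Println(iommuGroup("../../../../kernel/iommu_groups/16")) // 16
}
```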
- Unbind the device from host driver.
$ echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
- Bind the device to vfio-pci.
$ sudo modprobe vfio-pci
$ echo 8086 1528 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
$ echo $BDF | sudo tee --append /sys/bus/pci/drivers/vfio-pci/bind
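The unbind/bind sequence above reduces to three sysfs writes. This sketch only builds the (path, payload) pairs for a given BDF and vendor:device ID; it does not touch sysfs, since applying them requires root:

```go
package main

import "fmt"

// vfioBindWrites returns the sysfs (path, payload) writes that move a
// device from its host driver to vfio-pci, mirroring the shell steps
// above. Nothing is written here.
func vfioBindWrites(bdf, vendor, device string) [][2]string {
	return [][2]string{
		{"/sys/bus/pci/devices/" + bdf + "/driver/unbind", bdf},
		{"/sys/bus/pci/drivers/vfio-pci/new_id", vendor + " " + device},
		{"/sys/bus/pci/drivers/vfio-pci/bind", bdf},
	}
}

func main() {
	for _, w := range vfioBindWrites("0000:01:00.0", "8086", "1528") {
		fmt.Printf("echo %q > %s\n", w[1], w[0])
	}
}
```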
- Check /dev/vfio
$ ls /dev/vfio
16 vfio
- Start a Clear Containers container passing the VFIO group on the docker command line.
docker run -it --device=/dev/vfio/16 centos/tools bash
- Running lspci within the container should show the device among the PCI devices. The driver for the device needs to be present within the Clear Containers kernel. If the driver is missing, you can add it to your custom container kernel using the osbuilder tooling.
Developers
For information on how to build, develop and test virtcontainers, see the
developer documentation.
Persistent storage plugin support
See the persistent storage plugin documentation.
Experimental features
See the experimental features documentation.

