Table of Contents
- What is it?
- Background
- Out of scope
- Design
- API
- Networking
- Storage
- Devices
- Developers
- Persistent storage plugin support
- Experimental features
What is it?
virtcontainers is a Go library that can be used to build hardware-virtualized container
runtimes.
Background
The few existing VM-based container runtimes (Clear Containers, runV, rkt's
KVM stage 1) all share the same hardware virtualization semantics but use different
code bases to implement them. The goal of virtcontainers is to factor this code out into
a common Go library.
Ideally, VM-based container runtime implementations would become translation
layers from the runtime specification they implement (e.g. the OCI runtime-spec
or the Kubernetes CRI) to the virtcontainers API.
virtcontainers was used as a foundational package for the Clear Containers runtime implementation.
Out of scope
Implementing a container runtime is out of scope for this project. Any tools or executables in this repository are only provided for demonstration or testing purposes.
virtcontainers and Kubernetes CRI
virtcontainers's API is loosely inspired by the Kubernetes CRI because
we believe it provides the right level of abstractions for containerized sandboxes.
However, despite the API similarities between the two projects, the goal of
virtcontainers is not to build a CRI implementation, but instead to provide a
generic, runtime-specification agnostic, hardware-virtualized containers
library that other projects could leverage to implement CRI themselves.
Design
Sandboxes
The virtcontainers execution unit is a sandbox, i.e. virtcontainers users start sandboxes where
containers will be running.
virtcontainers creates a sandbox by starting a virtual machine and setting the sandbox
up within that environment. Starting a sandbox means launching all containers within
the VM sandbox runtime environment.
Hypervisors
The virtcontainers package relies on hypervisors to start and stop virtual machines where
sandboxes will be running. A hypervisor is defined by a Hypervisor interface implementation,
and the default implementation is the QEMU one.
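As an illustration of how the hypervisor choice surfaces when building a sandbox, here is a minimal, hypothetical sketch of a SandboxConfig selecting the default QEMU implementation. The import path, the QemuHypervisor constant, and the HypervisorConfig field names (KernelPath, ImagePath, HypervisorPath) are assumptions based on typical virtcontainers revisions, and the file paths are placeholders; check the API documentation for the exact types in your tree.

```go
package example

import (
	vc "github.com/containers/virtcontainers" // import path is an assumption
)

// qemuSandboxConfig returns a SandboxConfig selecting the default QEMU
// hypervisor implementation. Field and constant names are assumptions;
// the authoritative definitions live in the API documentation.
func qemuSandboxConfig() vc.SandboxConfig {
	return vc.SandboxConfig{
		HypervisorType: vc.QemuHypervisor,
		HypervisorConfig: vc.HypervisorConfig{
			KernelPath:     "/usr/share/clear-containers/vmlinux.container",    // guest kernel (placeholder path)
			ImagePath:      "/usr/share/clear-containers/clear-containers.img", // guest image (placeholder path)
			HypervisorPath: "/usr/bin/qemu-system-x86_64",                      // hypervisor binary (placeholder path)
		},
	}
}
```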
Update cloud-hypervisor client code
See docs
Agents
During the lifecycle of a container, the runtime running on the host needs to interact with
the virtual machine guest OS in order to start new commands to be executed as part of a given
container workload, set new networking routes or interfaces, fetch a container standard or
error output, and so on.
There are many existing and potential solutions to resolve that problem and virtcontainers abstracts
this through the Agent interface.
Shim
In some cases the runtime will need a translation shim between the higher level container stack (e.g. Docker) and the virtual machine holding the container workload. This is needed for container stacks that make strong assumptions on the nature of the container they're monitoring. In cases where they assume containers are simply regular host processes, a shim layer is needed to translate host specific semantics into e.g. agent controlled virtual machine ones.
Proxy
When hardware virtualized containers have limited I/O multiplexing capabilities, runtimes may decide to rely on an external host proxy to support cases where several runtime instances are talking to the same container.
API
The high-level virtcontainers API is the following:
Sandbox API
- CreateSandbox(sandboxConfig SandboxConfig) creates a Sandbox. The virtual machine is started and the Sandbox is prepared.
- DeleteSandbox(sandboxID string) deletes a Sandbox. The virtual machine is shut down and all information related to the Sandbox is removed. The function will fail if the Sandbox is running; in that case StopSandbox() has to be called first.
- StartSandbox(sandboxID string) starts an already created Sandbox. The Sandbox and all its containers are started.
- RunSandbox(sandboxConfig SandboxConfig) creates and starts a Sandbox. This performs CreateSandbox() + StartSandbox().
- StopSandbox(sandboxID string) stops an already running Sandbox. The Sandbox and all its containers are stopped.
- PauseSandbox(sandboxID string) pauses an existing Sandbox.
- ResumeSandbox(sandboxID string) resumes a paused Sandbox.
- StatusSandbox(sandboxID string) returns a detailed Sandbox status.
- ListSandbox() lists all Sandboxes on the host. It returns a detailed status for every Sandbox.
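A minimal sketch of the sandbox lifecycle using the calls listed above. Only the function names and parameters come from this list; the import path and the return values (a sandbox handle exposing ID(), plus an error) are assumptions, so treat this as illustrative rather than authoritative and see hack/virtc and the API documentation for the exact signatures.

```go
package main

import (
	"fmt"

	vc "github.com/containers/virtcontainers" // import path is an assumption
)

func main() {
	// A fully populated SandboxConfig would describe the hypervisor, the
	// agent and the initial containers; see the API documentation.
	var sandboxConfig vc.SandboxConfig

	// RunSandbox() is equivalent to CreateSandbox() followed by StartSandbox().
	sandbox, err := vc.RunSandbox(sandboxConfig)
	if err != nil {
		fmt.Printf("Could not run sandbox: %v\n", err)
		return
	}

	// Ask for a detailed status of the running sandbox.
	status, err := vc.StatusSandbox(sandbox.ID())
	if err != nil {
		fmt.Printf("Could not get sandbox status: %v\n", err)
		return
	}
	fmt.Printf("Sandbox %s: %+v\n", sandbox.ID(), status)

	// Stop the sandbox and all its containers, then delete it.
	if _, err := vc.StopSandbox(sandbox.ID()); err != nil {
		fmt.Printf("Could not stop sandbox: %v\n", err)
		return
	}
	if _, err := vc.DeleteSandbox(sandbox.ID()); err != nil {
		fmt.Printf("Could not delete sandbox: %v\n", err)
	}
}
```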
Container API
- CreateContainer(sandboxID string, containerConfig ContainerConfig) creates a Container on an existing Sandbox.
- DeleteContainer(sandboxID, containerID string) deletes a Container from a Sandbox. If the Container is running it has to be stopped first.
- StartContainer(sandboxID, containerID string) starts an already created Container. The Sandbox has to be running.
- StopContainer(sandboxID, containerID string) stops an already running Container.
- EnterContainer(sandboxID, containerID string, cmd Cmd) enters an already running Container and runs a given command.
- StatusContainer(sandboxID, containerID string) returns a detailed Container status.
- KillContainer(sandboxID, containerID string, signal syscall.Signal, all bool) sends a signal to all or one container inside a Sandbox.
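Similarly, a hypothetical sketch of the container calls against an already running sandbox. The ContainerConfig and Cmd field names, the import path, and the return value arities assumed here (extra handles plus a final error) are assumptions; only the function names and parameters come from the list above.

```go
package main

import (
	"fmt"
	"syscall"

	vc "github.com/containers/virtcontainers" // import path is an assumption
)

func main() {
	sandboxID := "example-sandbox" // ID of an already running sandbox

	// Describe the container to add. Field names are assumptions; see the
	// API documentation for the authoritative ContainerConfig definition.
	containerConfig := vc.ContainerConfig{
		ID:     "example-container",
		RootFs: "/var/lib/example/rootfs",
		Cmd:    vc.Cmd{Args: []string{"/bin/sh"}},
	}

	// Create the container inside the existing sandbox, then start it.
	if _, _, err := vc.CreateContainer(sandboxID, containerConfig); err != nil {
		fmt.Printf("Could not create container: %v\n", err)
		return
	}
	if _, err := vc.StartContainer(sandboxID, containerConfig.ID); err != nil {
		fmt.Printf("Could not start container: %v\n", err)
		return
	}

	// Run an extra command inside the running container.
	if _, _, _, err := vc.EnterContainer(sandboxID, containerConfig.ID, vc.Cmd{Args: []string{"/bin/ps"}}); err != nil {
		fmt.Printf("Could not enter container: %v\n", err)
	}

	// Signal the workload, then stop and delete the container.
	if err := vc.KillContainer(sandboxID, containerConfig.ID, syscall.SIGTERM, false); err != nil {
		fmt.Printf("Could not kill container: %v\n", err)
	}
	if _, err := vc.StopContainer(sandboxID, containerConfig.ID); err != nil {
		fmt.Printf("Could not stop container: %v\n", err)
		return
	}
	if _, err := vc.DeleteContainer(sandboxID, containerConfig.ID); err != nil {
		fmt.Printf("Could not delete container: %v\n", err)
	}
}
```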
An example tool using the virtcontainers API is provided in the hack/virtc package.
For further details, see the API documentation.
Networking
virtcontainers supports the two major container networking models: the Container Network Model (CNM) and the Container Network Interface (CNI).
Typically the former is the Docker default networking model while the latter is used on Kubernetes deployments.
CNM
CNM lifecycle
1. RequestPool
2. CreateNetwork
3. RequestAddress
4. CreateEndPoint
5. CreateContainer
6. Create config.json
7. Create PID and network namespace
8. ProcessExternalKey
9. JoinEndPoint
10. LaunchContainer
11. Launch
12. Run container
Runtime network setup with CNM
1. Read config.json
2. Create the network namespace (code)
3. Call the prestart hook (from inside the netns) (code)
4. Scan network interfaces inside the netns and get the name of the interface created by the prestart hook (code); see the sketch after this list
5. Create the bridge and TAP device, and link them together with the network interface previously created (code)
6. Start the VM inside the netns and start the container (code)
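Step 4 above can be illustrated with a short sketch. This is not the virtcontainers implementation (the linked code is authoritative); it is a minimal illustration using only the Go standard library, run from a process already inside the target netns, which diffs the interface list against the interfaces we already know about (loopback plus the ones we created ourselves).

```go
package main

import (
	"fmt"
	"net"
)

// findHookInterface returns the name of the first network interface that is
// neither the loopback nor one of the interfaces we created ourselves
// (bridge, TAP). This mirrors the idea of step 4: the prestart hook gives
// nothing back, so the new veth end has to be discovered by scanning.
func findHookInterface(known map[string]bool) (string, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return "", err
	}
	for _, iface := range ifaces {
		if iface.Flags&net.FlagLoopback != 0 {
			continue
		}
		if !known[iface.Name] {
			return iface.Name, nil
		}
	}
	return "", fmt.Errorf("no interface created by the prestart hook was found")
}

func main() {
	// Interfaces we know about before calling the hook (names are examples).
	known := map[string]bool{"br0": true, "tap0": true}

	name, err := findHookInterface(known)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("prestart hook created interface %q\n", name)
}
```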
Drawbacks of CNM
There are three drawbacks to using CNM instead of CNI:
- The way we call into it is not very explicit: we have to re-exec the dockerd binary so that it can accept parameters and execute the prestart hook related to network setup.
- Implicit way to designate the network namespace: instead of explicitly giving the netns to dockerd, we give it the PID of our runtime so that it can find the netns from this PID. This means we have to make sure we are in the right netns while calling the hook, otherwise the VETH pair will be created in the wrong netns.
- No results come back from the hook: we have to scan the network interfaces to discover which one has been created inside the netns. This introduces more latency because it forces us to scan the network in the CreateSandbox path, which is critical for starting the VM as quickly as possible.
Storage
Container workloads are shared with the virtualized environment through 9pfs. The devicemapper storage driver is a special case. The driver uses dedicated block devices rather than formatted filesystems, and operates at the block level rather than the file level. This knowledge has been used to directly use the underlying block device instead of the overlay file system for the container root file system. The block device maps to the top read-write layer for the overlay. This approach gives much better I/O performance compared to using 9pfs to share the container file system.
The approach above does introduce a limitation in terms of dynamic file copy in/out of the container via docker cp operations.
The copy operation from host to container accesses the mounted file system on the host side. This is not expected to work and may lead to inconsistencies as the block device will be simultaneously written to from two different mounts.
The copy operation from container to host will work, provided the user calls sync(1) from within the container prior to the copy to make sure any outstanding cached data is written to the block device.
docker cp [OPTIONS] CONTAINER:SRC_PATH HOST:DEST_PATH
docker cp [OPTIONS] HOST:SRC_PATH CONTAINER:DEST_PATH
The ability to hotplug block devices has been added, which makes it possible to use block devices for containers started after the VM has been launched.
How to check if a container uses a devicemapper block device as its rootfs
Start a container. Call mount(8) within the container. You should see / mounted on the /dev/vda device.
Devices
Support has been added to pass VFIO assigned devices on the docker command line with --device. Support for passing other devices, including block devices, with --device has not been added yet.
How to pass a device using VFIO-passthrough
- Requirements
An IOMMU group represents the smallest set of devices for which the IOMMU has visibility and which is isolated from other groups. VFIO uses this information to enforce safe ownership of devices for userspace.
You will need Intel VT-d capable hardware. Check if IOMMU is enabled in your host
kernel by verifying CONFIG_VFIO_NOIOMMU is not in the kernel configuration. If it is set,
you will need to rebuild your kernel.
The following kernel configuration options need to be enabled:
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO=m
CONFIG_VFIO_PCI=m
In addition, you need to pass intel_iommu=on on the kernel command line.
- Identify the BDF (Bus-Device-Function) of the PCI device to be assigned.
$ lspci -D | grep -e Ethernet -e Network
0000:01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
$ BDF=0000:01:00.0
- Find vendor and device id.
$ lspci -n -s $BDF
01:00.0 0200: 8086:1528 (rev 01)
- Find IOMMU group.
$ readlink /sys/bus/pci/devices/$BDF/iommu_group
../../../../kernel/iommu_groups/16
- Unbind the device from host driver.
$ echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
- Bind the device to vfio-pci.
$ sudo modprobe vfio-pci
$ echo 8086 1528 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
$ echo $BDF | sudo tee --append /sys/bus/pci/drivers/vfio-pci/bind
- Check /dev/vfio
$ ls /dev/vfio
16 vfio
- Start a Clear Containers container passing the VFIO group on the docker command line.
docker run -it --device=/dev/vfio/16 centos/tools bash
- Running lspci within the container should show the device among the PCI devices. The driver for the device needs to be present within the Clear Containers kernel. If the driver is missing, you can add it to your custom container kernel using the osbuilder tooling.
Developers
For information on how to build, develop and test virtcontainers, see the
developer documentation.
Persistent storage plugin support
See the persistent storage plugin documentation.
Experimental features
See the experimental features documentation.

