Files
kata-containers/virtcontainers
Jose Carlos Venegas Munoz ebb8fd576b versions: Update clh to latest master
Use latest master to enable memory hotplug.

Changes:

c1e6d00 ci: Add memory resizing use case to vhost-user tests
890582b ci: Factorize kernel command line
4de2584 ci: Fix mmio tests with direct kernel boot
f268246 ci: Factorize integration tests booting from vhost-user-blk
5a5b3cf ci: Factorize vhost-user-blk integration tests
dd8debf ci: Run vhost-user-blk tests for mmio builds
0c9c72c ci: Unify vhost-user-blk integration tests
c95851f ci: Run vhost-user-net tests for mmio transport
68293fc ci: Factorize vhost-user-net one step further
d75e745 vm-virtio: vhost-user: Send memory update to the backend
7ff82af vm-virtio: vhost-user: Factorize SET_MEM_TABLE setup
e54f8ec vmm: Update memory through DeviceManager
bc874a9 vm-virtio: Add update_memory() to VirtioDevice trait
93becca build(deps): bump backtrace from 0.3.45 to 0.3.46
feb8d7a vmm: Separate seccomp filters between VMM and API threads
5120c27 main: Add seccomp support
f1a23d7 vmm: api: Add seccomp to the HTTP API thread
db62cb3 vmm: Add seccomp filter to the VMM thread
cb98d90 vmm: Create new seccomp_filter module
708f02d vmm: Pull seccomp crate from Firecracker
18fbd30 vhost-user-fs: return correct result of fs_slave_io()
bbc385c devices: ioapic: Remove unused MsiMessage structure
2fc86ff dev_cli: Always pull the latest container image
4b462a5 Dockerfile: Add cpio and bsdtar to the container image
8acc15a build: Bump vm-memory and linux-loader dependencies
38ed560 build(deps): bump thiserror from 1.0.12 to 1.0.13
9f67de4 build(deps): bump proc-macro-hack from 0.5.12 to 0.5.14
ebab809 build(deps): bump thiserror from 1.0.11 to 1.0.12
c67e407 build(deps): bump syn from 1.0.16 to 1.0.17
bdcfe1e tests: Add "discard_writes" pmem test
7098602 tests: Make the test_virtio_pmem test use a temporary file
f7197e8 vmm: Add a "discard_writes=" to --pmem
d11a67b vmm: Use more generic MmapRegion constructor
7257e89 vmm: Add "readonly" parameter MemoryManager::create_userspace_mapping
03cb26c release: v0.6.0
3e9a39c github: Upload the ch-remote asset
c503118 vmm: fix a corrupted stack caused by get_win_size
0788600 build: Remove "pvh_boot" feature flag
477bc17 bin: Share VFIO device syntax between cloud-hypervisor and ch-remote
96be2db build(deps): bump serde_derive from 1.0.104 to 1.0.105
5a335fc build(deps): bump serde from 1.0.104 to 1.0.105
a31ffef openapi: Add hotplug_size for memory hotplug
87990f9 vmm: Add virtio-pci device to B/D/F hash table
fb185fa vmm: Always return PCI B/D/F from add_virtio_pci_device
462082c build(deps): bump arc-swap from 0.4.4 to 0.4.5
c821e96 vhost_user_fs: Implement support for FUSE_LSEEK
5aa9abc docs: Add document for vhost-user-net test with OVS/DPDK
6329219 vm-virtio: queue: Use a SeqCst fence on get_used_event
63eeed2 vm: Comment on the VM config update from memory hotplug
0895bcb build(deps): bump proc-macro-hack from 0.5.11 to 0.5.12
0541f5a build(deps): bump proc-macro-nested from 0.1.3 to 0.1.4
51f51ea build(deps): bump libc from 0.2.67 to 0.2.68
9cf67d1 arch: x86: Always set the bootloader type
ad35470 arch: x86: Extract common bootparams settings
28a5f9d vmm: acpi: Remove unused IORT related structures
5c1207c vhost-user-fs: handle FS_IO request
f61f78e build(deps): bump anyhow from 1.0.26 to 1.0.27
efb2447 pvh: Add integration test to validate PVH boot
da084fa pvh: Add unit tests for initial sregs and control registers
64941bf pvh: Add unit tests for start_info and memory map structures
9e247c4 pvh: Introduce "pvh_boot" feature
a22bc35 pvh: Write start_info structure to guest memory
840a9a9 pvh: Initialize vCPU regs/sregs for PVH boot
24f0e42 pvh: Introduce EntryPoint struct
98b9568 pvh: Add definitions for PVH boot protocol support
6e6ef83 build: Fix log dependency
291f1ce build(deps): bump linux-loader from `0c754f3` to `0ce5bfa`
07cc73b vhost_user_fs: add a flag to disable extended attributes
710520e vhost_user_fs: Process requests in parallel with a thread pool
90309b5 vm-virtio: queue: Add methods to switch a descriptor context
2294c2d Add .rustfmt.toml to the project
48c4885 vhost_user_fs: replace HandleData's File Mutex with RwLock
134e64c arch, qcow: Fix 1.42.0 clippy warnings
6ea85ca resources: Dockerfile: Update Rust toolchain
4579afa vmm: For --disk error if socket and path is specified
7e599b4 vmm: Make disk path optional
477d924 github: Build from a rust toolchain matrix
4f2469e main: Remove "--vhost-user-net"
8d785bb pci: Fix the PciBus using HashMap instead of Vec
04f2ccd build(deps): bump ryu from 1.0.2 to 1.0.3
02265bb build(deps): bump regex-syntax from 0.6.16 to 0.6.17
40b38a4 openapi: Make desired_ram int64 format
ca3b39c bin: Fix wrapping in help strings
ee1ba56 build: Use "wrap_help" feature for clap
3957d1e vhost_user_backend: call get_used_event from needs_notification
536323d vm-virtio: queue: hint that get_used_event should be inlined
401e1d2 vm-virtio: queue: fix a barrier comment at update_avail_event
e0bdfe8 vm-virtio: queue: add a missing memory barrier in get_used_event
df2570a resources: Simplify kernel config filename
9ab648b resources: Enable VIRTIO_MEM support
0339853 ci: Bump to kernel 5.6-rc4
abccf76 tests: Use ch-remote to add/remove devices in test_vfio
5c3ce9d tests: Extend ch-remote helper to support optional single argument
9a7d9c9 ch-remote: Support removing VFIO devices
0d53ba4 ch-remote: Support adding VFIO devices
babefbd main: Remove spurious second help line for "--device"
63c5d09 github: Trigger the build job on PRs
8cbb6d0 github: Replace Travis CI with github actions
efba48d vmm: Don't put a VFIO device behind the vIOMMU by default
34412c9 vmm: Add id option to VFIO hotplug
18dc916 vmm: Switch to the micro-http package
9023444 vmm: Add id field to --device through CLI
f4a956a vmm: Remove 32 bits MMIO range from correct address space
432eb5b vmm: Free PCI BARs when unplugging PCI device
f0dff8b vfio: pci: Remove KVM user memory region when cleaning up
34d1f43 vfio: pci: Implement free_bars() from the PciDevice trait
b8e1cf2 vm-allocator: Add new function to free 32 bits MMIO address space
f3dc245 pci: Extend PciDevice trait with new free_bars() method
911a2d6 tests: Use ch-remote to resize the VM
21160f7 ch-remote: Add "resize" command
bb2d04b ch-remote: Add support for sending a request body
bde4f73 ch-remote: Refactor HTTP response handling
6ed23bb build(deps): bump micro_http from `9bbde4f` to `6b3e5f0`
5edd812 build(deps): bump backtrace-sys from 0.1.33 to 0.1.34
f727714 ci: Add integration test for VFIO hot-unplug
b50cbe5 pci: Give PCI device ID back when removing a device
df71aae pci: Make the device ID allocation smarter
e514b12 vmm: Update VmConfig when removing VFIO device
81173bf vmm: Add id field to DeviceConfig structure
6cbdb9a vmm: api: Introduce new "remove-device" HTTP endpoint
991f3bb vmm: Remove VFIO device from everywhere it is referenced
6adebbc vmm: Detect when guest notifies about ejecting PCI device
0e21c32 devices: Add new method to remove all occurrences of a BusDevice
f8e2008 pci: Add a function to remove a PciDevice from the bus
08604ac vmm: Store PCI devices as Any devices from DeviceManager
0f99d3f vmm: Store VFIO device's name and its PCI b/d/f
13a61c4 build(deps): bump rand_chacha from 0.2.1 to 0.2.2
fcd605a build(deps): bump micro_http from `6d416af` to `9bbde4f`
30b6954 vm-virtio: Consume pause events to prevent infinite epoll_wait calls
16fd506 tests: Use new ch-remote for pause/resume integration test
ba8cd4d bin: Introduce "ch-remote" for controlling VMM
06cd31c build(deps): bump micro_http from `02def92` to `6d416af`
7e941c9 build(deps): bump linux-loader from `8cb7c66` to `0c754f3`

Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
2020-03-25 04:30:58 +00:00
..
2020-03-25 04:30:58 +00:00
2018-03-13 00:49:46 +01:00
2020-02-12 19:09:32 +00:00
2020-03-19 19:13:51 +00:00
2020-03-20 07:03:55 +00:00
2019-07-23 17:10:00 +08:00
2019-09-26 16:17:16 +02:00
2018-05-04 15:38:32 +08:00
2018-03-13 00:49:46 +01:00
2019-08-21 11:35:48 +08:00
2020-01-28 14:42:16 -08:00
2018-09-14 08:54:55 +08:00
2019-03-06 17:02:15 +01:00
2020-01-09 16:29:20 +01:00
2019-07-23 17:10:00 +08:00
2019-09-26 16:17:16 +02:00
2019-07-23 17:10:00 +08:00
2020-03-19 19:15:31 +00:00

Table of Contents

What is it?

virtcontainers is a Go library that can be used to build hardware-virtualized container runtimes.

Background

The few existing VM-based container runtimes (Clear Containers, runV, rkt's KVM stage 1) all share the same hardware virtualization semantics but use different code bases to implement them. virtcontainers's goal is to factorize this code into a common Go library.

Ideally, VM-based container runtime implementations would become translation layers from the runtime specification they implement (e.g. the OCI runtime-spec or the Kubernetes CRI) to the virtcontainers API.

virtcontainers was used as a foundational package for the Clear Containers runtime implementation.

Out of scope

Implementing a container runtime is out of scope for this project. Any tools or executables in this repository are only provided for demonstration or testing purposes.

virtcontainers and Kubernetes CRI

virtcontainers's API is loosely inspired by the Kubernetes CRI because we believe it provides the right level of abstractions for containerized sandboxes. However, despite the API similarities between the two projects, the goal of virtcontainers is not to build a CRI implementation, but instead to provide a generic, runtime-specification agnostic, hardware-virtualized containers library that other projects could leverage to implement CRI themselves.

Design

Sandboxes

The virtcontainers execution unit is a sandbox, i.e. virtcontainers users start sandboxes where containers will be running.

virtcontainers creates a sandbox by starting a virtual machine and setting the sandbox up within that environment. Starting a sandbox means launching all containers with the VM sandbox runtime environment.

Hypervisors

The virtcontainers package relies on hypervisors to start and stop virtual machine where sandboxes will be running. An hypervisor is defined by an Hypervisor interface implementation, and the default implementation is the QEMU one.

Update cloud-hypervisor client code

See docs

Agents

During the lifecycle of a container, the runtime running on the host needs to interact with the virtual machine guest OS in order to start new commands to be executed as part of a given container workload, set new networking routes or interfaces, fetch a container standard or error output, and so on. There are many existing and potential solutions to resolve that problem and virtcontainers abstracts this through the Agent interface.

Shim

In some cases the runtime will need a translation shim between the higher level container stack (e.g. Docker) and the virtual machine holding the container workload. This is needed for container stacks that make strong assumptions on the nature of the container they're monitoring. In cases where they assume containers are simply regular host processes, a shim layer is needed to translate host specific semantics into e.g. agent controlled virtual machine ones.

Proxy

When hardware virtualized containers have limited I/O multiplexing capabilities, runtimes may decide to rely on an external host proxy to support cases where several runtime instances are talking to the same container.

API

The high level virtcontainers API is the following one:

Sandbox API

  • CreateSandbox(sandboxConfig SandboxConfig) creates a Sandbox. The virtual machine is started and the Sandbox is prepared.

  • DeleteSandbox(sandboxID string) deletes a Sandbox. The virtual machine is shut down and all information related to the Sandbox are removed. The function will fail if the Sandbox is running. In that case StopSandbox() has to be called first.

  • StartSandbox(sandboxID string) starts an already created Sandbox. The Sandbox and all its containers are started.

  • RunSandbox(sandboxConfig SandboxConfig) creates and starts a Sandbox. This performs CreateSandbox() + StartSandbox().

  • StopSandbox(sandboxID string) stops an already running Sandbox. The Sandbox and all its containers are stopped.

  • PauseSandbox(sandboxID string) pauses an existing Sandbox.

  • ResumeSandbox(sandboxID string) resume a paused Sandbox.

  • StatusSandbox(sandboxID string) returns a detailed Sandbox status.

  • ListSandbox() lists all Sandboxes on the host. It returns a detailed status for every Sandbox.

Container API

  • CreateContainer(sandboxID string, containerConfig ContainerConfig) creates a Container on an existing Sandbox.

  • DeleteContainer(sandboxID, containerID string) deletes a Container from a Sandbox. If the Container is running it has to be stopped first.

  • StartContainer(sandboxID, containerID string) starts an already created Container. The Sandbox has to be running.

  • StopContainer(sandboxID, containerID string) stops an already running Container.

  • EnterContainer(sandboxID, containerID string, cmd Cmd) enters an already running Container and runs a given command.

  • StatusContainer(sandboxID, containerID string) returns a detailed Container status.

  • KillContainer(sandboxID, containerID string, signal syscall.Signal, all bool) sends a signal to all or one container inside a Sandbox.

An example tool using the virtcontainers API is provided in the hack/virtc package.

For further details, see the API documentation.

Networking

virtcontainers supports the 2 major container networking models: the Container Network Model (CNM) and the Container Network Interface (CNI).

Typically the former is the Docker default networking model while the later is used on Kubernetes deployments.

CNM

High-level CNM Diagram

CNM lifecycle

  1. RequestPool

  2. CreateNetwork

  3. RequestAddress

  4. CreateEndPoint

  5. CreateContainer

  6. Create config.json

  7. Create PID and network namespace

  8. ProcessExternalKey

  9. JoinEndPoint

  10. LaunchContainer

  11. Launch

  12. Run container

Detailed CNM Diagram

Runtime network setup with CNM

  1. Read config.json

  2. Create the network namespace (code)

  3. Call the prestart hook (from inside the netns) (code)

  4. Scan network interfaces inside netns and get the name of the interface created by prestart hook (code)

  5. Create bridge, TAP, and link all together with network interface previously created (code)

  6. Start VM inside the netns and start the container (code)

Drawbacks of CNM

There are three drawbacks about using CNM instead of CNI:

  • The way we call into it is not very explicit: Have to re-exec dockerd binary so that it can accept parameters and execute the prestart hook related to network setup.
  • Implicit way to designate the network namespace: Instead of explicitly giving the netns to dockerd, we give it the PID of our runtime so that it can find the netns from this PID. This means we have to make sure being in the right netns while calling the hook, otherwise the VETH pair will be created with the wrong netns.
  • No results are back from the hook: We have to scan the network interfaces to discover which one has been created inside the netns. This introduces more latency in the code because it forces us to scan the network in the CreateSandbox path, which is critical for starting the VM as quick as possible.

Storage

Container workloads are shared with the virtualized environment through 9pfs. The devicemapper storage driver is a special case. The driver uses dedicated block devices rather than formatted filesystems, and operates at the block level rather than the file level. This knowledge has been used to directly use the underlying block device instead of the overlay file system for the container root file system. The block device maps to the top read-write layer for the overlay. This approach gives much better I/O performance compared to using 9pfs to share the container file system.

The approach above does introduce a limitation in terms of dynamic file copy in/out of the container via docker cp operations. The copy operation from host to container accesses the mounted file system on the host side. This is not expected to work and may lead to inconsistencies as the block device will be simultaneously written to, from two different mounts. The copy operation from container to host will work, provided the user calls sync(1) from within the container prior to the copy to make sure any outstanding cached data is written to the block device.

docker cp [OPTIONS] CONTAINER:SRC_PATH HOST:DEST_PATH
docker cp [OPTIONS] HOST:SRC_PATH CONTAINER:DEST_PATH

Ability to hotplug block devices has been added, which makes it possible to use block devices for containers started after the VM has been launched.

How to check if container uses devicemapper block device as its rootfs

Start a container. Call mount(8) within the container. You should see / mounted on /dev/vda device.

Devices

Support has been added to pass VFIO assigned devices on the docker command line with --device. Support for passing other devices including block devices with --device has not been added added yet.

How to pass a device using VFIO-passthrough

  1. Requirements

IOMMU group represents the smallest set of devices for which the IOMMU has visibility and which is isolated from other groups. VFIO uses this information to enforce safe ownership of devices for userspace.

You will need Intel VT-d capable hardware. Check if IOMMU is enabled in your host kernel by verifying CONFIG_VFIO_NOIOMMU is not in the kernel configuration. If it is set, you will need to rebuild your kernel.

The following kernel configuration options need to be enabled:

CONFIG_VFIO_IOMMU_TYPE1=m 
CONFIG_VFIO=m
CONFIG_VFIO_PCI=m

In addition, you need to pass intel_iommu=on on the kernel command line.

  1. Identify BDF(Bus-Device-Function) of the PCI device to be assigned.
$ lspci -D | grep -e Ethernet -e Network
0000:01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

$ BDF=0000:01:00.0
  1. Find vendor and device id.
$ lspci -n -s $BDF
01:00.0 0200: 8086:1528 (rev 01)
  1. Find IOMMU group.
$ readlink /sys/bus/pci/devices/$BDF/iommu_group
../../../../kernel/iommu_groups/16
  1. Unbind the device from host driver.
$ echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
  1. Bind the device to vfio-pci.
$ sudo modprobe vfio-pci
$ echo 8086 1528 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
$ echo $BDF | sudo tee --append /sys/bus/pci/drivers/vfio-pci/bind
  1. Check /dev/vfio
$ ls /dev/vfio
16 vfio
  1. Start a Clear Containers container passing the VFIO group on the docker command line.
docker run -it --device=/dev/vfio/16 centos/tools bash
  1. Running lspci within the container should show the device among the PCI devices. The driver for the device needs to be present within the Clear Containers kernel. If the driver is missing, you can add it to your custom container kernel using the osbuilder tooling.

Developers

For information on how to build, develop and test virtcontainers, see the developer documentation.

Persistent storage plugin support

See the persistent storage plugin documentation.

Experimental features

See the experimental features documentation.