Commit Graph

3686 Commits

Author SHA1 Message Date
Pradipta Banerjee
39e8c84269 runtime: Add support for key annotations to remote hyp
In order to support different pod VM instance type via
remote hypervisor implementation (cloud-api-adaptor),
we need to pass machine_type, default_vcpus
and default_memory annotations to cloud-api-adaptor.

The cloud-api-adaptor then uses these annotations to spin
up the appropriate cloud instance.

Reference PR for cloud-api-adaptor
https://github.com/confidential-containers/cloud-api-adaptor/pull/1088

Fixes: #7140
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
(based on commit 004f07f076)
2023-11-17 13:33:27 +00:00
Yohei Ueda
2910e333a8 runtime: Use static resource in remote hypervisor
This patch updates the template configuration file for
the remote hypervisor to set static_sandbox_resource_mgmt
to be true.  The remote hypervisor uses the peer pod config
to determine the sandbox size, so requires this to be set to
true by default.

Fixes: #6616
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(based on commit 938447803b)
2023-11-17 13:33:27 +00:00
stevenhorsman
26d56678a9 config: Add initial remote hypervisor config
- Remote hypervisor template config
- Add annotation enablement for machine_type, default_memory and
default_vcpus for flexible instance types

Fixes: #6349
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
(based on commits 7c9a791d67
and 335a456425)
2023-11-17 13:33:24 +00:00
stevenhorsman
ad63439a3e runtime: Update the remote hypervisor config
Add the SELinux setting to ensure it is passed through to the remote
hypervisor

Fixes: #5936

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
(based on commit 3ef2fd1784)
2023-11-17 13:32:52 +00:00
Lei Li
50e0d43dad runtime: Support privileged containers in peer pod VM
This patch fixes the issue of running containers
with privileged as true.

See the discussion at this URL for the details.
https://github.com/confidential-containers/cloud-api-adaptor/issues/111

Signed-off-by: Lei Li <cdlleili@cn.ibm.com>
(based on commit c3e6b66051)
2023-11-17 13:32:52 +00:00
Yohei Ueda
57d4dd8e57 runtime: Support the remote hypervisor type
This patch adds the support of the remote hypervisor type.
Shim opens a Unix domain socket specified in the config file,
and sends TTPRC requests to a external process to control
sandbox VMs.

Fixes #4482

Co-authored-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(based on commit f9278f22c3)
2023-11-17 13:32:49 +00:00
Yohei Ueda
8ac9a22097 runtime: Add hypervisor proto to support peer pod VMs
This patch adds a protobuf definiton of the remote hypervisor type.

Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
(based on commit 150e8aba6d)
2023-11-17 13:31:09 +00:00
Liu Wenyuan
c77e990c3e tests: Enable tests for StratoVirt hypervisor
This commit enables StratoVirt hypervisor to be tested in kata GHA,
incluing k8s, metrics, cri-containerd, nydus and so on.

Meanwhile, adding some unit tests for StratoVirt to make sure it works.

Fixes: #7794

Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
2023-11-16 20:47:26 +08:00
Liu Wenyuan
9542211e71 configuration: add configuration for StratoVirt hypervisor.
Add configuration-stratovirt.toml.in to generate the StratoVirt configuration,
and parser to deliver config to StratoVirt.

Fixes: #7794

Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
2023-11-16 20:47:26 +08:00
Liu Wenyuan
561c85be54 build: Makefile for StratoVirt hypervisor
Add support for building StratoVirt hypervisor, including x86_64 and
arm64.

Fixes: #7794

Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
2023-11-16 20:47:26 +08:00
Liu Wenyuan
26966c8469 virtcontainers: Add StratoVirt as a supported hypervisor
Initial support of the MicroVM machine type of StratoVirt
hypervisor for the kata go runtime.

Fixes: #7794

Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
2023-11-16 20:47:24 +08:00
Xuewei Niu
f18794d880 Merge pull request #8426 from justxuewei/vhost-rm-virtio-net
dragonball: Remove vhost-net dependency on virtio-net
2023-11-15 10:39:27 +08:00
Fabiano Fidêncio
fd9b6d6837 Merge pull request #7623 from fidencio/topic/runtime-improve-vcpu-allocation-on-host-side
runtime: Improve vCPU allocation for the VMMs
2023-11-14 14:10:54 +01:00
Xuewei Niu
49c2e6e23c dragonball: Remove vhost-net dependency on virtio-net
This patch is to remove vhost-net dependency on virtio-net for
dbs-virtio-devices crate. Then, the feature of vhost-net is able to enable
without enabling virtio-net device, error, etc.

Fixes: #8423

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-14 15:35:10 +08:00
James O. D. Hunt
7f666f783d runtime-rs: ch: Fix TDX
PR #8311 inadvertently broke the runtime-rs / Cloud Hypervisor TDX
handling. It also introduced unrecoverable failure scenarios. Hence,
replace slow, fallible regex matching in logging fast path with single pass
non-failing multi-string log level matching.

Also, added a unit test for `parse_ch_log_level()`.

Fixes: #8418.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
2023-11-13 08:49:47 +00:00
Xuewei Niu
0a9125e629 Merge pull request #7675 from justxuewei/vhost-net 2023-11-12 20:38:18 +08:00
Xuewei Niu
d1deaf0538 dragonball: Minor changes for a comment from Bian
- Add feature control for InsertNetworkDevice.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-12 14:14:10 +08:00
Xuewei Niu
e4f83e27c4 dragonball: vhost-net set_offload with acked features
set_offload() for tap devices depends on acked features.

Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-12 14:10:39 +08:00
Xuewei Niu
6cd572dbbb dragonball: Minor changes for Chao's comments
- Remove two panic statements from InsertNetworkDevice test.
- Rename `NUM_QUEUES` to `DEFAULT_NUM_QUEUES`, `QUEUE_SIZE` to
  `DEFAULT_QUEUE_SIZE` for vhost-net and virtio-net.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-12 14:10:39 +08:00
Xuewei Niu
dcdf3c6556 runtime-rs: Supply missing fields of NetworkConfig
`test_networkconfig_to_netconfig` from clh depends on `NetworkConfig` which
has some new fields in this PR. Therefore, this commit gives the test
missing fields.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-12 14:10:39 +08:00
Xuewei Niu
58e9709c1f dragonball: Changes for ZizhengBian's comments
- Dragonball's vhost-net feature not depends on virtio-net feature.
- Remove `TapError` from dbs-virtio-devices's Error, and add `VirtioNet`
  and `VhostNet` two fields.
- Downgrade visiblity of two fields of `VhostNetDeviceMgr` from
  `pub(crate)`.
- File an issue to record a todo for network rate limiter.
- Print internal errors with `{0:?}.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-12 14:10:33 +08:00
Fabiano Fidêncio
5e9cf75937 vc: utils: Rename CalculateMilliCPUs() to CalculateCPUsF()
With the change done in the last commit, instead of calculating milli
cpus, we're actually converting the CPUs to a fraction number, a float.

Let's update the function name (and associated vars) to represent that
change.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-11-10 18:26:01 +01:00
Fabiano Fidêncio
e477ed0e86 runtime: Improve vCPU allocation for the VMMs
First of all, this is a controversial piece, and I know that.

In this commit we're trying to make a less greedy approach regards the
amount of vCPUs we allocate for the VMM, which will be advantageous
mainly when using the `static_sandbox_resource_mgmt` feature, which is
used by the confidential guests.

The current approach we have basically does:
* Gets the amount of vCPUs set in the config (an integer)
* Gets the amount of vCPUs set as limit (an integer)
* Sum those up
* Starts / Updates the VMM to use that total amount of vCPUs

The fact we're dealing with integers is logical, as we cannot request
500m vCPUs to the VMMs.  However, it leads us to, in several cases, be
wasting one vCPU.

Let's take the example that we know the VMM requires 500m vCPUs to be
running, and the workload sets 250m vCPUs as a resource limit.

In that case, we'd do:
* Gets the amount of vCPUs set in the config: 1
* Gets the amount of vCPUs set as limit: ceil(0.25)
* 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
* Starts / Updates the VMM to use 2 vCPUs

With the logic changed here, what we're doing is considering everything
as float till just before we start / update the VMM. So, the flow
describe above would be:
* Gets the amount of vCPUs set in the config: 0.5
* Gets the amount of vCPUs set as limit: 0.25
* ceil(0.5 + 0.25) = 1 vCPUs
* Starts / Updates the VMM to use 1 vCPUs

In the way I've written this patch we introduce zero regressions, as
the default values set are still the same, and those will only be
changed for the TEE use cases (although I can see firecracker, or any
other user of `static_sandbox_resource_mgmt=true` taking advantage of
this).

There's, though, an implicit assumption in this patch that we'd need to
make explicit, and that's that the default_vcpus / default_memory is the
amount of vcpus / memory required by the VMM, and absolutely nothing
else.  Also, the amount set there should be reflected in the
podOverhead for the specific runtime class.

One other possible approach, which I am not that much in favour of
taking as I think it's **less clear**, is that we could actually get the
podOverhead amount, subtract it from the default_vcpus (treating the
result as a float), then sum up what the user set as limit (as a float),
and finally ceil the result.  It could work, but IMHO this is **less
clear**, and **less explicit** on what we're actually doing, and how the
default_vcpus / default_memory should be used.

Fixes: #6909

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2023-11-10 18:25:57 +01:00
Fabiano Fidêncio
b0157ad73a runtime: confidential: Do not set the max_vcpu to cpu
We don't have to do this since we're relying on the
`static_sandbox_resource_mgmt` feature, which gives us the correct
amount of memory and CPUs to be allocated.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-11-10 12:58:20 +01:00
Chao Wu
a62fb83c91 Merge pull request #8169 from openanolis/chao/fix_typo_shm
runtime-rs: fix a typo in shm
2023-11-10 14:00:11 +08:00
gaohuatao
78df1bb851 agent: update AGENT_THREADS metrics value
Fixes: #8369

Signed-off-by: gaohuatao <gaohuatao@bytedance.com>
2023-11-10 10:39:57 +08:00
Chao Wu
afb002c25c runtime-rs: fix a typo in shm
is_shim_volume should be is_shm_volume in shm_volume mod.

fixes: #8168
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
2023-11-10 10:36:58 +08:00
Archana Shinde
1611723465 Merge pull request #8379 from likebreath/1103/clh_v36.0
Upgrade to Cloud Hypervisor v36.0
2023-11-08 21:10:41 -08:00
Archana Shinde
268d4d622f Merge pull request #8389 from justxuewei/vm-capable-test
runtime: Fix TestCheckHostIsVMContainerCapable unstablity issue
2023-11-08 12:14:04 -08:00
Archana Shinde
92a517156c Merge pull request #8367 from amshinde/add-nerdctl-ipvlan-test
network: Fix network hotplug for ipvlan and macvlan endpoints for qemu and add tests
2023-11-08 11:45:13 -08:00
Chelsea Mafrica
83e731328f Merge pull request #8023 from cmaf/runtime-rs-ch-pause-resume
runtime-rs: Update status for pause and resume
2023-11-08 11:34:47 -08:00
Xuewei Niu
acd9057c7b runtime: Fix TestCheckHostIsVMContainerCapable unstablity issue
TestCheckHostIsVMContainerCapable removes sysModuleDir to simulate a
case that the kernel modules are not loaded. However,
checkKernelModules() executes modprobe <module> if a module not
found in that directory. Loading those modules is required to be denied
temporarily.

Fixes: #8390

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-08 22:40:08 +08:00
Fupan Li
100a73d2fd Merge pull request #7531 from justxuewei/device-cgroup
agent: Restrict device access at upper node of container's cgroup
2023-11-08 22:01:48 +08:00
Xuewei Niu
023d8dc01e agent: Changes according to Pan's comments
- Disable device cgroup restriction while pod cgroup is not available.
- Remove balcklist-related names and change whitelist-related names to
  allowed_all.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-08 09:39:08 +08:00
Xuewei Niu
b5f3a8cb39 agent: Fix container launching failure with systemd cgroup
FSManager of systemd cgroup manager is responsible for setting up cgroup
path. The container launching will be failed if the FSManager is in
read-only mode.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-08 09:39:07 +08:00
Xuewei Niu
6477825195 agent: Minor changes according to Zhou's comments
The changes include:

- Change to debug logging level for resources after processed.
- Remove a todo for pod cgroup cleanup.
- Add an anyhow context to `get_paths_and_mounts()`.
- Remove code which denys access to VMROOTFS since it won't take effect. If
  blackmode is in use, the VMROOTFS will be denyed as default. Otherwise,
  device cgroups won't be updated in whitelist mode.
- Add a unit test for `default_allowed_devices()`.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-08 09:39:07 +08:00
Xuewei Niu
cec8044744 agent: Make devcg_info optional for LinuxContainer::new()
The runk is a standard OCI runtime that isnt' aware of concept of sandbox.
Therefore, the `devcg_info` argument of `LinuxContainer::new()` is
unneccessary to be provided.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-08 09:39:07 +08:00
Xuewei Niu
ef4c3844a3 agent: Restrict device access at upper node of container's cgroup
The target is to guarantee that containers couldn't escape to access extra
devices, like vm rootfs, etc.

Assume that there is a cgroup, such as `/A/B`. The `B` is container cgroup,
and the `A` is what we called pod cgroup. No matter what permissions are
set for the container (`B`), the `A`'s permission is always `a *:* rwm`. It
leads that containers could acquire permission to access to other devices
in VM that not belongs to themselves.

In order to set devices cgroup properly, the order of setting cgroups is
that the pod cgroup comes first and the container cgroup comes after.

The `Sandbox` has a new field, `devcg_info`, to save cgroup states. To
avoid setting container cgroup too early, an initialization should be done
carefully. `inited`, one of the states, is a boolean to indicate if the pod
cgroup is initialized. If no, the pod cgroup should be created firstly, and
set default permissions. After that, the pause container cgroup is created
and inherits the permissions from the pod cgroup.

If whitelist mode which allows containers to access all devices in VM is
enabled,  then device resources from OCI spec are ignored.

This feature not supports systemd cgroup and cgroup v2, since:

- Systemd cgroup implemented on Agent hasn't supported devices subsystem so
  far, see: https://github.com/kata-containers/kata-containers/issues/7506.
- Cgroup v2's device controller depends on eBPF programs, which is out of
  scope of cgroup.

Fixes: #7507

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-08 09:39:07 +08:00
Archana Shinde
a6272733e7 network: Fix network hotplug for ipvlan and macvlan endpoints.
Since moving from network coldplug to hotplug, the only case verified
was veth endpoints. Support for network hotplug for ipvlan and macvlan was
broken/not added. Fix it.

Fixes: #8391

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
2023-11-07 10:13:51 -08:00
James O. D. Hunt
59d0d4caff runtime-rs: ch: Simplify VSOCK error handling
Remove the redundant `VmConfigError::EmptyVsockSocketPath` error from
the Cloud Hypervisor config crate since this scenario is already handled
by the `VsockConfigError::NoVsockSocketPath` error.

Fixes: #8385.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
2023-11-07 17:45:38 +00:00
James O. D. Hunt
bdb83f8282 runtime-rs: ch: Remove unused function
Remove the redundant `parse_mac()` function: this was never used and we
already have an implementation in `crates/resource/src/network/utils/mod.rs`.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
2023-11-07 17:45:38 +00:00
Xuewei Niu
8ea87405ed runtime-rs: Remove virtio config from Backend
Virtio-net and vhost-net share a common virtio config, and vhost-user-net
uses another config, named `VhostUserConfig`. Thus, the virtio config could
be added into `NetworkConfig` instead of `Backend`.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-07 19:35:02 +08:00
Xuewei Niu
ad66378bf5 runtime-rs: Move Dragonball stuff out of device drivers
Moving Dragonball structs convertions out of device drivers to keep driver
neutral. The convertions include `NetworkBackend` to
`DragonballNetworkBackend` and `NetworkConfig` to
`DragonballNetworkConfig`.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-07 19:35:02 +08:00
Xuewei Niu
3e0614cdf0 dragonball: Minor changes to comments
Changes include:

- Merge `VhostNetDeviceError` import item.
- Replace if with match in `add_vhost_net_device()`

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-07 19:35:02 +08:00
Xuewei Niu
a047331a34 runtime-rs: Network config distinguishes backends
Network backends determine the virtio dataplane implementations. Common
protocols include virtio-net, vhost-net and vhost-user-net, etc. Network
config has a new field named `backend` to specify which protocol to use.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-07 19:35:02 +08:00
Xuewei Niu
9203371833 dragonball: Introduce vhost-net device
PLEASE NOTE THAT this pull request just implements vhost-net support for
Dragonball, and adaptation for the Runtime-rs. And this pull request
DOESN'T provide an item to config which backend to use. To sum up,
virtio-net as a default backend is only choice for the user so far.

This pull request introduces vhost-net device for the Dragonball. In
addition, this pull request includes changes of Runtime-rs to improve
network configuration abilities.

The Dragonball part implements a vhost-net device and a vhost-net device
manager, named `VhostNetDeviceMgr`, to manage vhost-net device.
`NetworkInterfaceConfig` is introduced as a high-level abstract for network
config. Then, the Dragonball is able to distinguish network backends, e.g.
virtio-net, vhost-net, vhost-user-net(WIP), etc.

The Runtime-rs part adds support of multiple network backends as well.
`NetworkConfig` has a couple of new fields, like `backend`,
`use_shared_irq`, etc. And Dragonball's network config structs are
implmented `From` trait which allow to be converted from the Runtime-rs's
network config conveniently.

Fixes: #7674

Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: wllenyj <wllenyj@linux.alibaba.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2023-11-07 19:35:02 +08:00
Beraldo Leal
dd530ba8ee tests: fixes AMD errors
TestCheckHostIsVMContainerCapable is failing on AMD machines.
kata-check_amd64_test.go:96 has no AMD modules, also getCPUType is
missing.

Fixes #8384.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-11-06 16:49:59 +00:00
Beraldo Leal
7641c19f74 runtime: bump containerd for gogo deprecation
This update includes necessary changes due to the version bump of
containerd and its dependencies. It's part of a broader initiative to
phase out gogo protobuf, which has been deprecated, and to align with
the current supported libraries.

Fixes #7420.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-11-06 16:49:59 +00:00
Beraldo Leal
16fa2c39e6 protocols: replace gogo/types.Empty and Any
by Google versions.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-11-06 16:49:58 +00:00
Beraldo Leal
c61f4a8592 protocols: remove unused fieldpath option
The +fieldpath option, specific to gogoprotobuf, enabled dynamic field
access in protobuf messages, allowing nested fields to be accessed via
string paths.

This change is part of a larger effort to transition to the official Go
protobuf library for better maintainability and community support.
Upon review, no instances of dynamic field access were found in the
codebase, confirming that the feature is not in use.

By removing this unused feature, we simplify the build process and make
it easier to complete the transition away from gogoprotobuf.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-11-06 16:49:58 +00:00