Users have noticed that this is needed, as CLH does not yet implement a
way to hotplug resources on aarh64.
With this patch, when building for x86_64, I can see the this is the
resulting config:
```
$ ARCH=amd64 make
...
$ cat config/configuration-clh.toml | grep static_sandbox_resource_mgmt
static_sandbox_resource_mgmt=false
```
And when building for aarch64:
```
$ ARCH=arm64 make
...
$ cat config/configuration-clh.toml | grep static_sandbox_resource_mgmt
static_sandbox_resource_mgmt=true
```
Fixes: #7941
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 72599f1911)
Signed-off-by: Greg Kurz <groug@kaod.org>
PR #6146 added the possibility to control QEMU with an extra HMP socket
as an aid for debugging. This is great for development or bug chasing
but this raises some concerns in production.
The HMP monitor allows to temper with the VM state in a variety of ways.
This could be intentionally or mistakenly used to inject subtle bugs in
the VM that would be extremely hard if not even impossible to debug. We
definitely don't want that to be enabled by default.
The feature is currently wired to the `enable_debug` setting in the
`[hypervisor.qemu]` section of the configuration file. This setting has
historically been used to control "debug output" and it is used as such
by some downstream users (e.g. Openshift). Forcing people to have the
extra HMP backdoor at the same time is abusive and dangerous.
A new `extra_monitor_socket` is added to `[hypervisor.qemu]` to give
fine control on whether the HMP socket is wanted or not. This setting
is still gated by `enable_debug = true` to make it clear it is for
debug only. The default is to not have the HMP socket though. This
isn't backward compatible with #6416 but it is for the sake of "better
safe than sorry".
An extra monitor socket makes the QEMU instance untrusted. A warning is
thus logged to the journal when one is requested.
While here, also allow the user to choose between HMP and QMP for the
extra monitor socket. Motivation is that QMP offers way more options to
control or introspect the VM than HMP does. Users can also ask for
pretty json formatting well suited for human reading. This will improve
the debugging experience.
This feature is only made visible in the base and GPU configurations
of QEMU for now.
Fixes#7952
Signed-off-by: Greg Kurz <groug@kaod.org>
(cherry picked from commit 1f16b6627b)
Signed-off-by: Greg Kurz <groug@kaod.org>
by enabling IOMMU on the default PCI segment. For hotplug to work we need a
virtualized iommu and clh exposes one if there is some device or PCI segment
that requests it. I would have preferred to add a separate PCI segment for
hotplugging vfio devices but unfortunately kata assumes there is only one
segment all over the place. See create_pci_root_bus_path(),
split_vfio_pci_option() and grep for '0000'.
Enabling the IOMMU on the default PCI segment requires passing enabling IOMMU on
every device that is attached to it, which is why it is sprinkled all over the
place.
CLH does not support IOMMU for VirtioFs, so I've added a non IOMMU segment for
that device.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
(cherry picked from commit 3a1db7a86b)
hot_plug_vfio needs to be set to root-port, otherwise attaching vfio devices to
CLH VMs fails. Either cold_plug_vfio or hot_plug_vfio is required, and we have
not implemented support for cold_plug_vfio in CLH yet.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
(cherry picked from commit 509771e6f5)
Currently, even when using devmapper, if the VMM supports virtio-fs /
virtio-9p, that's used to share a few files between the host and the
guest.
This *needed*, as we need to share with the guest contents like secrets,
certificates, and configurations, via Kubernetes objects like configMaps
or secrets, and those are rotated and must be updated into the guest
whenever the rotation happens.
However, there are still use-cases users can live with just copying
those files into the guest at the pod creation time, and for those
there's absolutely no need to have a shared filesystem process running
with no extra obvious benefit, consuming memory and even increasing the
attack surface used by Kata Containers.
For the case mentioned above, we should allow users, making it very
clear which limitations it'll bring, to run Kata Containers with
devmapper without actually having to use a shared file system, which is
already the approach taken when using Firecracker as the VMM.
Fixes: #7207
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The `-o` option is the legacy way to configure virtiofsd, inherited
from the C implementation. The rust implementation honours it for
compatibility but it logs deprecation warnings.
Let's use the replacement options in the go shim code. Also drop
references to `-o` from the configuration TOML file.
Fixes#7111
Signed-off-by: Greg Kurz <groug@kaod.org>
When this option is enabled the runtime will attempt to determine the
appropriate sandbox size (memory, CPU) before booting the virtual
machine.
As TEEs do not support memory and CPU hotplug, this approach must be
used.
Fixes: #6818
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's specifically name the `gpu` runtime class as `nvidia-gpu`. By
doing this we keep the door open and ease the life of the next vendor
adding GPU support for Kata Containers.
Fixes: #6553
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
SNP requires many specific configurations, so let's make
a new SNP configuration file that we can use with the
kata-qemu-snp runtime class.
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
Signed-off-by: Alex Carter <Alex.Carter@ibm.com>
Adding config file that can be used with qemu-sev runtime class.
Since SEV has limited hotplug support, increase
the pod overhead to account for fixed resource usage.
Fixes: #6572
Signed-off-by: Unmesh Deodhar <udeodhar@amd.com>
We need to set hotplug on pci root port and enable at least one
root port. Also set the guest-hooks-dir to the correct path
Fixes: #6675
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
When testing on AKS, we've been hitting the dial_timeout every now and
then. Let's increase it to 45 seconds (instead of 30) for all the VMMs,
and to 60 seconfs in case of TEEs.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As the QEMU configuration for TDX differs quite a lot from the normal
QEMU configuration, let's add a new configuration file for the QEMU TDX.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This change enables to run cloud-hypervisor VMM using a non-root user
when rootless flag is set true in the configuration
Fixes: #2567
Signed-off-by: Feng Wang <fwang@confluent.io>
For kata containers, rootfs is used in the read-only way.
EROFS can noticably decrease metadata overhead.
On the basis of supporting the EROFS file system, it supports using the config parameter to switch the file system used by rootfs.
Fixes: #6063
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: yaoyinnan <yaoyinnan@foxmail.com>
Pass SELinux policy for containers to the agent if `disable_guest_selinux`
is set to `false` in the runtime configuration. The `container_t` type
is applied to the container process inside the guest by default.
Users can also set a custom SELinux policy to the container process using
`guest_selinux_label` in the runtime configuration. This will be an
alternative configuration of Kubernetes' security context for SELinux
because users cannot specify the policy in Kata through Kubernetes's security
context. To apply SELinux policy to the container, the guest rootfs must
be CentOS that is created and built with `SELINUX=yes`.
Fixes: #4812
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
The default vhost-user-fs queue-size of qemu is 128 now. Set it to 1024
by default which is same as clh. Also make this value configurable.
Fixes: #5694
Signed-off-by: liyuxuan.darfux <liyuxuan.darfux@bytedance.com>
This is based on a patch from @niteeshkd that adds a config
parameter to choose between AMD SEV and SEV-SNP VMs as the
confidential guest type in case both types are supported. SEV is
the default.
Signed-off-by: Joana Pecholt <joana.pecholt@aisec.fraunhofer.de>
Adds initrd configuration option to the configuration.toml that is
generated for the setup using QEMU.
Signed-off-by: Joana Pecholt <joana.pecholt@aisec.fraunhofer.de>
When booting the TDX kernel with `tdx_disable_filter`, as it's been done
for QEMU, VirtioFS can work without any issues.
Whether this will be part of the upstream kernel or not is a different
story, but it easily could make it there as Cloud Hypervisor relies on
the VIRTIO_F_IOMMU_PLATFORM feature, which forces the guest to use the
DMA API, making these devices compatible with TDX.
See Sebastien Boeuf's explanation of this in the
3c973fa7ce208e7113f69424b7574b83f584885d commit:
"""
By using DMA API, the guest triggers the TDX codepath to share some of
the guest memory, in particular the virtqueues and associated buffers so
that the VMM and vhost-user backends/processes can access this memory.
"""
Fixes: #4977
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This configuration will allow users to choose between different
I/O backends for qemu, with the default being io_uring.
This will allow users to fallback to a different I/O mechanism while
running on kernels olders than 5.1.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Enable Kata runtime to handle `disable_selinux` flag properly in order
to be able to change the status by the runtime configuration whether the
runtime applies the SELinux label to VMM process.
Fixes: #4599
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Expose the newly added `default_maxmemory` to the project's Makefile and
to the configuration files.
Fixes: #4516
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Enable "-sandbox on" in qemu can introduce another protect layer
on the host, to make the secure container more secure.
The default option is disable because this feature may introduce some
performance cost, even though user can enable
/proc/sys/net/core/bpf_jit_enable to reduce the impact.
Fixes: #2266
Signed-off-by: Feng Wang <feng.wang@databricks.com>
With everything implemented, let's now expose the disk rate limiter
configuration options in the Cloud Hypervisor configuration file.
Fixes: #4139
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
With everything implemented, let's now expose the net rate limiter
configuration options in the Cloud Hypervisor configuration file.
Fixes: #4017
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
virtiofsd's debug will be enabled if hypervisor's debug has been
enabled, this will generate too many noisy logs from virtiofsd.
Unbind the relationship of log level between virtiofsd and
hypervisor, if users want to see debug log of virtiofsd,
can set it by:
virtio_fs_extra_args = ["-o", "log_level=debug"]
Fixes: #3303
Signed-off-by: bin <bin@hyper.sh>
This configuration option is valid for all the hypervisor that are going
to be used with the confidential containers effort, thus exposing the
configuration option for Cloud Hypervisor as well.
Fixes: #4022
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This change introduces the `disable_guest_empty_dir` config option,
which allows the user to change whether a Kubernetes emptyDir volume is
created on the guest (the default, for performance reasons), or the host
(necessary if you want to pass data from the host to a guest via an
emptyDir).
Fixes#2053
Signed-off-by: Evan Foster <efoster@adobe.com>
kata-containers/pulls#3771 added TDX support for Cloud Hypervisor, but
two big things got overlooked while doing that.
1. virtio-fs, as of now, cannot be part of the trust boundary, so the
Confidential Guest will not be using it.
2. virtio-block hotplug should be enabled in order to use virtio-block
for the rootfs (used with the devmapper plugin).
When trying to use cloud-hypervisor with TDX using virtio-fs, we're
facing the following error on the guest kernel:
```
virtiofs virtio2: device must provide VIRTIO_F_ACCESS_PLATFORM
```
After checking and double-checking with virtiofs and cloud-hypervisor
developers, it happens as confidential containers might put some
limitations on the device, so it can't access all of the guests' memory
and that's where this restriction seems to be coming from. Vivek
mentioned that virtiofsd do not support VIRTIO_F_ACCESS_PLATFORM (aka
VIRTIO_F_IOMMU_PLATFORM) yet, and that for ecrypted guests virtiofs may
not be the best solution at the moment.
@sboeuf put this in a very nice way: "if the virtio-fs driver doesn't
support VIRTIO_F_ACCESS_PLATFORM, then the pages corresponding to the
virtqueues and the buffers won't be marked as SHARED, meaning the VMM
won't have access to it".
Interestingly enough, it works with QEMU, and it may be due to some
change done on the patched QEMU that @devimc is packaging, but we won't
take the path to figure out what was the change and patch
cloud-hypervisor on the same way, because of 1.
Fixes: #3810
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit df8ffecde0, as device
hotplug *is* supported and, more than that, is very much needed when
using virtio-blk instead of virtio-fs.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
virtio-fs, instead of virtio-9p, is the default shared file system type
in case virtio-blk is not used.
Fixes: #3813
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Relying on virtio-block is the *only* way to use Firecracker with Kata
Containers, as shared FS (virtio-{fs,fs-nydus,9p}) is not supported by
Firecracker.
As configuration doesn't make sense to be exposed, we hardcode the
`false` value in the Firecracker configuration structure.
Fixes: #3813
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's clarify that an error will be reported in case confidential_guest
is enabled, but the hardware where Kata Containers is running doesn't
provide the required feature set.
Fixes: #3787
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's use "Intel TDX" rather than just "TDX", as it can ease the
understanding of the terminology.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>