Commit Graph

116 Commits

Author SHA1 Message Date
Greg Kurz
1f16b6627b runtime/qemu: Rework QMP/HMP support
PR #6146 added the possibility to control QEMU with an extra HMP socket
as an aid for debugging. This is great for development or bug chasing
but this raises some concerns in production.

The HMP monitor allows to temper with the VM state in a variety of ways.
This could be intentionally or mistakenly used to inject subtle bugs in
the VM that would be extremely hard if not even impossible to debug. We
definitely don't want that to be enabled by default.

The feature is currently wired to the `enable_debug` setting in the
`[hypervisor.qemu]` section of the configuration file. This setting has
historically been used to control "debug output" and it is used as such
by some downstream users (e.g. Openshift). Forcing people to have the
extra HMP backdoor at the same time is abusive and dangerous.

A new `extra_monitor_socket` is added to `[hypervisor.qemu]` to give
fine control on whether the HMP socket is wanted or not. This setting
is still gated by `enable_debug = true` to make it clear it is for
debug only. The default is to not have the HMP socket though. This
isn't backward compatible with #6416 but it is for the sake of "better
safe than sorry".

An extra monitor socket makes the QEMU instance untrusted. A warning is
thus logged to the journal when one is requested.

While here, also allow the user to choose between HMP and QMP for the
extra monitor socket. Motivation is that QMP offers way more options to
control or introspect the VM than HMP does. Users can also ask for
pretty json formatting well suited for human reading. This will improve
the debugging experience.

This feature is only made visible in the base and GPU configurations
of QEMU for now.

Fixes #7952

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-09-18 12:13:01 +02:00
Jeremi Piotrowski
bfc93927fb runtime: Remove redundant check in checkPCIeConfig
There is no way for this branch to be hit, as port is only set when it is
different than config.NoPort.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
2023-09-14 14:23:28 +02:00
Jeremi Piotrowski
fc51e4b9eb runtime: Check config for supported CLH (cold|hot)_plug_vfio values
The only supported options are hot_plug_vfio=root-port or no-port.
cold_plug_vfio not supported yet.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
2023-09-14 14:23:28 +02:00
Zvonko Kaiser
1fc715bc65 s390x: Add AP Attach/Detach test
Now that we have propper AP device support add a
unit test for testing the correct Attach/Detach of AP devices.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-23 13:44:19 +00:00
Zvonko Kaiser
dd422ccb69 vfio: Remove obsolete HotplugVFIOonRootBus
Removing HotplugVFIOonRootBus which is obsolete with the latest PCI
topology changes, users can set cold_plug_vfio or hot_plug_vfio either
in the configuration.toml or via annotations.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-20 07:25:40 +00:00
Peng Tao
581be92b25 Merge pull request #4492 from zvonkok/pcie-topology
runtime: fix PCIe topology for GPUDirect use-case
2023-07-03 09:17:12 +08:00
Fabiano Fidêncio
6a21e20c63 runtime: Add "none" as a shared_fs option
Currently, even when using devmapper, if the VMM supports virtio-fs /
virtio-9p, that's used to share a few files between the host and the
guest.

This *needed*, as we need to share with the guest contents like secrets,
certificates, and configurations, via Kubernetes objects like configMaps
or secrets, and those are rotated and must be updated into the guest
whenever the rotation happens.

However, there are still use-cases users can live with just copying
those files into the guest at the pod creation time, and for those
there's absolutely no need to have a shared filesystem process running
with no extra obvious benefit, consuming memory and even increasing the
attack surface used by Kata Containers.

For the case mentioned above, we should allow users, making it very
clear which limitations it'll bring, to run Kata Containers with
devmapper without actually having to use a shared file system, which is
already the approach taken when using Firecracker as the VMM.

Fixes: #7207

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-06-30 20:45:00 +02:00
Zvonko Kaiser
8f0d4e2612 vfio: Cleanup of Cold and Hot Plug
Removed the configuration of PCIeRootPort and PCIeSwitchPort, those
values can be deduced in createPCIeTopology

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
b1aa8c8a24 gpu: Moved the PCIe configs to drivers
The hypervisor_state file was the wrong location for the PCIe Port
settings, moved everything under device umbrella, where it can be
consumed more easily and we do not get into circular deps.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
55a66eb7fb gpu: Add config to TOML
Update cold-plug and hot-plug setting to include bridge, root and
switch-port

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
da42801c38 gpu: Add config settings tests for hot-plug
Updated all references and config settings for hot-plug to match
cold-plug

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
de39fb7d38 runtime: Add support for GPUDirect and GPUDirect RDMA PCIe topology
Fixes: #4491

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
13d7f39c71 gpu: Check for VFIO port assignments
Bailing out early if the port is wrong, allowed port settings are
no-port, root-port, switch-port

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-05-03 12:32:33 +00:00
Zvonko Kaiser
6107c32d70 gpu: Assign default value to cold-plug
Make sure the configuration is propagated to the right structs
and the default value is assigned.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
377ebc2ad1 gpu: Add configuration option for cold-plug VFIO
Users can set cold-plug="root-port" to cold plug a VFIO device in QEMU

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Fabiano Fidêncio
50ce33b02d Merge pull request #6205 from fengwang666/non-root-clh
runtime: support non-root for clh
2023-04-11 19:34:00 +02:00
Jianyong Wu
ece5edc641 qemu/arm64: disable image nvdimm if no firmware offered
For now, image nvdimm on qemu/arm64 depends on UEFI/ACPI, so if there
is no firmware offered, it should be disabled.

Fixes: #6468
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-03-20 18:03:05 +08:00
Feng Wang
cbe6ad9034 runtime: support non-root for clh
This change enables to run cloud-hypervisor VMM using a non-root user
when rootless flag is set true in the configuration

Fixes: #2567

Signed-off-by: Feng Wang <fwang@confluent.io>
2023-02-22 13:57:09 -08:00
zhaojizhuang
ca02c9f512 runtime: add reconnect timeout for vhost user block
Fixes: #6075
Signed-off-by: zhaojizhuang <571130360@qq.com>
2023-02-13 14:33:46 +08:00
yaoyinnan
bdf20b5d26 rootfs: support EROFS filesystem
For kata containers, rootfs is used in the read-only way.
EROFS can noticably decrease metadata overhead.

On the basis of supporting the EROFS file system, it supports using the config parameter to switch the file system used by rootfs.

Fixes: #6063

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: yaoyinnan <yaoyinnan@foxmail.com>
2023-02-11 00:44:13 +08:00
Eric Ernst
6ee550e9a5 runtime: vCPUs pinning is sandbox specific, not hypervisor
While at it, make sure we persist this and fix a misc typo.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2023-01-12 15:44:25 -08:00
Manabu Sugimoto
c617bbe70d runtime: Pass SELinux policy for containers to the agent
Pass SELinux policy for containers to the agent if `disable_guest_selinux`
is set to `false` in the runtime configuration. The `container_t` type
is applied to the container process inside the guest by default.
Users can also set a custom SELinux policy to the container process using
`guest_selinux_label` in the runtime configuration. This will be an
alternative configuration of Kubernetes' security context for SELinux
because users cannot specify the policy in Kata through Kubernetes's security
context. To apply SELinux policy to the container, the guest rootfs must
be CentOS that is created and built with `SELINUX=yes`.

Fixes: #4812

Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
2022-11-29 19:07:56 +09:00
Fabiano Fidêncio
df3d9878d5 Merge pull request #5695 from darfux/virtiofs-queue-size
runtime: Support virtiofs queue size for qemu and make it configurable
2022-11-22 20:04:30 +01:00
liyuxuan.darfux
3bb145c63a runtime: Support virtiofs queue size for qemu and make it configurable
The default vhost-user-fs queue-size of qemu is 128 now. Set it to 1024
by default which is same as clh. Also make this value configurable.

Fixes: #5694

Signed-off-by: liyuxuan.darfux <liyuxuan.darfux@bytedance.com>
2022-11-19 15:38:11 +08:00
Fabiano Fidêncio
d94718fb30 runtime: Fix gofmt issues
It seems that bumping the version of golang and golangci-lint new format
changes are required.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-11-17 14:16:12 +01:00
Fabiano Fidêncio
16b8375095 golang: Stop using io/ioutils
The package has been deprecated as part of 1.16 and the same
functionality is now provided by either the io or the os package.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-11-17 13:43:25 +01:00
LitFlwr0
2508d39b7c runtime: added vcpus pinning logics
Core VCPU threads pinning logics for issue 4476. Also provided docs.

Fixes:#4476
Signed-off-by: LitFlwr0 <861690705@qq.com>
2022-11-04 17:52:42 +08:00
Joana Pecholt
ded60173d4 runtime: Enable choice between AMD SEV and SNP
This is based on a patch from @niteeshkd that adds a config
parameter to choose between AMD SEV and SEV-SNP VMs as the
confidential guest type in case both types are supported. SEV is
the default.

Signed-off-by: Joana Pecholt <joana.pecholt@aisec.fraunhofer.de>
2022-09-16 17:51:41 +02:00
Archana Shinde
7d52934ec1 Merge pull request #4798 from amshinde/use-iouring-qemu
Use iouring for qemu block devices
2022-08-26 04:00:24 +05:30
Archana Shinde
ed0f1d0b32 config: Add "block_device_aio" as a config option for qemu
This configuration will allow users to choose between different
I/O backends for qemu, with the default being io_uring.
This will allow users to fallback to a different I/O mechanism while
running on kernels olders than 5.1.

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
2022-08-05 13:16:34 -07:00
wllenyj
274598ae56 kata-runtime: add dragonball config check support.
add dragonball config check support.

Signed-off-by: wllenyj <wllenyj@linux.alibaba.com>
2022-07-14 10:43:50 +08:00
Manabu Sugimoto
4d89476c91 runtime: Fix DisableSelinux config
Enable Kata runtime to handle `disable_selinux` flag properly in order
to be able to change the status by the runtime configuration whether the
runtime applies the SELinux label to VMM process.

Fixes: #4599
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
2022-07-06 15:50:28 +09:00
Fabiano Fidêncio
071dd4c790 Merge pull request #4109 from pmores/drop-in-cfg-files-support
Drop in cfg files support
2022-07-05 22:21:24 +02:00
Fabiano Fidêncio
aa561b49f5 Merge pull request #4540 from fidencio/topic/default_maxmemory
Add `default_maxmemory` config option
2022-06-30 12:08:15 +02:00
Pavel Mores
99f5ca80fc runtime: Plug drop-in decoding into decodeConfig()
Fixes #4108

Signed-off-by: Pavel Mores <pmores@redhat.com>
2022-06-29 09:54:38 +02:00
Pavel Mores
0f9856c465 runtime: Scan drop-in directory, read files and decode them
updateFromDropIn() uses the infrastructure built by previous commits to
ensure no contents of 'tomlConfig' are lost during decoding.   To do
this, we preserve the current contents of our tomlConfig in a clone and
decode a drop-in into the original.  At this point, the original
instance is updated but its Agent and/or Hypervisor fields are
potentially damaged.

To merge, we update the clone's Agent/Hypervisor from the original
instance.   Now the clone has the desired Agent/Hypervisor and the
original instance has the rest, so to finish, we just need to move the
clone's Agent/Hypervisor to the original.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2022-06-29 09:54:38 +02:00
Pavel Mores
2c1efcc697 runtime: Add helpers to copy fields between tomlConfig instances
These functions take a TOML key - an array of individual components,
e.g. ["agent" "kata" "enable_tracing"], as returned by BurntSushi - and
two 'tomlConfig' instances.  They copy the value of the struct field
identified by the key from the source instance to the target one if
necessary.

This is only done if the TOML key points to structures stored in
maps by 'tomlConfig', i.e. 'hypervisor' and 'agent'.  Nothing needs to
be done in other cases.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2022-06-29 09:54:38 +02:00
Pavel Mores
20f11877be runtime: Add framework to manipulate config structs via reflection
For 'tomlConfig' substructures stored in Golang maps - 'hypervisor' and
'agent' - BurntSushi doesn't preserve their previous contents as it does
for substructures stored directly (e.g. 'runtime').  We use reflection
to work around this.

This commit adds three primitive operations to work with struct fields
identified by their `toml:"..."` tags - one to get a field value, one to
set a field value and one to assign a source struct field value to the
corresponding field of a target.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2022-06-29 09:54:38 +02:00
Fabiano Fidêncio
afdc960424 hypervisor: Add default_maxmemory configuration
Let's add a `default_maxmemory` configuration, which allows the admins
to set the maximum amount of memory to be used by a VM, considering the
initial amount + whatever ends up being hotplugged via the pod limits.

By default this value is 0 (zero), and it means that the whole physical
RAM is the limit.

Fixes: #4516

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-28 08:32:15 +02:00
Eric Ernst
469e098543 katautils: don't do validation when loading hypervisor config
Policy for whats valid/invalid within the config varies by VMM, host,
and by silicon architecture. Let's keep katautils simple for just
translating a toml to the hypervisor config structure, and leave
validation to virtcontainers.

Without this change, we're doing duplicate validation.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-06-27 10:13:26 -07:00
Bin Liu
27b1bb5ed9 Merge pull request #4467 from egernst/device-pkg
device package cleanup/refactor
2022-06-27 14:40:53 +08:00
Eric Ernst
f9e96c6506 runtime: device: move to top level package
Let's move device package to runtime/pkg instead of being buried under
virtcontainers.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-06-26 21:31:29 -07:00
Liang Zhou
ef925d40ce runtime: enable sandbox feature on qemu
Enable "-sandbox on" in qemu can introduce another protect layer
on the host, to make the secure container more secure.

The default option is disable because this feature may introduce some
performance cost, even though user can enable
/proc/sys/net/core/bpf_jit_enable to reduce the impact.

Fixes: #2266

Signed-off-by: Feng Wang <feng.wang@databricks.com>
2022-06-17 15:30:46 -07:00
Snir Sheriber
c67b9d2975 qemu: allow using legacy serial device for the console
This allows to get guest early boot logs which are usually
missed when virtconsole is used.
- It utilizes previous work on the govmm side:
https://github.com/kata-containers/govmm/pull/203
- unit test added

Fixes: #4237
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-17 12:06:11 +03:00
Fabiano Fidêncio
511f7f822d config: Add DiskRateLimiter* to Cloud Hypervisor
Let's add the newly added disk rate limiter configurations to the Cloud
Hypervisor's hypervisor configuration.

Right now those are not used anywhere, and there's absolutely no way the
users can set those up.  That's coming later in this very same series.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:27:15 +02:00
Fabiano Fidêncio
5b18575dfe hypervisor: Add disk bandwidth and operations rate limiters
This is the disk counterpart of the what was introduced for the network
as part of the previous commits in this series.

The newly added fields are:
* DiskRateLimiterBwMaxRate, defined in bits per second, which is used to
  control the network I/O bandwidth at the VM level.
* DiskRateLimiterBwOneTimeBurst, also defined in bits per second, which
  is used to define an *initial* max rate, which doesn't replenish.
* DiskRateLimiterOpsMaxRate, the operations per second equivalent of the
  DiskRateLimiterBwMaxRate.
* DiskRateLimiterOpsOneTimeBurst, the operations per second equivalent of
  the DiskRateLimiterBwOneTimeBurst.

For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:27:11 +02:00
Fabiano Fidêncio
c9f6496d6d config: Add NetRateLimiter* to Cloud Hypervisor
Let's add the newly added network rate limiter configurations to the
Cloud Hypervisor's hypervisor configuration.

Right now those are not used anywhere, and there's absolutely no way the
users can set those up.  That's coming later in this very same series.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:22:42 +02:00
Fabiano Fidêncio
2d35e6066d hypervisor: Add network bandwidth and operations rate limiters
In a similar way to what's already exposed as RxRateLimiterMaxRate and
TxRateLimiterMaxRate, let's add four new fields to the Hypervisor's
configuration.

The values added are related to bandwidth and operations rate limiters,
which have to be added so we can expose I/O throttling configurations to
users using Cloud Hypervisor as their preferred VMM.

The reason we cannot simply re-use {Rx,Tx}RateLimiterMaxRate is because
Cloud Hypervisor exposes a single MaxRate to be used for both inbound
and outbound queues.

The newly added fields are:
* NetRateLimiterBwMaxRate, defined in bits per second, which is used to
  control the network I/O bandwidth at the VM level.
* NetRateLimiterBwOneTimeBurst, also defined in bits per second, which
  is used to define an *initial* max rate, which doesn't replenish.
* NetRateLimiterOpsMaxRate, the operations per second equivalent of the
  NetRateLimiterBwMaxRate.
* NetRateLimiterOpsOneTimeBurst, the operations per second equivalent of
  the NetRateLimiterBwOneTimeBurst.

For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:22:42 +02:00
Jakob Naucke
ff17c756d2 runtime: Allow and require no initrd for SE
Previously, it was not permitted to have neither an initrd nor an image.
However, this is the exact config to use for Secure Execution, where the
initrd is part of the image to be specified as `-kernel`. Require the
configuration of no initrd for Secure Execution.

Also
- remove redundant code for image/initrd checking -- no need to check in
  `newQemuHypervisorConfig` (calling) when it is also checked in
  `getInitrdAndImage` (called)
- use `QemuCCWVirtio` constant when possible

Fixes: #3922
Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2022-03-25 18:36:12 +01:00
Evan Foster
afc567a9ae storage: make k8s emptyDir creation configurable
This change introduces the `disable_guest_empty_dir` config option,
which allows the user to change whether a Kubernetes emptyDir volume is
created on the guest (the default, for performance reasons), or the host
(necessary if you want to pass data from the host to a guest via an
emptyDir).

Fixes #2053

Signed-off-by: Evan Foster <efoster@adobe.com>
2022-03-04 12:02:42 -08:00