Commit Graph

104 Commits

Author SHA1 Message Date
Zvonko Kaiser
fbacc09646 gpu: PCIe topology, consider vhost-user-block in Virt
In Virt the vhost-user-block is an PCIe device so
we need to make sure to consider it as well. We're keeping
track of vhost-user-block devices and deduce the correct
amount of PCIe root ports.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-15 17:39:55 +00:00
Zvonko Kaiser
8f0d4e2612 vfio: Cleanup of Cold and Hot Plug
Removed the configuration of PCIeRootPort and PCIeSwitchPort, those
values can be deduced in createPCIeTopology

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
b1aa8c8a24 gpu: Moved the PCIe configs to drivers
The hypervisor_state file was the wrong location for the PCIe Port
settings, moved everything under device umbrella, where it can be
consumed more easily and we do not get into circular deps.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
55a66eb7fb gpu: Add config to TOML
Update cold-plug and hot-plug setting to include bridge, root and
switch-port

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
da42801c38 gpu: Add config settings tests for hot-plug
Updated all references and config settings for hot-plug to match
cold-plug

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
de39fb7d38 runtime: Add support for GPUDirect and GPUDirect RDMA PCIe topology
Fixes: #4491

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
dded731db3 gpu: Add OVMF setting for MMIO aperture
The default size of OVMFs aperture is too low to
initialized PCIe devices with huge BARs

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
c8cf7ed3bc gpu: Add ColdPlug of VFIO devices with devManager
If we have a VFIO device and cold-plug is enabled
we mark each device as ColdPlug=true and let the VFIO
module do the attaching.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
e2b5e7f73b gpu: Add Rawdevices to hypervisor
RawDevics are used to get PCIe device info early before the sandbox
is started to make better PCIe topology decisions

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
377ebc2ad1 gpu: Add configuration option for cold-plug VFIO
Users can set cold-plug="root-port" to cold plug a VFIO device in QEMU

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
zhaojizhuang
ca02c9f512 runtime: add reconnect timeout for vhost user block
Fixes: #6075
Signed-off-by: zhaojizhuang <571130360@qq.com>
2023-02-13 14:33:46 +08:00
yaoyinnan
bdf20b5d26 rootfs: support EROFS filesystem
For kata containers, rootfs is used in the read-only way.
EROFS can noticably decrease metadata overhead.

On the basis of supporting the EROFS file system, it supports using the config parameter to switch the file system used by rootfs.

Fixes: #6063

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: yaoyinnan <yaoyinnan@foxmail.com>
2023-02-11 00:44:13 +08:00
zhaojizhuang
9092c23a2e runtime: Add hmp for qemu
Fixes: #6092
Signed-off-by: zhaojizhuang <571130360@qq.com>
2023-01-29 14:22:04 +08:00
Eric Ernst
6ee550e9a5 runtime: vCPUs pinning is sandbox specific, not hypervisor
While at it, make sure we persist this and fix a misc typo.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2023-01-12 15:44:25 -08:00
Bin Liu
86a82cace9 runtime: change cache mode from none to never
New Rust virtiofsd's `cache` mode doesn't support `none` mode,
we should use `never` to replace it.

Fixes: #6018

Signed-off-by: Bin Liu <bin@hyper.sh>
2023-01-10 17:29:48 +08:00
Eric Ernst
9ec8a13985 virtcontainers: introduce hypervisor_darwin
Fixes: #5995

Placeholder skeleton at this point - implementation will be added after
basic build refactoring lands.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Danny Canter <danny@dcantah.dev>
2023-01-06 02:03:34 -08:00
Manabu Sugimoto
c617bbe70d runtime: Pass SELinux policy for containers to the agent
Pass SELinux policy for containers to the agent if `disable_guest_selinux`
is set to `false` in the runtime configuration. The `container_t` type
is applied to the container process inside the guest by default.
Users can also set a custom SELinux policy to the container process using
`guest_selinux_label` in the runtime configuration. This will be an
alternative configuration of Kubernetes' security context for SELinux
because users cannot specify the policy in Kata through Kubernetes's security
context. To apply SELinux policy to the container, the guest rootfs must
be CentOS that is created and built with `SELINUX=yes`.

Fixes: #4812

Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
2022-11-29 19:07:56 +09:00
liyuxuan.darfux
3bb145c63a runtime: Support virtiofs queue size for qemu and make it configurable
The default vhost-user-fs queue-size of qemu is 128 now. Set it to 1024
by default which is same as clh. Also make this value configurable.

Fixes: #5694

Signed-off-by: liyuxuan.darfux <liyuxuan.darfux@bytedance.com>
2022-11-19 15:38:11 +08:00
LitFlwr0
2508d39b7c runtime: added vcpus pinning logics
Core VCPU threads pinning logics for issue 4476. Also provided docs.

Fixes:#4476
Signed-off-by: LitFlwr0 <861690705@qq.com>
2022-11-04 17:52:42 +08:00
Peng Tao
8a2df6b31c Merge pull request #4931 from jpecholt/snp-support
Added SNP-Support for Kata-Containers
2022-09-27 14:17:54 +08:00
Joana Pecholt
ded60173d4 runtime: Enable choice between AMD SEV and SNP
This is based on a patch from @niteeshkd that adds a config
parameter to choose between AMD SEV and SEV-SNP VMs as the
confidential guest type in case both types are supported. SEV is
the default.

Signed-off-by: Joana Pecholt <joana.pecholt@aisec.fraunhofer.de>
2022-09-16 17:51:41 +02:00
Joana Pecholt
22bda0838c runtime: Support for AMD SEV-SNP VMs
This commit adds AMD SEV-SNP as a confidential guest option to the
runtime. Information on required components such as OVMF, QEMU and
a kernel supporting SEV-SNP are defined in the versions file and
corresponding configs are added.

Note: The CPU model 'host' provided by the current SNP-QEMU does
not support all SNP capabilities yet, which is why this option is
changed to EPYC-v4.

Note: The guest's physical address space reduction specified with
ReducedPhysBits is 1. Details are can be found in Section 15.34.6
here https://www.amd.com/system/files/TechDocs/24593.pdf

Fixes #4437

Signed-off-by: Joana Pecholt <joana.pecholt@aisec.fraunhofer.de>
2022-09-16 17:51:41 +02:00
Feng Wang
f914319874 runtime: store the user name in hypervisor config
The user name will be used to delete the user instead of relying on
uid lookup because uid can be reused.

Fixes: #5155

Signed-off-by: Feng Wang <feng.wang@databricks.com>
2022-09-13 10:32:55 -07:00
Eric Ernst
e0142db24f hypervisor: Add GetTotalMemoryMB to interface
It'll be useful to get the total memory provided to the guest
(hotplugged + coldplugged). We'll use this information when calcualting
how much memory we can add at a time when utilizing ACPI hotplug.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-08-30 16:37:47 -07:00
Archana Shinde
7d52934ec1 Merge pull request #4798 from amshinde/use-iouring-qemu
Use iouring for qemu block devices
2022-08-26 04:00:24 +05:30
Archana Shinde
ed0f1d0b32 config: Add "block_device_aio" as a config option for qemu
This configuration will allow users to choose between different
I/O backends for qemu, with the default being io_uring.
This will allow users to fallback to a different I/O mechanism while
running on kernels olders than 5.1.

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
2022-08-05 13:16:34 -07:00
wllenyj
274598ae56 kata-runtime: add dragonball config check support.
add dragonball config check support.

Signed-off-by: wllenyj <wllenyj@linux.alibaba.com>
2022-07-14 10:43:50 +08:00
Fabiano Fidêncio
aa561b49f5 Merge pull request #4540 from fidencio/topic/default_maxmemory
Add `default_maxmemory` config option
2022-06-30 12:08:15 +02:00
Fabiano Fidêncio
323271403e virtcontainers: Remove unused function
While working on the previous commits, some of the functions become
non-used.  Let's simply remove them.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-28 21:19:24 +02:00
Fabiano Fidêncio
afdc960424 hypervisor: Add default_maxmemory configuration
Let's add a `default_maxmemory` configuration, which allows the admins
to set the maximum amount of memory to be used by a VM, considering the
initial amount + whatever ends up being hotplugged via the pod limits.

By default this value is 0 (zero), and it means that the whole physical
RAM is the limit.

Fixes: #4516

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-28 08:32:15 +02:00
Eric Ernst
bdf5e5229b virtcontainers: validate hypervisor config outside of hypervisor itself
Depending on the user of it, the hypervisor from hypervisor interface
could have differing view on what is valid or not. To help decouple,
let's instead check the hypervisor config validity as part of the
sandbox creation, rather than as part of the CreateVM call within the
hypervisor interface implementation.

Fixes: #4251

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-06-27 11:53:41 -07:00
Bin Liu
27b1bb5ed9 Merge pull request #4467 from egernst/device-pkg
device package cleanup/refactor
2022-06-27 14:40:53 +08:00
Eric Ernst
f9e96c6506 runtime: device: move to top level package
Let's move device package to runtime/pkg instead of being buried under
virtcontainers.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-06-26 21:31:29 -07:00
Liang Zhou
ef925d40ce runtime: enable sandbox feature on qemu
Enable "-sandbox on" in qemu can introduce another protect layer
on the host, to make the secure container more secure.

The default option is disable because this feature may introduce some
performance cost, even though user can enable
/proc/sys/net/core/bpf_jit_enable to reduce the impact.

Fixes: #2266

Signed-off-by: Feng Wang <feng.wang@databricks.com>
2022-06-17 15:30:46 -07:00
Snir Sheriber
c67b9d2975 qemu: allow using legacy serial device for the console
This allows to get guest early boot logs which are usually
missed when virtconsole is used.
- It utilizes previous work on the govmm side:
https://github.com/kata-containers/govmm/pull/203
- unit test added

Fixes: #4237
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-17 12:06:11 +03:00
Fabiano Fidêncio
5b18575dfe hypervisor: Add disk bandwidth and operations rate limiters
This is the disk counterpart of the what was introduced for the network
as part of the previous commits in this series.

The newly added fields are:
* DiskRateLimiterBwMaxRate, defined in bits per second, which is used to
  control the network I/O bandwidth at the VM level.
* DiskRateLimiterBwOneTimeBurst, also defined in bits per second, which
  is used to define an *initial* max rate, which doesn't replenish.
* DiskRateLimiterOpsMaxRate, the operations per second equivalent of the
  DiskRateLimiterBwMaxRate.
* DiskRateLimiterOpsOneTimeBurst, the operations per second equivalent of
  the DiskRateLimiterBwOneTimeBurst.

For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:27:11 +02:00
Fabiano Fidêncio
2d35e6066d hypervisor: Add network bandwidth and operations rate limiters
In a similar way to what's already exposed as RxRateLimiterMaxRate and
TxRateLimiterMaxRate, let's add four new fields to the Hypervisor's
configuration.

The values added are related to bandwidth and operations rate limiters,
which have to be added so we can expose I/O throttling configurations to
users using Cloud Hypervisor as their preferred VMM.

The reason we cannot simply re-use {Rx,Tx}RateLimiterMaxRate is because
Cloud Hypervisor exposes a single MaxRate to be used for both inbound
and outbound queues.

The newly added fields are:
* NetRateLimiterBwMaxRate, defined in bits per second, which is used to
  control the network I/O bandwidth at the VM level.
* NetRateLimiterBwOneTimeBurst, also defined in bits per second, which
  is used to define an *initial* max rate, which doesn't replenish.
* NetRateLimiterOpsMaxRate, the operations per second equivalent of the
  NetRateLimiterBwMaxRate.
* NetRateLimiterOpsOneTimeBurst, the operations per second equivalent of
  the NetRateLimiterBwOneTimeBurst.

For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:22:42 +02:00
Jakob Naucke
ff17c756d2 runtime: Allow and require no initrd for SE
Previously, it was not permitted to have neither an initrd nor an image.
However, this is the exact config to use for Secure Execution, where the
initrd is part of the image to be specified as `-kernel`. Require the
configuration of no initrd for Secure Execution.

Also
- remove redundant code for image/initrd checking -- no need to check in
  `newQemuHypervisorConfig` (calling) when it is also checked in
  `getInitrdAndImage` (called)
- use `QemuCCWVirtio` constant when possible

Fixes: #3922
Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2022-03-25 18:36:12 +01:00
Tanweer Noor
082d538cb4 runtime: make selinux configurable
removes --tags selinux handling in the makefile (part of it introduced here: d78ffd6)
and makes selinux configurable via configuration.toml

Fixes: #3631
Signed-off-by: Tanweer Noor <tnoor@apple.com>
2022-02-25 10:33:46 -08:00
Fabiano Fidêncio
28c4c044e6 hypervisors: Confidential Guests do not support VCPUs hotplug
As confidential guests do not support VCPUs hotplug, let's set the
"DefaultMaxVCPUs" value to "NumVCPUs".

The reason to do this is to ensure that guests will be started with the
correct amount of VCPUs, without giving to the guest with all the
possible VCPUs the host could provide.

One clear side effect of this limitation is that workloads that would
require more VCPUs on their yaml definition will not run on this
scenario.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-02-25 16:49:21 +01:00
Samuel Ortiz
26b3f0017c virtcontainers: Split hypervisor into Linux and OS agnostic bits
Keep all the OS agnostic bits in the hypervisor.go and
hypervisor_ARCH.go files.

Fixes #3614

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-16 19:15:31 +01:00
Samuel Ortiz
c91035d0e1 virtcontainers: Move non QEMU specific constants to hypervisor.go
Hotplugging errors and 9pfs size are not particularily QEMU specific.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-16 19:07:34 +01:00
Samuel Ortiz
10ae05914c virtcontainers: Move guest protection definitions to hypervisor.go
They're not QEMU specific, other VMMs may implement support for it.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-16 19:07:31 +01:00
Samuel Ortiz
b28d0274ff virtcontainers: Make max vCPU config less QEMU specific
Even though it's still actually defined as the QEMU upper bound,
it's now abstracted away through govmm.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-16 19:06:32 +01:00
Samuel Ortiz
e0b264430d virtcontainers: Define a Network interface
And move the Linux implementation into a GOOS specific file.

Fixes #3005

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
844eb61992 virtcontainers: Have CreateVM use a Network reference
We are replacing the NetworkingNamespace structure with the Network
one, so we should have the hypervisor interface switching to it as well.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Julio Montes
1f29478b09 runtime: suppport split firmware
firmware can be split into FIRMWARE_VARS.fd (UEFI variables as
configuration) and FIRMWARE_CODE.fd (UEFI program image). UEFI
variables can be customized per each user while UEFI code is kept same.

fixes #3583

Signed-off-by: Julio Montes <julio.montes@intel.com>
2022-02-01 13:40:19 -06:00
Julio Montes
49223e67af runtime: remove enable_swap option
`enable_swap` option was added long time ago to add
`-realtime mlock=off` to the QEMU's command line.
Kata now supports QEMU 6, `-realtime` option has been deprecated and
`mlock=on` is causing unexpected behaviors in kata.
This patch removes support for `enable_swap`, `-realtime` and `mlock=`
since they are causing bugs in kata.

Signed-off-by: Julio Montes <julio.montes@intel.com>
2022-01-18 11:12:29 -06:00
Eric Ernst
ce92cadc7d vc: hypervisor: remove setSandbox
The hypervisor interface implementation should not know a thing about
sandboxes.

Fixes: #2882

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2021-11-19 12:20:41 -08:00
Eric Ernst
2227c46c25 vc: hypervisor: use our own logger
This'll end up moving to hypervisors pkg, but let's stop using virtLog,
instead introduce hvLogger.

Fixes: #2884

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2021-11-19 12:20:41 -08:00