[ port from runtime repository commit 4d4a153af5cb145215cb6e6e386eac2bcb8c3e32 ]
Commit b4385901da ("qemu/arm64: Detect host GIC version to configure guest
GIC") reads /proc/interrupts to detect the host gic version.
But on a ThunderX2 host with 224 CPUs, /proc/interrupts is ~762K bytes,
so reading it costs ~900K bytes of memory overhead.
From the go tool pprof results:
      flat  flat%   sum%        cum   cum%
  976.89kB   100%   100%   976.89kB   100%  github.com/kata-containers/runtime/virtcontainers.getHostGICVersion
Although the allocated memory will be freed, it seems worth removing the
read to speed up the runtime.
As per [1], there is no perfect way to detect the gic version on the host.
On the qemu side, if we use "gic-version=host", qemu will automatically detect
the version via a kvm ioctl. So we'd better let qemu determine the gic version.
If users really want to start the vm with gic-version=2, they can set it
in the machine_accelerators option.
[1] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-October/011690.html
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
[ port from runtime repository commit e36389e25ea5aa778be8eb5628a3353bb13305bb ]
After backporting the patch series that enables memory hot remove on aarch64
to v5.4.x, we can finally enable nvdimm/dax on aarch64.
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
[ port from runtime repository commit 7e4704611137b75579696ece6728bd30f705128a ]
If major version matches max supported major, we continue comparing the minor version.
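A minimal sketch of that comparison, for illustration only. The `version` type and `withinMax` helper are hypothetical names, and the rule that a lower major is always accepted is an assumption not stated in this message:

```go
package main

import "fmt"

// version is a hypothetical major.minor pair used only for this illustration.
type version struct {
	major, minor int
}

// withinMax reports whether v does not exceed max. When the major numbers
// match, the decision falls through to the minor numbers.
func withinMax(v, max version) bool {
	if v.major != max.major {
		// Assumption: a lower major is accepted, a higher one is rejected.
		return v.major < max.major
	}
	return v.minor <= max.minor
}

func main() {
	max := version{major: 5, minor: 0}
	fmt.Println(withinMax(version{major: 4, minor: 2}, max)) // lower major
	fmt.Println(withinMax(version{major: 5, minor: 0}, max)) // same major, same minor
	fmt.Println(withinMax(version{major: 5, minor: 1}, max)) // same major, higher minor
}
```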
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
The qemuPaths field in qemuArchBase maps from machine type to the default
qemu path. But, by the time we construct it, we already know the machine
type, so that entry ends up being the only one we care about.
So, collapse the map into a single path. As a bonus, the qemuPath()
method can no longer fail.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The supportedQemuMachines array in qemuArchBase has a list of all the
qemu machine types supported for the architecture, with the options
for each. But, the machineType field already tells us which of the
machine types we're actually using, and that's the only entry we
actually care about.
So, drop the table, and just have a single value with the machine type
we're actually using. As a bonus that means the machine() method can
no longer fail, so no longer needs an error return.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently, newQemuArch() doesn't return an error. So, if passed an invalid
machine type, it will return a technically valid, but unusable qemuArch
object, which will probably fail with other errors shortly down the track.
Change this, to more cleanly fail the newQemuArch itself, letting us
detect a bad machine type earlier.
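The shape of the change can be sketched as follows. This is not the real newQemuArch(): the `arch` type, the `newArch` constructor and the accepted machine type list are hypothetical stand-ins chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// arch is a hypothetical stand-in for the architecture object.
type arch struct {
	machineType string
}

// errUnsupportedMachine is an illustrative sentinel error.
var errUnsupportedMachine = errors.New("unsupported machine type")

// newArch is a hypothetical constructor. It rejects an unknown machine type
// up front instead of returning an object that fails later.
func newArch(machineType string) (*arch, error) {
	switch machineType {
	case "pc", "q35": // assumed set of valid types for this sketch
		return &arch{machineType: machineType}, nil
	default:
		return nil, fmt.Errorf("%w: %q", errUnsupportedMachine, machineType)
	}
}

func main() {
	if _, err := newArch("q35pc"); err != nil {
		fmt.Println("rejected early:", err)
	}
}
```

Callers then handle the error at construction time rather than chasing a secondary failure.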
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The last stanza of TestQemuAmd64Bridges is rather odd. It tries to create
a qemu instance with a machine type of (QemuQ35 + QemuPC), or in other
words "q35pc", which isn't a thing.
What it's asserting about this is that the returned bridges list is empty
despite asking for bridges, so it looks like what this is really trying to
test is for sane behaviour when given a bad machine type.
So, split this out into a separate test, and make it explicit for clarity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Implement a tc-based tx rate limiter to control network I/O outbound traffic
at the VM level for hypervisors which don't support a built-in rate limiter.
We take different actions depending on the inter-networking model.
For tcfilters as the inter-networking model, we simply apply the htb
qdisc on the virtual netpair.
For other inter-networking models, such as macvtap, we resort to ifb,
redirecting interface ingress traffic to ifb egress and then applying htb
to ifb egress.
Fixes: #250
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Ingress traffic shaping is very limited, and the htb
qdisc cannot be applied to interface ingress traffic.
Here, we introduce a new pseudo network interface, Intermediate Functional Block (ifb).
It is an alternative to tc filters for handling ingress traffic:
interface ingress traffic is redirected to ifb and treated as egress traffic there.
Fixes: #250
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
For hypervisors that support a built-in rate limiter, like firecracker,
we use this built-in feature to implement the rate limiter in kata.
The kata-defined rate is in bits with scaling factors of 1000, whereas the
fc-defined rate is in bytes with scaling factors of 1024, so a conversion is needed.
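The arithmetic behind the unit mismatch can be illustrated as below. This is a sketch only, not the runtime's conversion routine: `bitsToBytes` is a hypothetical helper, the sample rate is made up, and treating the rate as per-second is an assumption.

```go
package main

import "fmt"

const bitsPerByte = 8

// bitsToBytes is a hypothetical helper converting a rate in bits to bytes.
// Integer division truncates any remainder.
func bitsToBytes(bits uint64) uint64 {
	return bits / bitsPerByte
}

func main() {
	// 1 Mbit expressed with the 1000-based prefix (assumed sample value).
	rateBits := uint64(1000 * 1000)
	rateBytes := bitsToBytes(rateBits)

	fmt.Println(rateBytes) // the same rate in bytes
	// The same rate expressed with a 1024-based prefix (KiB).
	fmt.Printf("%.2f\n", float64(rateBytes)/1024)
}
```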
Fixes: #250
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Implement a tc-based rx rate limiter to control network I/O inbound traffic
at the VM level for hypervisors which don't support a built-in rate limiter.
In more detail, we use the HTB (Hierarchical Token Bucket) qdisc shaping scheme
to control host interface egress traffic.
HTB shapes traffic based on the Token Bucket Filter algorithm, and one
fundamental part of the HTB qdisc is the borrowing mechanism.
Child classes borrow tokens from their parents once they have exceeded rate,
and continue to attempt to borrow until they reach ceil. See more details in
https://tldp.org/HOWTO/Traffic-Control-HOWTO/classful-qdiscs.html
Fixes: #250
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
We use a tc-based or built-in rate limiter to shape network I/O traffic,
and either one must be tied to one specific interface/endpoint.
In order to tell whether a rate limiter has already been added to this
interface/endpoint, we add get/set functions to reveal/store that information.
Fixes: #250
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Some hypervisors, like firecracker, support a built-in rate limiter
to control network I/O bandwidth at the VMM level. Others, like qemu,
don't.
Fixes: #250
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Add configuration/annotation for network I/O throttling at the VM level.
rx_rate_limiter_max_rate controls network inbound
bandwidth per pod.
tx_rate_limiter_max_rate controls network outbound
bandwidth per pod.
Fixes: #250
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
The virtiofs daemon may run into errors other than the file
not existing, e.g. the file may not be executable.
Fixes: #2682
Message is now:
virtiofs daemon /usr/local/bin/hello returned with error:
fork/exec /usr/local/bin/virtiofsd: permission denied
instead of
panic: runtime error: invalid memory address or nil
Fixes: #2582
Message is now:
virtiofs daemon /usr/local/bin/hello-not-found returned with error:
fork/exec /usr/local/bin/hello-not-found: no such file or directory
instead of:
virtiofsd path (/usr/local/bin/hello-no-found) does not exist
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
Call the `pkg/cgroups` package `SetLogger()` function to ensure all its log
records contain all required structured logging fields.
Fixes: #2782
Signed-off-by: Julio Montes <julio.montes@intel.com>
[cherry picked from runtime commit 3c4fe035e8041b44e1f3e06d5247938be9a1db15]
Check if the shm mount is backed by an empty-dir memory based volume.
If so, let the logic that handles ephemeral volumes take care of this
mount, so that the shm mount within the container is backed by a tmpfs mount
within the container in the VM.
Fixes: #323
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
[cherry picked from runtime commit d0dbd0485d2f4ec3760f6fa1252ded86a7709042]
Call the `device/config` package `SetLogger()` function to ensure all its log
records contain all required structured logging fields.
Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
[ cherry-picked from runtime commit 13887bf89da9d2d7c215d77ca63129e1813e4c4a ]
Call the `store` package `SetLogger()` function to ensure all its log
records contain all required structured logging fields.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
We need to make sure containers cannot modify a host path unless it is explicitly shared with them. Right now we expose an additional top level shared directory to the guest and allow it to be modified. This is less than ideal and can be improved with the following method:
1. create two directories for each sandbox:
-. /run/kata-containers/shared/sandboxes/$sbx_id/mounts/, a directory to hold all host/guest shared mounts
-. /run/kata-containers/shared/sandboxes/$sbx_id/shared/, a host/guest shared directory (9pfs/virtiofs source dir)
2. /run/kata-containers/shared/sandboxes/$sbx_id/mounts/ is bind mounted read-only to /run/kata-containers/shared/sandboxes/$sbx_id/shared/, so the guest cannot modify it
3. host-guest shared files/directories are mounted one level under /run/kata-containers/shared/sandboxes/$sbx_id/mounts/ and are thus presented to the guest one level under /run/kata-containers/shared/sandboxes/$sbx_id/shared/
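The directory layout in the steps above can be sketched as path helpers. The function names (`mountsDir`, `sharedDir`, `hostMountPoint`, `guestVisiblePath`) are hypothetical and only illustrate the naming scheme; the bind mounts themselves are not performed here.

```go
package main

import (
	"fmt"
	"path"
)

// sandboxesRoot is the common prefix used by the paths in the description.
const sandboxesRoot = "/run/kata-containers/shared/sandboxes"

// mountsDir returns the host directory that holds all host/guest shared mounts.
func mountsDir(sandboxID string) string {
	return path.Join(sandboxesRoot, sandboxID, "mounts")
}

// sharedDir returns the directory exported to the guest (9pfs/virtiofs source).
func sharedDir(sandboxID string) string {
	return path.Join(sandboxesRoot, sandboxID, "shared")
}

// hostMountPoint is where a shared item is mounted on the host,
// one level under mounts/.
func hostMountPoint(sandboxID, name string) string {
	return path.Join(mountsDir(sandboxID), name)
}

// guestVisiblePath is where the same item shows up through the read-only
// bind mount, one level under shared/.
func guestVisiblePath(sandboxID, name string) string {
	return path.Join(sharedDir(sandboxID), name)
}

func main() {
	fmt.Println(hostMountPoint("sbx1", "rootfs"))
	fmt.Println(guestVisiblePath("sbx1", "rootfs"))
}
```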
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
When an x86 sandbox has a vIOMMU (needed for VFIO), it needs the
'kernel_irqchip=split' option or it can't start. fdcd1f3a2 attempts to set
that, but ends up just writing it to a temporary (looks like Go for range
loops pass by value).
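The pitfall is easy to reproduce in isolation. In the sketch below, the `machine` type and both helpers are hypothetical, not the runtime's code. Ranging by value updates a copy of each element, while indexing into the slice updates the element itself:

```go
package main

import "fmt"

// machine is a hypothetical stand-in for a machine definition.
type machine struct {
	name    string
	options string
}

// appendByValue changes only the loop variable, which is a copy of each
// element, so the slice is left untouched.
func appendByValue(ms []machine, opt string) {
	for _, m := range ms {
		m.options += opt
	}
}

// appendByIndex changes the slice element itself.
func appendByIndex(ms []machine, opt string) {
	for i := range ms {
		ms[i].options += opt
	}
}

func main() {
	ms := []machine{{name: "q35", options: "accel=kvm"}}

	appendByValue(ms, ",kernel_irqchip=split")
	fmt.Println(ms[0].options) // option was written to a temporary

	appendByIndex(ms, ",kernel_irqchip=split")
	fmt.Println(ms[0].options) // option is now part of the element
}
```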
Fixes: #2694
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
Add a configuration option and a Pod Annotation
If activated:
- Add kernel parameters to load iommu
- Add irqchip=split in the kvm options
- Add a vIOMMU to the VM
Fixes: #2694
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
Add a new function appendIOMMU() to the qemuArch interface
and provide an implementation on amd64 architecture.
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
The ppc64 specific qemu setup code adds a "pmu=off" parameter to the cpu
model if the nestedRun option is set. But, not only does availability of
the pmu have nothing to do with nesting on POWER, there is no "pmu=" cpu
option for ppc64 at all.
So, simply remove it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Hard-coded Qemu machine options create challenges when running Kata
with the latest Qemu (v5.0) or with the latest processor version.
This patch makes them configurable by leveraging the existing machine_accelerators
option in configuration.toml.
This patch fixes #2657 for ppc64le
Signed-off-by: bpradipt@in.ibm.com
The default ppc64le Qemu binary path was specific for Ubuntu.
This patch fixes the default binary path for both Fedora and Ubuntu
Fixes: #2738
Signed-off-by: bpradipt@in.ibm.com
qemu_ppc64le.go applies the "tsc=reliable", "no_timer_check" and
"noreplace-smp" kernel parameters, despite those being x86 specific. So,
just remove them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The Qemu version check in unit test case is no longer needed for
Power since we don't support Kata with Qemu version < 4.x.
Fixes: #315
Signed-off-by: bpradipt@in.ibm.com
Add grpc API for adding arp neighbours for a network
interface. These are expected to be static arp entries
sent by the runtime.
Signed-off-by: Tim Zhang <tim@hyper.sh>
Instead of having the versions.yaml in the runtime source,
it makes more sense to have it in the root directory of
the project.
Signed-off-by: Salvador Fuentes <salvador.fuentes@intel.com>
The new HTTP API from CLH removes support for multiple
virtio-vsock devices, as the Linux kernel does not support them.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Changes:
96be8229 release: Release v0.7.0
5115ad6e vmm: config: Support on/off/true/false for all booleans
d5bfa2df vmm, vhost_user_block: Make parameter names match --disk
2f0bc06b vmm: Update default devices names as "internal"
aaba6e77 vmm: Add virtio-console to the list of Migratable devices
9ab4bb1a devices: serial: Expect an identifier upon device creation
06487131 vm-virtio: pci: Expect an identifier upon device creation
eeb7e10d vm-virtio: mmio: Expect an identifier upon device creation
9d84ef50 vmm: Make the virtio identifier mandatory
14350f5d devices: ioapic: Expect an identifier upon device creation
55687157 vm-virtio: iommu: Expect an identifier upon device creation
052eff1c vm-virtio: console: Expect an identifier upon device creation
354c2a4b vm-virtio: vhost-user-net: Expect an identifier upon device creation
46e0b3ff vm-virtio: vhost-user-blk: Expect an identifier upon device creation
bb7fa71f vm-virtio: vhost-user-fs: Expect an identifier upon device creation
ec5ff395 vm-virtio: vsock: Expect an identifier upon device creation
9b53044a vm-virtio: mem: Expect an identifier upon device creation
1592a929 vm-virtio: pmem: Expect an identifier upon device creation
2e91b738 vm-virtio: rng: Expect an identifier upon device creation
9eb7413f vm-virtio: net: Expect an identifier upon device creation
be946caf vm-virtio: blk: Expect an identifier upon device creation
ff9c8b84 vmm: Always generate the next device name
81831413 vmm: Add an identifier to the ioapic device
e4386c8b vmm: Add an identifier to the virtio-iommu device
75ddd2a2 vmm: Add an identifier to the --console device
eac350c4 vmm: Add an identifier to the virtio-mem device
6802ef54 vmm: Add an identifier to the --rng device
d71d52e9 vmm: Fix virtio-console creation with virtual IOMMU
b08fde59 vmm: Fix virtio-rng creation with virtual IOMMU
8031ac33 vmm: Fix virtio-vsock creation with virtual IOMMU
50134969 Jenkins: Run musl unit and integration tests on master branch
ce794f78 ci: Pass target triple to the test scripts
33b0e158 resources: Add musl tools and toolchain to the Dockerfile
ad9374bd dev_cli: Add --libc to the build and test commands
8cef3574 vmm: seccomp: Add fork, gettid and pipe2 syscalls to permitted list
ce7678f2 vmm: seccomp: Add tkill syscall to permitted list
12758d7f vmm: seccomp: Add epoll_pwait syscall to permitted list
86fcd19b build: Initial musl support
a5de4955 vmm: Only allow removal of specific types of virtio device
9ed880d7 vmm: Add an identifier to the --fs device
7e0ab6b5 vmm: Fix pmem device creation
3012975c tests: Enhance vsock integration test to support hotplug
6c2bca5f bin: ch-remote: Add support for adding vsock devices
8de7448d vmm: api: Add "add-vsock" API entry point
bf09a1e6 openapi: Add "id" field to VsockConfig
a76cf086 vmm: vm: Remove vsock device from config
99422324 vmm: vm: Add "add_vsock()"
1d61c476 vmm: device_manager: Add support for hotplugging virtio-vsock devices
f8501a3b vmm: config: Move --vsock syntax to VsockConfig
6e049e0d vmm: Add an identifier to the --vsock device
10348f73 vmm, main: Support only zero or one vsock devices
9d1f95a3 openapi: Add missing "id" field
30e2e515 build(deps): bump serde_json from 1.0.51 to 1.0.52
dd9d0d04 build(deps): bump micro_http from `0d87a94` to `c9ffb90`
cdc8493a build(deps): bump thiserror from 1.0.15 to 1.0.16
Fixes: kata-containers/runtime#2658
Signed-off-by: Bo Chen <chen.bo@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
The pmem size is now calculated by the hypervisor, so computing it in the
runtime is no longer required. Remove it to simplify the code.
Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Changes:
f5debc4 build(deps): bump libssh2-sys from 0.2.16 to 0.2.17
37dfb4c build(deps): bump hermit-abi from 0.1.11 to 0.1.12
e1a07ce vmm: vm: Unpark the threads before shutdown when the current state is paused
1df38da vmm, tests: Make specifying a size optional for virtio-pmem
7481e4d vmm: config: Validate that shared memory is enabled if using vhost-user
2ac6971 vmm: MemoryManager: Cleanup the usage of std::ffi/io/result
3f42f86 vmm: Add the 'shared' and 'hugepages' controls to MemoryConfig
d6aa717 build(deps): bump syn from 1.0.17 to 1.0.18
3eaeba4 vm-virtio: Fix FS_IO callback for virtio-fs
df14a68 build(deps): bump smallvec from 1.3.0 to 1.4.0
e685854 gh: Separate the build and release jobs
c790bba tests: Migrate from Ubuntu Eoan to Focal
e525af7 build(deps): bump ryu from 1.0.3 to 1.0.4
3e8a6ba ci: Ignore test_snapshot_restore
9ebf052 build(deps): bump cc from 1.0.51 to 1.0.52
f6b150a ci: Add integration test for VM migration
9f08f53 build(deps): bump pin-utils from 0.1.0-alpha.4 to 0.1.0
9c7215d docs: Add the vhost-user-blk test doc
3574437 build(deps): bump cc from 1.0.50 to 1.0.51
4fc75cf vm-virtio: Implement Snapshottable trait for Console
d41ce90 vm-virtio: Implement Snapshottable trait for Pmem
f626bd6 build(deps): bump parking_lot_core from 0.7.1 to 0.7.2
5a380a6 vmm: memory_manager: Support non-power-of-2 block sizes
f8ee89a build(deps): bump arc-swap from 0.4.5 to 0.4.6
49322c5 vm-virtio: Implement the Snapshottable trait for Net
24c2b67 vm-virtio: Improve virtio-net rx queue processing
03dd249 vm-virtio: Restore queues based on used index
cf707da vm-virtio: Extend Queue helpers
c22fd39 vmm: Remove virtio device's userspace mapping on hot-unplug
0a97c25 vmm: Extend MemoryManager to remove userspace mappings
b2de1cd vm-virtio: Implement shutdown() for virtio-fs
fbcf3a7 vm-virtio: Implement userspace_mappings() for virtio-pmem
b035399 vm-virtio: Implement userspace_mappings() for virtio-fs
3fb0a02 vm-virtio: Get userspace mappings from VirtioDevice
8b823e5 build(deps): bump backtrace-sys from 0.1.35 to 0.1.36
c23b488 ci: Factorize virtio-fs hotplug integration tests
f68b08b tests: add integration tests for vm.add-fs route
18f7789 vmm: Add hotplugged virtio devices to the DeviceManager list
c2abadc vmm: Add ability to add virtio-fs device post-boot
bb2139a vmm/api: Add vm.add-fs route
d35e775 vmm: Update KVM userspace mapping when PCI BAR remapping
49cc73a vm-virtio: pci: Make sure to return the correct list of BARs
187b1ee vm-virtio: Implement the Snapshottable trait for Block
a484aa7 vm-virtio: Implement the Snapshottable trait for Rng
ac7178e vmm: Keep migratable devices list as a Vec
b6fdbf7 vm-virtio: Implement Snapshottable trait for MmioDevice
12fec55 vm-virtio: Add helpers to update queue indexes
fd45e94 vm-virtio: Add the ability to serialize a Queue
b7faf4f vhost_user_fs: Add the WRITE_KILL_PRIV write flag.
0870028 vhost_user_fs: Add the IOCTL_COMPAT_32 flag
592cfba vhost_user_fs: Add the EXPLICIT_INVAL_DATA capability flag
621ea83 vhost_user_fs: Add the ZERO_MESSAGE_OPENDIR capability flag
a2830da vhost_user_fs: Add the CACHE_SYMLINKS flag
926a414 vhost_user_fs: Add support for MAX_PAGES
747f31d vhost_user_fs: Add the ABORT_ERROR flag
5eb903a vhost_user_fs: Add support for FOPEN_CACHE_DIR
97e2d5d vhost_user_fs: Add support for CopyFileRange
b8cfdab pci: configuration: Use correct algorithm for BAR size reporting
9bd5ec8 pci, vfio, vm-virtio: Specify a PCI revision ID of 1 for virtio-pci
e7e0e8a vmm, devices: Add firmware debug port device
82d0cdf vhost_user_net: Simplify match values for handle_event()
a517be4 vhost_user_blk: Add multithreaded multiqueue support
13c8283 vhost_user_blk: Make everything private when possible
a31f5f8 vhost_user_blk: Move disk initialization to VhostUserBlkBackend
e78e34b vhost_user_blk: Make DiskFile sharable across threads
808586e vhost_user_blk: Simplify the code by removing VringWorker
ea82632 tests: Enhance test_pmem_hotplug to also unplug device
6389418 tests: Enhance test_disk_hotplug to also unplug device
f9a0445 vmm: vm: Remove device from configuration after unplug
444e5c2 vmm: device_manager: Generalise NoAvailableVfioDeviceName
5bab9c3 vmm: device_manager: Assign ids to pmem/net/disk devices if absent
514491a vmm: device_manager: Support unplugging virtio-pci devices
2fa652a vm-virtio: pci: Add virtio_device() accessor
476e4ce vmm: device_manager: Add virtio-pci devices into id to BDF map
b38470d vmm: config: Add "id" parameter to {Net, Disk, Pmem}Config
1beb62e vmm: vm: Don't panic on kernel load error
Fixes: kata-containers/runtime#2609
Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
The runtime must set up the network before moving itself into the cgroup, otherwise
it won't be able to get the vhost/net queues file descriptors for the
hypervisor.
Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Update the sandbox's device cgroup before hotplugging a device and after it has
been removed from the VM. This way the device cgroup on the host is
fully honoured and the hypervisor has access only to the devices needed
for the sandbox, improving security.
Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
All the code related to HasCRIContainerType is useless and no longer needed,
since the CRIContainerType annotation is no longer considered when deciding
whether to constrain the sandbox.
Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Kata relies on the cgroup parent created and configured by the container
engine, but sometimes the sandbox cgroup is not configured and the container
may have access to all the resources, hence the runtime must constrain the
sandbox and update the list of devices with the devices hotplugged in the
hypervisor.
Fixes: kata-containers/runtime#2605
Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>