Since we now have "unix://" kind of socket returned by the
SocketAddress() function, there is no more need to build the sandbox
storage path dynamically to keep OS compatibility.
Fixes: #2738
Suggested-by: Christophe de Dinechin <dinechin@redhat.com>
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
We now support any container engine CRI compliant. Let's bump the
kata-monitor version to 0.2.0.
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
This commit stops the container engine polling in favor of
the kata sandbox storage path monitoring.
The pod cache list is now refreshed based on fs events and synced with
the container engine only when needed.
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
When the container engine is different than containerd or CRI-O we
lack proper detection of kata workloads and consider all the pods as
kata ones.
Instead of querying the container engine for the lower level runtime
used in each pod, check if a directory matching the pod exists in
the virtualcontainers sandboxes storage path.
This provides a container engine independent way to check for kata pods.
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
Under certain circumstances[0] Kata will attempt to use SHPC hotplug
for PCI devices on the guest. In fact we explicitly enable SHPC on
our PCI to PCI bridges, regardless of the qemu default.
SHPC was designed a long, long time ago for physical hotplugging and
works very poorly for a virtual environment. In particular it has a
mandatory 5s delay to allow a (real, human) operator to back out the
operation if they press a button by mistake. This alone makes it
unusable for a fast start up application like Kata.
Worse, the agent forces a PCI rescan during startup. That will race
with the SHPC hotplug operation causing the device to go into a bad
state where config space can't be accessed from the guest at all.
The only reason we've sort of gotten away with this is that our
default guest kernel configuration triggers what's arguably a kernel
bug effectively disabling SHPC. That makes the agent rescan the only
reason we see the new device.
Now that we require a qemu >=6.1, which includes ACPI PCI hotplug on
the q35 machine, we can explicitly disable SHPC in all cases. It's
nothing but trouble.
fixes#2174
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
qemuArchBase.appendBridges is never actually used, because the bare
qemuArchBase type is itself never used (outside of unit tests). Instead
*all* the subclasses of qemuArchBase override appendBridges() to call
the very similar, but not identical genericAppendBridges. So, we can
remove the qemuArchBase.appendBridges implementation.
Furthermore, all those subclasses override appendBridges() in exactly
the same way, and so we can remove *those* definitions and replace the
base class qemuArchBase appendBridges() with that version, calling
genericAppendBridges().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
There are `DeviceToDeviceCgroup` and `deviceToDeviceCgroup` two functions,
creating a `specs.LinuxDeviceCgroup` object. We clear the new function `deviceToDeviceCgroup`.
Fixes: #2694
Signed-off-by: wangyongchao.bj <wangyongchao.bj@inspur.com>
A discussion on the Linux kernel mailing list [1] exposed that virtiofsd makes a
core assumption that the file systems being shared are not accessible by any
non-privileged user. We currently create the `shared` directory in the sandbox
with the default `0750` permissions, which gives read and directory traversal
access to the group. There is no real good reason for a non-root user to access
the shared directory, and this is potentially dangerous.
Fixes: #2589
[1]: https://lore.kernel.org/linux-fsdevel/YTI+k29AoeGdX13Q@redhat.com/
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
Retrieve the absolute sandbox storage path. We will soon need this to
monitor the creation/deletion of new kata sandboxes.
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
The storage path we use to collect the sandbox files is defined in the
virtcontainers/persist/fs package.
We create the runtime socket in that storage path, by hardcoding the
full path in the SocketAddress() function in the runtime package.
This commit splits the hardcoded path by the socket address path so that
the runtime package will be able to provide the storage path to all the
components that may need it.
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
In order to retrieve the list of sandboxes, we poll the container engine
every 15 seconds via the CRI. Once we have the list we have to inspect
each pod to find out the kata ones.
This commit extend the sandbox cache to keep track of all the pods,
marking the kata ones, so that during the next polling only the new
sandboxes should be inspected to figure out which ones are using the
kata runtime.
Fixes: #2563
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
this is an unexpected event (likely a change in how containerd/cri-o
record the lower level runtime in the pod) and should be more visible:
raise the log level to "warning".
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
Change logger in Trace call in newContainer from sandbox.Logger() to
nil. Passing nil will cause an error to be logged by kataTraceLogger
instead of the sandbox logger, which will avoid having the log message
report it as part of the sandbox subsystem when it is part of the
container subsystem.
The kataTraceLogger will not log it as related to the container
subsystem, but since the container logger has not been created at this
point, and we already use the kataTraceLogger in other instances where a
subsystem's logger has not been created yet, this PR makes the call
consistent with other code.
Fixes#2665
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Call StopTracing with s.rootCtx, which is the root context for tracing,
instead of s.ctx, which is parent to a subset of trace spans.
Fixes#2661
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
A random generated user/group is used to start QEMU VMM process.
The /dev/kvm group owner is also added to the QEMU process to grant it access.
Fixes#2444
Signed-off-by: Feng Wang <feng.wang@databricks.com>
This patch adds the configuration option that allows to use hugepages
with Cloud Hypervisor guests.
Fixes: #2648
Signed-off-by: Bo Chen <chen.bo@intel.com>
We recently updated to using qemu-6.1 (from qemu 5.2). Unfortunately one
breaking change in qemu 6.0 wasn't caught by the CI.
The query-cpus QMP command has been removed, replaced by query-cpus-fast
(which has been available since qemu 2.12). govmm already had support for
query-cpus-fast, we just weren't using it, so the change is quite easy.
fixes#2643
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The new API is based on containerd's cgroups package.
With that conversion we can simpligy the virtcontainers sandbox code and
also uniformize our cgroups external API dependency. We now only depend
on containerd/cgroups for everything cgroups related.
Depends-on: github.com/kata-containers/tests#3805
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Eventually, we will convert the virtcontainers and the whole Kata
runtime code base to only rely on that package.
This will make Kata only depends on the simpler containerd cgroups API.
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>
The only process we are adding there is the container host one, and
there is no such thing anymore.
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>
This is a simplification of the host cgroup handling by partitioning the
host cgroups into 2: A sandbox cgroup and an overhead cgroup.
The sandbox cgroup is always created and initialized. The overhead
cgroup is only available when sandbox_cgroup_only is unset, and is
unconstrained on all controllers. The goal of having an overhead cgroup
is to be more flexible on how we manage a pod overhead. Having such
cgroup will allow for setting a fixed overhead per pod, for a subset of
controllers, while at the same time not having the pod being accounted
for those resources.
When sandbox_cgroup_only is not set, we move all non vCPU threads
to the overhead cgroup and let them run unconstrained. When it is set,
all pod related processes and threads will run in the sandbox cgroup.
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>
Regardless of the sandbox_cgroup_only setting, we create the sandbox
cgroup manager and set the sandbox cgroup path at the same time.
Without doing this, the hypervisor constraint routine is mostly a NOP as
the sandbox state cgroup path is not initialized.
Fixes#2184
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>
Sync the virtcontainers api.md document, add `ConfidentialGuest` `EntropySourceList` `GuestSwap` three
fields to the HypervisorConfig API.
Fixes#2625
Signed-off-by: wangyongchao.bj <wangyongchao.bj@inspur.com>
sync the virtcontainers api.md document, add SandboxBindMounts field to the SandboxConfig API.
And update the order of the SandboxConfig API fields.
Fixes#2621
Signed-off-by: wangyongchao.bj <wangyongchao.bj@inspur.com>